CN102103612A

CN102103612A - Information extraction method and device

Info

Publication number: CN102103612A
Application number: CN2009102430446A
Authority: CN
Inventors: 林欣欣; 徐剑波; 董宁; 王辉
Original assignee: Peking University Founder Group Co Ltd; Beijing Founder Apabi Technology Co Ltd
Current assignee: Peking University Founder Group Co Ltd; Beijing Founder Apabi Technology Co Ltd
Priority date: 2009-12-22
Filing date: 2009-12-22
Publication date: 2011-06-22

Abstract

An embodiment of the invention discloses an information extraction method and device in the technical field of information extraction, intended to solve the prior-art problem that automatic computer indexing cannot extract preset text block information from a newspaper's layout information and manuscript information. The method includes: extracting text block information from a layout file, where the text block information includes layout text block information and manuscript text block information; determining whether the preset layout text block information in the text block information has been extracted; if it has not been extracted, extracting the preset layout text block information; and if it has been extracted, extracting the preset manuscript text block information. The embodiments of the invention reduce the workload of indexing staff and improve indexing accuracy.

Description

A method and device for extracting information

Technical field

The present invention relates to the technical field of information extraction, and in particular to an information extraction method and device.

Background art

With the rapid development of the Internet and information technology, newspaper publishers are racing to carry out digitization projects. In this process, the digitized information of newspaper resources has become a newspaper's core digital asset. This digitized information includes: manuscript information, such as the articles on a newspaper page (body text, paragraphs, headlines and so on) and the text and picture content in tables; and layout information, including the edition, the page name, the date, the position of each manuscript (such as its coordinates), formatting such as the typeface and size of headlines and body text, and the associations between articles and pictures and between pictures and captions.

So that this digitized information can be preserved completely and accurately as historical material for later retrieval, or published across media accurately and in real time through digital media technologies such as news websites, digital newspapers and CD-ROM publishing, indexing software can be used to reverse-parse the newspaper's layout information into a layout file, that is, the digitized information of the newspaper resources described above. The recovered digitized information is then indexed, corrected and proofread.

However, in implementing the present invention, the inventors found at least the following problem in the prior art: the automatic computer indexing used in the prior art cannot extract preset text block information, such as the proofreader's name, the layout designer's name, the author's name or the editor's name, from the newspaper's layout text block information and manuscript text block information. Indexers must therefore index each item by hand, which makes their workload heavy and the accuracy low.

Summary of the invention

Embodiments of the present invention provide an information extraction method and device that automatically extract preset text block information from a newspaper's layout text block information and manuscript text block information.

To achieve this, the embodiments of the present invention adopt the following technical solutions:

In one aspect, an embodiment of the present invention provides an information extraction method, including:

extracting text block information from a layout file, where the text block information includes layout text block information and manuscript text block information;

determining whether the preset layout text block information in the text block information has been extracted;

if the preset layout text block information has not been extracted, extracting the preset layout text block information;

if the preset layout text block information has been extracted, extracting the preset manuscript text block information.

In another aspect, an embodiment of the present invention provides an information extraction device, including:

a text block information extraction unit, configured to extract text block information from a layout file, where the text block information includes layout text block information and manuscript text block information;

a judging unit, configured to determine whether the preset layout text block information in the text block information has been extracted;

a preset layout extraction unit, configured to extract the preset layout text block information if it has not been extracted;

a preset manuscript extraction unit, configured to extract the preset manuscript text block information if the preset layout text block information has been extracted.

In the information extraction method and device provided by the embodiments of the present invention, determining whether the preset layout text block information in the text block information has been extracted prevents the same preset layout text block information from being extracted repeatedly. If the preset layout text block information has not been extracted, it is extracted, which achieves automatic extraction of the preset layout text block information; if it has been extracted, the preset manuscript text block information is extracted, which achieves automatic extraction of the preset manuscript text block information.

Brief description of the drawings

FIG. 1 is a flowchart of an information extraction method provided by an embodiment of the present invention;

FIG. 2 is a flowchart of a specific implementation of an information extraction method provided by an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of an information extraction device provided by an embodiment of the present invention.

Detailed description

The information extraction method and device provided by the embodiments of the present invention are described in detail below with reference to the accompanying drawings.

FIG. 1 shows an information extraction method provided by an embodiment of the present invention. The method is implemented as follows:

101: Extract text block information from the layout file, where the text block information includes layout text block information and manuscript text block information. The layout file can be understood as the digitized information that indexing software reverse-parses from one page of a newspaper; extracting text block information from the layout file therefore means extracting it from the digitized information of that newspaper page.

102: Determine whether the preset layout text block information in the text block information has been extracted;

103: If the preset layout text block information has not been extracted, extract the preset layout text block information;

104: If the preset layout text block information has been extracted, extract the preset manuscript text block information.
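The branching in steps 102 to 104 can be illustrated with a minimal Python sketch. The names PageState and run_pass, and the two extractor callbacks, are hypothetical and do not come from the source; the sketch only shows how an "extracted" flag selects between the layout branch and the manuscript branch.

```python
from dataclasses import dataclass, field


@dataclass
class PageState:
    """Hypothetical container for the text blocks of one layout file (step 101)."""
    layout_blocks: list
    manuscript_blocks: list
    layout_extracted: bool = False            # flag tested in step 102
    results: dict = field(default_factory=dict)


def run_pass(state, extract_layout, extract_manuscript):
    """Perform one pass of steps 102-104 and report which branch ran."""
    if not state.layout_extracted:            # step 102: not yet extracted
        state.results["layout"] = extract_layout(state.layout_blocks)   # step 103
        state.layout_extracted = True
        return "layout"
    # Step 104: the layout information is already extracted.
    state.results["manuscript"] = extract_manuscript(state.manuscript_blocks)
    return "manuscript"
```

Calling run_pass twice on the same state runs the layout branch first and the manuscript branch second, so the layout information is never extracted twice.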

Building on the above embodiment, FIG. 2 shows a flowchart of a specific implementation of the information extraction method provided by an embodiment of the present invention. To extract a given kind of preset layout text block information and preset manuscript text block information, the following process is carried out:

201: Set the regular expression matching rules for the preset layout text block information, the regular expression matching rules for the preset manuscript text block information, and the feature information of the preset manuscript text block information. Both sets of matching rules can be written as regular expressions, and the feature information of the preset manuscript text block information may include font information and position information. The layout rules are used to extract the preset layout text block information from the text block information, and the manuscript rules are used to extract the preset manuscript text block information from it. To obtain the preset manuscript text block information more accurately, the feature information can first be used to narrow the range in which it is matched, and the matching is then performed within that range.

202: Extract text block information from the layout file, where the text block information includes layout text block information and manuscript text block information;

203: Determine whether the preset layout text block information in the text block information has been extracted;

204: If the preset layout text block information has not been extracted, extract the preset layout text block information. This is implemented as follows:

S11: If the preset layout text block information has not been extracted, obtain its regular expression matching rules and, according to those rules, extract the preset layout text block information from the layout text block information. The preset layout text block information may be, for example, the editor's name, the proofreader's name or the layout designer's name in the layout information; its matching rules can be set according to the specific layout text block information that needs to be extracted.

S12: Set the extraction flag of the preset layout text block information to the extracted state.
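Steps S11 and S12 can be sketched in Python as follows. The source does not list the layout rules themselves, so EDITOR_RULE is a hypothetical rule for an editor's name, and the function name, the flags dictionary and the sample blocks are likewise assumptions made for illustration.

```python
import re

# Hypothetical layout rule: the source does not give the layout rules themselves.
EDITOR_RULE = re.compile(r"编辑[:：\s]*([\u4e00-\u9fa5]{2,4})")


def extract_layout_field(layout_blocks, rule, flags, key):
    """S11: apply the rule to the layout text blocks; S12: set the extracted flag."""
    if flags.get(key):
        return None                      # already extracted, so skip (avoids duplicates)
    value = None
    for block in layout_blocks:
        match = rule.search(block)
        if match:
            value = match.group(1)       # the name captured after the keyword
            break
    flags[key] = True                    # S12, set unconditionally as the text states it
    return value
```

With the blocks ["第03版 要闻", "编辑：张三 校对：李四"], the first call returns the name after the keyword and sets the flag; a second call with the same flags returns None without matching again.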

Note that, to ensure the preset layout text block information is extracted accurately, the following operations may also be performed on it after extraction.

S13: Verify the preset layout text block information and produce a verification result. The verification proceeds as follows. Suppose the preset layout text block information is the editor's name in the layout text block information. The extracted editor's name is matched against the names in a pre-stored library of editor names. If the library contains the name, the extracted preset layout text block information is considered correct, that is, the verification result is 100% correct. If the extracted name only partially matches a name in the library, or does not match at all, a correctness rate is assigned according to the match status, that is, the verification result is 50% correct or 0% correct.

S14: Mark the verified preset layout text block information according to the verification result. For example, information that is 100% correct is marked white, information that is 50% correct is marked yellow, and information that is 0% correct is marked red.
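The verification and marking in S13 and S14 can be sketched as below. The source does not define what counts as a partial match; this sketch assumes that one name containing the other is a partial match, and the function names are hypothetical.

```python
COLOR_BY_SCORE = {100: "white", 50: "yellow", 0: "red"}   # S14 colour scheme


def verify_name(name, name_library):
    """S13: score an extracted name against a stored name library (100, 50 or 0)."""
    if name in name_library:
        return 100
    # Assumption: "partial match" means one name contains the other.
    if name and any(known and (name in known or known in name)
                    for known in name_library):
        return 50
    return 0


def mark(name, name_library):
    """S14: map the verification score to a display colour."""
    return COLOR_BY_SCORE[verify_name(name, name_library)]
```

A fuzzier comparison, such as an edit-distance threshold, could replace the containment test without changing the three-way scoring.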

205: If the preset layout text block information has been extracted, extract the preset manuscript text block information. This may be implemented as follows:

If the preset layout text block information has been extracted, obtain the regular expression matching rules of the preset manuscript text block information and, according to those rules, extract the preset manuscript text block information from the layout text block information.

To extract the preset layout text block information more accurately, the embodiment of the present invention may also extract the preset manuscript text block information through the following process, in which the preset manuscript text block information to be extracted is assumed to be the author's name:

S21: When the feature information of the preset manuscript text block information includes font information, and the preset layout text block information has been extracted, obtain the preset manuscript text block information set according to that font information. For example, suppose the font information is the Heiti typeface (黑体). Then, once the preset layout text block information has been extracted, every text block in the manuscript text block information whose font is Heiti is extracted, and the extracted blocks are combined into the preset manuscript text block information set {T}.
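The font-based narrowing in S21 amounts to a filter over the manuscript text blocks. In the sketch below, the TextBlock type and its two fields are assumptions; the source only says that each text block has font information.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """Hypothetical manuscript text block: its text plus the font it is set in."""
    text: str
    font: str


def collect_by_font(manuscript_blocks, font):
    """S21: keep only the blocks set in the configured font, forming the set {T}."""
    return [block for block in manuscript_blocks if block.font == font]
```

For example, collect_by_font(blocks, "黑体") keeps only the blocks set in Heiti, which are the candidates for the later regular expression matching.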

To obtain the preset manuscript text block information still more accurately, the embodiment of the present invention may further narrow the range by including position information in the feature information. After the preset manuscript text block information set {T} has been obtained, the following operations are performed:

S22: When the feature information of the preset manuscript text block information also includes position information, preprocess the preset manuscript text block information set to obtain the preset manuscript text block information sets {Ts} and {Te}. For example, the position information may be: Ps, the span from the beginning of the content of the set to the first occurrence of a reference symbol; and/or Pe, the span between the last occurrence of a reference symbol and the end of the content of the set.

Preprocessing the set {T} may specifically include handling mismatched parentheses in the content T to be extracted, which can arise from inconsistent font descriptions within {T}.

S23: According to the position information, extract the subset {A} of the preset manuscript text block information from the set {T}. Specifically, first use the position information Ps to extract the corresponding information a1 from the set {Ts}. If a1 is extracted, a1 is taken as the subset {A}; if a1 is not extracted, use the position information Pe to extract the corresponding information a2 from the set {Te}, and take a2 as the subset {A}.
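The head-first, tail-second selection in S22 and S23 can be sketched as follows. The source does not say what the reference symbol is, so it is passed in as a parameter; the function names and the extract callback are hypothetical, and the parenthesis preprocessing is omitted.

```python
def head_segment(text, ref):
    """Ps: the span from the start of the text to the first reference symbol."""
    i = text.find(ref)
    return text if i == -1 else text[:i]


def tail_segment(text, ref):
    """Pe: the span from the last reference symbol to the end of the text."""
    i = text.rfind(ref)
    return text if i == -1 else text[i + len(ref):]


def select_subset(blocks, ref, extract):
    """S23: try the head segments {Ts} first and fall back to the tail segments {Te}."""
    ts = [head_segment(t, ref) for t in blocks]
    te = [tail_segment(t, ref) for t in blocks]
    a1 = [hit for seg in ts for hit in extract(seg)]
    if a1:
        return a1                        # a1 becomes the subset {A}
    return [hit for seg in te for hit in extract(seg)]   # otherwise a2 becomes {A}
```

The design reflects the observation that a byline usually sits at the very start or the very end of an article, so only those two spans are searched.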

S24: According to the configured regular expression matching rules of the preset manuscript text block information, extract the preset manuscript text block information from the subset of the preset manuscript text block information. Suppose the rules have 4 matching levels: level 1 has 3 rules, level 2 has 3 rules, level 3 has 2 rules and level 4 has 1 rule. The separator is a comma or a semicolon. The rules of all levels together form a matching set. This step may specifically include:

Obtain from the matching set, level by level, the regular expression matching rules of each matching level; each rule is written as a regular expression. Specifically:

First, obtain the 3 regular expression matching rules of matching level 1 from the matching set. The rules are as follows:

Rule 1 may be: /\(.*?(记者|记者组|作者|实习生|通讯员|文\/摄|\/摄|文\/图|插图|漫画|制图|实习记者|\/文|评论员|点评).*?\)/g;

This regular expression searches the full text for "(", followed by zero or more characters other than a newline, followed by any one of "记者" (reporter), "记者组" (reporting team), "作者" (author), "实习生" (intern), "通讯员" (correspondent), "文/摄" (text/photo), "/摄" (/photo), "文/图" (text/picture), "插图" (illustration), "漫画" (cartoon), "制图" (graphics), "实习记者" (trainee reporter), "/文" (/text), "评论员" (commentator) or "点评" (comment), followed by zero or more characters other than a newline, followed by ")".

Rule 2 may be: /\(\s*([\u4e00-\u9fa5]{2,5}\s+[\u4e00-\u9fa5]{2,5}\s*)+\s*\)/g;

This regular expression searches the full text for "(", followed by zero or more whitespace characters, followed by one or more repetitions of a group consisting of 2 to 5 Chinese characters, one or more whitespace characters, 2 to 5 Chinese characters and zero or more whitespace characters, followed by zero or more whitespace characters, followed by ")".

Rule 3 may be: /(记者|记者组|作者|实习生|通讯员|实习记者|评论员|制图|漫画|插图|撰稿)(:|\s|\/)*[\u4e00-\u9fa5]{2,6}\s*(?=($|\n|\/摄|\/文|发自|综合报道|文\/摄|\/画|文并摄|摄影报道|\/绘图|整理|摘录整合|摄|[\u4e00-\u9fa5]{2,5}专电|摄影|文\/图|报道|采写|$本报[\u4e00-\u9fa5]*电$|本版[\u4e00-\u9fa5]*))/g;

In this expression, "(:|\s|\/)*[\u4e00-\u9fa5]{2,6}\s*" matches ":" or a whitespace character or "/" zero or more times, then 2 to 6 Chinese characters, then zero or more whitespace characters;

"(?=" opens a lookahead assertion on the text that must follow the match (the suffix);

"($|\n|\/摄|\/文|发自|综合报道|文\/摄|\/画|文并摄|摄影报道|\/绘图|整理|摘录整合|摄|[\u4e00-\u9fa5]{2,5}专电|摄影|文\/图|报道|采写|$本报[\u4e00-\u9fa5]*电$|本版[\u4e00-\u9fa5]*)" is the suffix content. The match position must be followed by the end of the string or a newline, or by any of the strings "/摄", "/文", "发自" (reporting from), "综合报道" (combined report), "文/摄", "/画" (/drawing), "文并摄" (text and photo), "摄影报道" (photo report), "/绘图" (/illustration), "整理" (compiled), "摘录整合" (excerpted and compiled), "摄" (photo), "摄影" (photography), "文/图", "报道" (report) or "采写" (reported and written), or by 2 to 5 Chinese characters followed by "专电" (special dispatch), or by "本报" (this newspaper) followed by zero or more Chinese characters and then "电" (dispatch), or by "本版" (this page) followed by zero or more Chinese characters;

"/g" means that all matches in the full text are found.
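The level 1 rules are written as JavaScript-style regular expression literals. As an illustration only, rules 1 and 2 can be transcribed to Python's re module as below; the global flag "/g" has no direct counterpart there, so the sketch uses finditer to collect every match. The constant and function names are assumptions, and "/" needs no escaping in a Python pattern.

```python
import re

# Level 1, rule 1: a role keyword anywhere inside a pair of parentheses.
RULE_1 = re.compile(
    r"\(.*?(记者|记者组|作者|实习生|通讯员|文/摄|/摄|文/图|插图|漫画|制图"
    r"|实习记者|/文|评论员|点评).*?\)"
)
# Level 1, rule 2: whitespace-separated Chinese words inside parentheses.
RULE_2 = re.compile(r"\(\s*([\u4e00-\u9fa5]{2,5}\s+[\u4e00-\u9fa5]{2,5}\s*)+\s*\)")


def find_all(rule, text):
    """Return every full match of the rule in the text (the effect of "/g")."""
    return [m.group(0) for m in rule.finditer(text)]
```

On the sample text "本报讯(记者 王一)昨日，大会开幕。(张三 李四)", rule 1 finds only the parenthesised group that contains a role keyword, while rule 2 finds both parenthesised groups.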

Second, obtain the 3 regular expression matching rules of matching level 2 from the matching set. The rules are as follows:

Rule 1: /(\.|,|\?|!|^|\r|\n|\/摄|文\/摄|\/画|文并摄|\/绘图|文\/图|\/文字整理|\/实习生|\/文)\s*[\u4e00-\u9fa5]{2,4}\s*(?=((摄[\n$])|\/文|文\/摄|\/画|文并摄|\/绘图|摄影|文\/图|\/文字整理|\/实习生))/g;

In this expression, "(\.|,|\?|!|^|\r|\n|\/摄|文\/摄|\/画|文并摄|\/绘图|文\/图|\/文字整理|\/实习生|\/文)" matches ".", ",", "?", "!", the start of the string, "\r", "\n", "/摄", "文/摄", "/画", "文并摄", "/绘图", "文/图", "/文字整理" (/text compilation), "/实习生" or "/文";

"\s*[\u4e00-\u9fa5]{2,4}\s*" matches zero or more whitespace characters, then 2 to 4 Chinese characters, then zero or more whitespace characters;

Rule 2: /(记者|记者组|作者|实习生|通讯员|实习记者|评论员)(\s*|\/)+[\u4e00-\u9fa5]{2,4}(\s+[\u4e00-\u9fa5]{2,6}){1,}\s*(?=($|\n|\/摄|发自|综合报道|文\/摄|\/画|文并摄|摄影报道|报道摄影|\/绘图|整理|摄|[\u4e00-\u9fa5]{2,5}专电|摄影|文\/图|报道|采写|))/g;

In this expression, "(\s*|\/)" matches zero or more whitespace characters, or "/"; the "+" that follows repeats "(\s*|\/)" one or more times;

"[\u4e00-\u9fa5]{2,4}" matches 2 to 4 Chinese characters;

"(\s+[\u4e00-\u9fa5]{2,6}){1,}" reads as follows: "(\s+[\u4e00-\u9fa5]{2,6})" matches one or more whitespace characters followed by 2 to 6 Chinese characters, and "{1,}" repeats that group one or more times;

"\s*" matches zero or more whitespace characters;

"($|\n|\/摄|发自|综合报道|文\/摄|\/画|文并摄|摄影报道|报道摄影|\/绘图|整理|摄|[\u4e00-\u9fa5]{2,5}专电|摄影|文\/图|报道|采写|)" is the suffix content described above. The match position must be followed by the end of the string, a newline, "/摄", "发自", "综合报道", "文/摄", "/画", "文并摄", "摄影报道", "报道摄影" (report photography), "/绘图", "整理", "摄", 2 to 5 Chinese characters followed by "专电", "摄影", "文/图", "报道" or "采写"; the final ")" closes the suffix;

Rule 3: /(◆|●|□|◎)\s*.*(?=($|\r|\n))/g;

In this expression, "(◆|●|□|◎)" matches "◆", "●", "□" or "◎";

"\s*.*" matches zero or more whitespace characters, then zero or more characters other than a newline;

"($|\r|\n)" is the suffix content: it matches the end of the string, a carriage return or a line feed; the final ")" closes the suffix;

Third, obtain the 2 regular expression matching rules of matching level 3 from the matching set. The rules are as follows:

Rule 1: /\(\s*[\u4e00-\u9fa5]{2,4}(\s+[\u4e00-\u9fa5]{2,6}){1,}\s*\)/g;

In this expression, "\s*[\u4e00-\u9fa5]{2,4}" matches zero or more whitespace characters, then 2 to 4 Chinese characters;

"(\s+[\u4e00-\u9fa5]{2,6})" matches one or more whitespace characters, then 2 to 6 Chinese characters;

"{1,}" repeats "(\s+[\u4e00-\u9fa5]{2,6})" one or more times;

"/g" means that all matches in the full text are found;

Rule 2: re=/\(\s*[\u4e00-\u9fa5]{2,4}\s*\)/g;

This regular expression matches "(", then zero or more whitespace characters, then 2 to 4 Chinese characters, then zero or more whitespace characters, then ")";

"/g" means that all matches in the full text are found;

Finally, obtain the 1 regular expression matching rule of matching level 4 from the matching set. The rule is as follows:

Rule 1: /(\s+|^|\?|\.|!)[\u4e00-\u9fa5]{2,4}\s*(?=((摄[\n$])|\/摄|\/文|文\/摄|\/画|文并摄|\s摄|\/绘图|摄影|文\/图|\/文字整理|\/实习生))/g;

In this expression, "(\s+|^|\?|\.|!)" matches one or more whitespace characters, or the start of the string, or "?", "." or "!";

"[\u4e00-\u9fa5]{2,4}\s*" matches 2 to 4 Chinese characters, then zero or more whitespace characters;

"/g" means that all matches in the full text are found.

According to the obtained regular expression matching rules, content matching is performed on the content of the subset of the preset manuscript text block information, and a matching result is produced. For example, the 3 rules of matching level 1 are matched against the content of the subset, which extracts "作者，王一" (author, Wang Yi), and this is added to the set {B}. The 3 rules of matching level 2 are then matched against the content of the subset, and nothing is extracted. Next, the 2 rules of matching level 3 are matched against the content of the subset, which extracts "通讯员，赵二" (correspondent, Zhao Er), and this is added to the set {B}. Finally, the 1 rule of matching level 4 is matched against the content of the subset, which extracts "编辑张三" (editor Zhang San), and this is added to the set {B}. The set {B} is then {author, Wang Yi, correspondent, Zhao Er, editor Zhang San}.
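The level-by-level matching can be sketched in Python as a nested loop over the matching set. This is a reduced illustration, not the full rule set: MATCH_SET holds one rule per level with shortened keyword and suffix lists, the function name is hypothetical, and the sample texts in the usage note are invented for the sketch.

```python
import re

CJK = r"[\u4e00-\u9fa5]"

# One rule per level for brevity; keyword and suffix lists are shortened.
MATCH_SET = [
    [re.compile(r"\(.*?(记者|作者|通讯员).*?\)")],            # from level 1, rule 1
    [re.compile(r"\(\s*" + CJK + r"{2,4}\s*\)")],             # from level 3, rule 2
    [re.compile(r"(\s+|^|\?|\.|!)" + CJK + r"{2,4}\s*(?=(/摄|/文|文/摄))")],  # level 4
]


def match_by_level(subset, match_set):
    """Apply each level's rules in order and collect the matches into the set {B}."""
    found = []
    for rules in match_set:              # levels are visited in ascending order
        for rule in rules:
            for text in subset:
                for m in rule.finditer(text):
                    hit = m.group(0).strip()
                    if hit and hit not in found:
                        found.append(hit)
    return found
```

For the subset ["本报讯(记者 王一)昨日开幕", "(赵二)", "张三/摄"], the first level contributes "(记者 王一)", the second "(赵二)" and the third "张三". Every level is applied even after an earlier level has matched, which mirrors the example in the text where levels 1, 3 and 4 each add to {B}.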

Once the set {B} = {author, Wang Yi, correspondent, Zhao Er, editor Zhang San} has been obtained, the matching results can additionally be keyword-filtered according to the corresponding filtering rules: the author name "Wang Yi" is extracted into the author set {B1}, the correspondent name "Zhao Er" into the correspondent name set {B2}, and the editor name "Zhang San" into the editor name set {B3}. Keyword filtering removes the keywords themselves, such as "author", "editor" and "correspondent".

Note that a result obtained by keyword filtering may contain several names separated by specific punctuation marks (such as commas or semicolons), for example {Wang Yi, Zhao Er, Zhang San}, so the result sets need a further extraction pass. Using those punctuation marks as separators, each string is split into multiple results: for example, "Wang Yi" is added to result set {A1}, "Zhao Er" to result set {A2}, and "Zhang San" to result set {A3}.
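Keyword filtering followed by splitting on separators can be sketched as below. The keyword list, the separator set and both function names are assumptions made for illustration; the patent leaves them to configuration.

```python
import re

KEYWORDS = ["通讯员", "作者", "编辑"]  # illustrative: correspondent, author, editor
SEPARATORS = r"[,，;；、\s]+"          # assumed separator set

def strip_keywords(item):
    """Remove role keywords and any colon that directly follows them."""
    for kw in KEYWORDS:
        item = re.sub(kw + r"[:：]?", "", item)
    return item.strip()

def split_names(item):
    """Split on the separators and drop empty pieces."""
    return [part for part in re.split(SEPARATORS, item) if part]

matched = ["作者：王一，赵二", "编辑张三"]
result_sets = [split_names(strip_keywords(m)) for m in matched]
print(result_sets)
```

Each inner list corresponds to one of the result sets {A1}, {A2}, ... in the text above.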

Note that the matching levels can be tuned to their optimal values on the basis of experimental statistics.

Every regular expression matching rule is written as a regular expression composed of multiple keywords. See the relevant parameter descriptions for details; the rules can also be configured for each specific case. Each regular expression matching rule corresponds to one keyword replacement rule. Setting rules at multiple levels allows as many authors as possible to be extracted; here, the authors include reporters, correspondents, photographers, news editors, interns, text organizers, commentators, and so on.

S25: Reprocess the subset of preset manuscript text block information. This step may be implemented as follows: merge the result sets {A1}, {A2}, {A3} ... {An} into the result set {A}; then deduplicate {A} and apply a second keyword filtering pass to catch keywords missed earlier. Specifically, information items with identical content are removed from {A}, and {A} is keyword-filtered again.
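Step S25 can be sketched as a merge that deduplicates while preserving order and strips any keyword that survived the first filter. The function name and the keyword tuple are hypothetical.

```python
def merge_and_dedupe(result_sets, keywords=("作者", "通讯员", "编辑")):
    """Merge {A1}..{An} into {A}, re-filter keywords, and drop duplicates."""
    merged = []
    seen = set()
    for names in result_sets:
        for name in names:
            # Second keyword filter: remove keywords missed in the first pass.
            for kw in keywords:
                name = name.replace(kw, "")
            name = name.strip()
            # Deduplication: keep only the first occurrence of each name.
            if name and name not in seen:
                seen.add(name)
                merged.append(name)
    return merged

A = merge_and_dedupe([["王一", "赵二"], ["赵二"], ["编辑张三"]])
print(A)
```

Filtering before deduplication matters: "编辑张三" and "张三" would otherwise be treated as two different authors.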

S26: Extract the preset manuscript text block information from the reprocessed subset of preset manuscript text block information.

It should be noted that the method further includes:

S27: Verify the preset manuscript text block information and give a verification result. The verification may use pre-stored dictionary information to estimate the accuracy of the extracted preset manuscript text block information, that is, the author name set {A}, as follows:

Step 1: Take each author in {A} in turn and look it up in the pre-built author name dictionary. If all of them are found, the author set {A} is marked with an accuracy of 100%. If only some are found, or none is found, the author set {A} is marked with an accuracy of 60% or 0 respectively.

Step 2: Prepare a Chinese surname dictionary with 95% coverage, and recompute the accuracy of every author set whose accuracy is not 100%. Take the first character of the author string and look it up in the surname dictionary; if it is found, raise the accuracy. If it is not found, take the first two characters of the author string and look them up; if they are found, raise the accuracy, otherwise lower it.

S28: Mark the verified preset manuscript text block information according to the verification result.
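The check in S27 and the marking in S28 can be sketched as follows. Several details are assumptions, because the text does not fix them: both dictionaries are tiny hypothetical samples, the size of each adjustment (`STEP`) is invented, the surname check is applied per author not found in the name dictionary, and the score is clamped to the range 0 to 100.

```python
KNOWN_AUTHORS = {"王一", "赵二"}        # hypothetical author name dictionary
SURNAMES = {"王", "赵", "张", "欧阳"}    # hypothetical surname dictionary
STEP = 10                                # assumed adjustment; not given in the source

def base_score(authors):
    """Step 1: 100 if all authors are known, 60 if some are, 0 if none are."""
    if not authors:
        return 0
    hits = sum(1 for a in authors if a in KNOWN_AUTHORS)
    if hits == len(authors):
        return 100
    return 60 if hits > 0 else 0

def has_known_surname(author):
    """Step 2: try a one-character surname, then a two-character surname."""
    return author[:1] in SURNAMES or author[:2] in SURNAMES

def score(authors):
    s = base_score(authors)
    if s == 100:
        return s
    for a in authors:
        if a not in KNOWN_AUTHORS:
            s += STEP if has_known_surname(a) else -STEP
    return max(0, min(100, s))

def mark(authors):
    """S28: attach the verification result to the extracted author set."""
    return {"authors": authors, "accuracy": score(authors)}

print(mark(["王一", "张三"]))
```

Checking two characters after one character handles compound surnames such as "欧阳", which a single-character lookup would miss.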

As shown in Figure 3, an embodiment of the present invention provides an information extraction device, which includes:

A text block information extraction unit 301, configured to extract text block information from a layout file, wherein the text block information includes layout text block information and manuscript text block information;

A judging unit 302, configured to judge whether the preset layout text block information in the text block information has been extracted;

A preset layout extraction unit 303, configured to extract the preset layout text block information if it has not been extracted;

A preset manuscript extraction unit 304, configured to extract the preset manuscript text block information if the preset layout text block information has been extracted.
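The control flow shared by units 302 to 304 can be sketched as below. This is only an illustration under stated assumptions: the two rules are hypothetical, each text block is handled by exactly one branch, and the extracted flag is set only once the layout rule actually matches.

```python
import re
from dataclasses import dataclass, field

@dataclass
class ExtractionState:
    layout_extracted: bool = False              # flag consulted by judging unit 302
    layout_info: list = field(default_factory=list)
    manuscript_info: list = field(default_factory=list)

LAYOUT_RULE = re.compile(r"第\d+版")                               # hypothetical layout rule
MANUSCRIPT_RULE = re.compile(r"作者[:：]\s*[\u4e00-\u9fa5]{2,4}")  # hypothetical manuscript rule

def process(blocks, state):
    for block in blocks:
        if not state.layout_extracted:
            # Unit 303: extract the preset layout information, then set the flag.
            found = LAYOUT_RULE.findall(block)
            if found:
                state.layout_info.extend(found)
                state.layout_extracted = True
        else:
            # Unit 304: layout information is already extracted, so extract manuscript information.
            state.manuscript_info.extend(MANUSCRIPT_RULE.findall(block))
    return state

state = process(["第3版 要闻", "作者：王一"], ExtractionState())
print(state)
```

Because the flag is checked before every block, the same layout information is never extracted twice.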

It should be noted that the device further includes:

A setting unit, configured to set the regular expression matching rule of the preset layout text block information, the regular expression matching rule of the preset manuscript text block information, and the feature information of the preset manuscript text block information.

It should be noted that the preset layout extraction unit 303 includes:

A rule acquisition subunit, configured to acquire the regular expression matching rule of the preset layout text block information;

A preset layout extraction subunit, configured to extract the preset layout text block information from the layout text block information according to the regular expression matching rule of the preset layout text block information;

A flag setting subunit, configured to set the extraction flag of the preset layout text block information to the extracted state.

It should also be noted that the preset layout extraction unit 303 further includes:

A verification subunit, configured to verify the preset layout text block information and give a verification result;

A marking subunit, configured to mark the verified preset layout text block information according to the verification result.

It should also be noted that the preset manuscript extraction unit 304 is further configured to acquire the regular expression matching rule of the preset manuscript text block information, and to extract the preset manuscript text block information from the layout text block information according to that rule; or,

When the feature information of the preset manuscript text block information includes font information and position information, the preset manuscript extraction unit 304 is further configured to: acquire a preset manuscript text block information set according to the font information of the preset manuscript text block information; preprocess the preset manuscript text block information set; extract a subset of the preset manuscript text block information from the set according to the position information; and extract the preset manuscript text block information from that subset according to the configured regular expression matching rule of the preset manuscript text block information.
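The font-then-position narrowing performed by unit 304 can be sketched as a small pipeline. The `TextBlock` fields, the font names, the `y_max` threshold, the preprocessing step (trimming whitespace and dropping empty blocks) and the author rule are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass

@dataclass
class TextBlock:
    text: str
    font: str
    y: float  # assumed vertical offset from the top of the article

AUTHOR_RULE = re.compile(r"作者[:：]\s*([\u4e00-\u9fa5]{2,4})")  # hypothetical rule

def extract_authors(blocks, font, y_max):
    # 1. Font filter: keep blocks set in the font configured for the target information.
    candidates = [b for b in blocks if b.font == font]
    # 2. Preprocessing (assumed): trim whitespace and drop empty blocks.
    candidates = [TextBlock(b.text.strip(), b.font, b.y)
                  for b in candidates if b.text.strip()]
    # 3. Position filter: keep only the subset inside the assumed region.
    subset = [b for b in candidates if b.y <= y_max]
    # 4. Regular expression matching on the reduced subset.
    return [name for b in subset for name in AUTHOR_RULE.findall(b.text)]

blocks = [
    TextBlock("作者：王一", "KaiTi", 12.0),
    TextBlock("作者：李四", "SongTi", 14.0),  # wrong font, filtered out
    TextBlock("作者：赵二", "KaiTi", 480.0),  # outside the region, filtered out
    TextBlock("   ", "KaiTi", 5.0),           # empty after trimming
]
authors = extract_authors(blocks, font="KaiTi", y_max=100.0)
print(authors)
```

Running the regular expression only on the reduced subset is what keeps body text that happens to contain a keyword from being mistaken for a byline.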

It should also be noted that the preset manuscript extraction unit 304 includes:

An information reprocessing subunit, configured to reprocess the subset of the preset manuscript text block information;

A preset manuscript extraction subunit, configured to extract the preset manuscript text block information from the reprocessed subset of the preset manuscript text block information.

It should also be noted that the preset manuscript extraction unit 304 further includes:

A verification subunit, configured to verify the preset manuscript text block information and give a verification result;

A marking subunit, configured to mark the verified preset manuscript text block information according to the verification result.

In the information extraction method and device provided by the embodiments of the present invention, judging whether the preset layout text block information in the text block information has been extracted prevents the same preset layout text block information from being extracted repeatedly. If the preset layout text block information has not been extracted, it is extracted, which automates the extraction of preset layout text block information; if it has already been extracted, the preset manuscript text block information is extracted, which automates the extraction of preset manuscript text block information. Compared with the prior art, the embodiments not only extract the preset layout text block information and the preset manuscript text block information automatically, but can also compare the extracted information against pre-stored library information. This improves the accuracy of the extracted information and greatly reduces the workload of indexing personnel. In addition, when extracting the preset manuscript text block information, the invention uses feature information to narrow the range from which that information is extracted, which further improves extraction accuracy.

From the description of the above embodiments, a person of ordinary skill in the art will understand that all or part of the steps of the method embodiments above can be carried out by a program that instructs the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments above. The storage medium may be, for example, a ROM/RAM, a magnetic disk, or an optical disc.

The above is only a specific embodiment of the present invention, and the protection scope of the present invention is not limited to it. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. The protection scope of the present invention shall therefore be determined by the protection scope of the claims.

Claims

1. An information extraction method, comprising:

extracting text block information from a layout file, wherein the text block information comprises: layout text block information and manuscript text block information;

judging whether preset layout text block information in the text block information has been extracted;

if the preset layout text block information has not been extracted, extracting the preset layout text block information;

and if the preset layout text block information has been extracted, extracting preset manuscript text block information.

2. The information extraction method according to claim 1, characterized by further comprising:

setting a regular expression matching rule of the preset layout text block information, a regular expression matching rule of the preset manuscript text block information, and feature information of the preset manuscript text block information.

3. The information extraction method according to claim 2, wherein the step of extracting the preset layout text block information includes:

acquiring a regular expression matching rule of the preset layout text block information;

extracting the preset layout text block information from the layout text block information according to the regular expression matching rule of the preset layout text block information;

and setting the extraction identification of the preset layout text block information to be in an extracted state.

4. The information extraction method according to claim 3, wherein the step of extracting the preset layout text block information further comprises:

checking the preset layout text block information and giving a checking result;

and marking the verified preset layout text block information according to the verification result.

5. The information extraction method according to any one of claims 2 to 4, wherein the step of extracting the preset manuscript text block information comprises:

acquiring a regular expression matching rule of the preset manuscript text block information;

and extracting the preset manuscript text block information from the layout text block information according to the regular expression matching rule of the preset manuscript text block information.

6. The information extraction method according to any one of claims 2 to 4, wherein when the feature information of the preset manuscript text block information includes font information, the step of extracting the preset manuscript text block information further comprises:

acquiring a preset manuscript text block information set according to the font information of the preset manuscript text block information.

7. The information extraction method according to claim 6, wherein when the feature information of the preset manuscript text block information further includes position information, the step of extracting the preset manuscript text block information further comprises:

preprocessing the preset manuscript text block information set;

extracting a subset of the preset manuscript text block information from the preset manuscript text block information set according to the position information;

and extracting the preset manuscript text block information from the subset of the preset manuscript text block information according to the regular expression matching rule of the preset manuscript text block information.

8. The information extraction method according to claim 7, wherein the step of extracting the preset manuscript text block information from the subset of the preset manuscript text block information according to a regular expression matching rule of the preset manuscript text block information comprises:

carrying out information reprocessing on the subset of the preset manuscript text block information;

and extracting the preset manuscript text block information from the reprocessed subset of the preset manuscript text block information.

9. The information extraction method according to claim 8, wherein the step of extracting the preset manuscript text block information further comprises:

checking the preset manuscript text block information and giving a checking result;

and marking the verified preset manuscript text block information according to the verification result.

10. An information extraction apparatus characterized by comprising:

a text block information extracting unit, configured to extract text block information from a layout file, wherein the text block information includes: layout text block information and manuscript text block information;

a judging unit, configured to judge whether preset layout text block information in the text block information has been extracted;

a preset layout extracting unit, configured to extract the preset layout text block information if the preset layout text block information has not been extracted;

and a preset manuscript extracting unit, configured to extract preset manuscript text block information if the preset layout text block information has been extracted.

11. The information extraction apparatus according to claim 10, characterized by further comprising:

a setting unit, configured to set a regular expression matching rule of the preset layout text block information, a regular expression matching rule of the preset manuscript text block information, and feature information of the preset manuscript text block information.

12. The information extraction apparatus according to claim 11, wherein the preset layout extraction unit includes:

a rule obtaining subunit, configured to obtain a regular expression matching rule of the preset layout text block information;

a preset layout extracting subunit, configured to extract the preset layout text block information from the layout text block information according to a regular expression matching rule of the preset layout text block information;

and the mark setting subunit is used for setting the extraction mark of the preset layout text block information to be in an extracted state.

13. The information extraction apparatus according to claim 12, wherein the preset layout extraction unit further includes:

the checking subunit is used for checking the preset layout text block information and giving a checking result;

and the identification subunit is used for identifying the verified preset layout text block information according to the verification result.

14. The information extraction apparatus according to any one of claims 11 to 13, wherein

the preset manuscript extracting unit is further configured to acquire a regular expression matching rule of the preset manuscript text block information, and to extract the preset manuscript text block information from the layout text block information according to the regular expression matching rule of the preset manuscript text block information; or,

when the feature information of the preset manuscript text block information comprises font information and position information, the preset manuscript extracting unit is further configured to: acquire a preset manuscript text block information set according to the font information of the preset manuscript text block information; preprocess the preset manuscript text block information set; extract a subset of the preset manuscript text block information from the preset manuscript text block information set according to the position information; and extract the preset manuscript text block information from the subset of the preset manuscript text block information according to the set regular expression matching rule of the preset manuscript text block information.

15. The information extraction apparatus according to claim 14, wherein the preset manuscript extraction unit includes:

the information reprocessing subunit is used for reprocessing the information of the subset of the preset manuscript text block information;

and the preset manuscript extracting subunit is used for extracting the preset manuscript text block information from the reprocessed subset of the preset manuscript text block information.

16. The information extraction apparatus according to claim 15, wherein the preset manuscript extraction unit further comprises:

the checking subunit is used for checking the preset manuscript text block information and giving a checking result;

and the identification subunit is used for identifying the verified preset manuscript text block information according to the verification result.