CN102103612A - Information extraction method and device - Google Patents

Publication number
CN102103612A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2009102430446A
Other languages
Chinese (zh)
Inventor
林欣欣
徐剑波
董宁
王辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Founder Group Co Ltd
Beijing Founder Apabi Technology Co Ltd
Original Assignee
Peking University Founder Group Co Ltd
Beijing Founder Apabi Technology Co Ltd
Application filed by Peking University Founder Group Co Ltd and Beijing Founder Apabi Technology Co Ltd
Priority to CN2009102430446A
Publication of CN102103612A

Abstract

The embodiment of the invention discloses an information extraction method and an information extraction device, relating to the technical field of information extraction, and aiming to solve the problem that, in the prior art, automatic indexing cannot extract preset text block information from the page information and manuscript information of a newspaper. The information extraction method disclosed by the embodiment of the invention comprises the following steps: extracting text block information from a layout file, wherein the text block information comprises page text block information and manuscript text block information; judging whether the preset layout text block information in the text block information has been extracted; if the preset layout text block information has not been extracted, extracting the preset layout text block information; and if the preset layout text block information has been extracted, extracting the preset manuscript text block information. By using the method and device disclosed by the embodiment of the invention, the workload of indexing personnel can be reduced and the accuracy of indexing can be improved.

Description

Information extraction method and device
Technical Field
The present invention relates to the field of information extraction technologies, and in particular, to an information extraction method and apparatus.
Background
With the rapid development of the internet and information technology, digitization in the newspaper publishing industry is also advancing rapidly. In the course of this digitization, the digitized information of newspaper resources has become a core digital asset of newspaper companies. The digitized information of a newspaper resource comprises manuscript information and layout information. Manuscript information includes the articles on the newspaper layout (body text, paragraphs, titles, etc.), the text in tables, picture contents, and so on. Layout information includes the newspaper layout, layout name, date, position information of each manuscript (such as coordinate information), format information such as the font and font size of titles and body text, and association information between articles and pictures and between pictures and captions.
To store the digitized information of newspaper resources completely and accurately as historical data for future queries, or to publish it accurately and in real time across media through various digital media technologies (for example a news website, a digital newspaper, or a compact disc), the digitized information can be obtained by using indexing software to reverse-parse a layout file from the layout information of the newspaper; the reverse-parsed digitized newspaper information is then indexed, modified, and checked.
However, in the process of implementing the present invention, the inventors found at least the following problem in the prior art: the computer-based automatic indexing used in the prior art cannot extract preset text block information from the page text block information and the manuscript text block information of a newspaper. For example, data such as the proofreader name, layout designer name, author name, and editor name must be indexed manually, one by one, by indexing personnel, so that their workload is large and the accuracy is low.
Disclosure of Invention
The embodiment of the invention provides an information extraction method and device, which are used for automatically extracting preset text block information from the page text block information and the manuscript text block information of a newspaper.
In order to achieve the above purpose, the embodiment of the invention adopts the following technical scheme:
in one aspect, an embodiment of the present invention provides an information extraction method, including:
extracting text block information from the layout file, wherein the text block information comprises: page text block information and manuscript text block information;
judging whether preset layout text block information in the text block information is extracted or not;
if the preset layout text block information is not extracted, extracting the preset layout text block information;
and if the preset layout text block information is extracted, extracting the preset manuscript text block information.
On the other hand, an embodiment of the present invention provides an information extraction apparatus, including:
a text block information extracting unit, configured to extract text block information from the layout file, where the text block information includes: page text block information and manuscript text block information;
the judging unit is used for judging whether the preset layout text block information in the text block information is extracted or not;
a preset layout extracting unit, configured to extract the preset layout text block information if the preset layout text block information is not extracted;
and the preset manuscript extracting unit is used for extracting the preset manuscript text block information if the preset layout text block information is extracted.
According to the information extraction method and device provided by the embodiment of the invention, judging whether the preset layout text block information in the text block information has been extracted prevents the same preset layout text block information from being extracted repeatedly. If the preset layout text block information has not been extracted, it is extracted, thereby realizing automatic extraction of the preset layout text block information; if the preset layout text block information has been extracted, the preset manuscript text block information is extracted, thereby realizing automatic extraction of the preset manuscript text block information.
Drawings
Fig. 1 is a flowchart of an information extraction method according to an embodiment of the present invention;
Fig. 2 is a flowchart of a specific implementation of the information extraction method according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of an information extraction apparatus according to an embodiment of the present invention.
Detailed Description
An information extraction method and an information extraction device provided by the embodiments of the present invention are described in detail below with reference to the accompanying drawings.
As shown in fig. 1, a method for extracting information according to an embodiment of the present invention is specifically implemented as follows:
101: extracting text block information from the layout file, wherein the text block information comprises: page text block information and manuscript text block information. The layout file can be understood as the digitized information that indexing software reverse-parses from a given layout of the newspaper; extracting text block information from the layout file therefore means extracting it from the digitized information of that newspaper layout.
102: judging whether preset layout text block information in the text block information is extracted or not;
103: if the preset layout text block information is not extracted, extracting the preset layout text block information;
104: and if the preset layout text block information is extracted, extracting the preset manuscript text block information.
According to the information extraction method and device provided by the embodiment of the invention, judging whether the preset layout text block information in the text block information has been extracted prevents the same preset layout text block information from being extracted repeatedly. If the preset layout text block information has not been extracted, it is extracted, thereby realizing automatic extraction of the preset layout text block information; if the preset layout text block information has been extracted, the preset manuscript text block information is extracted, thereby realizing automatic extraction of the preset manuscript text block information.
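The branch in steps 102 to 104 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: `ExtractionState`, `extract_layout`, `extract_manuscript`, and the `"Editor:"` / `"Author:"` prefixes are hypothetical names and placeholder rules introduced only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractionState:
    """Remembers whether the preset layout information was already extracted."""
    layout_extracted: bool = False
    layout_info: list = field(default_factory=list)
    manuscript_info: list = field(default_factory=list)


def extract_layout(page_blocks):
    # Placeholder rule (assumption): keep blocks that look like an editor credit.
    return [b for b in page_blocks if b.startswith("Editor:")]


def extract_manuscript(manuscript_blocks):
    # Placeholder rule (assumption): keep blocks that look like an author credit.
    return [b for b in manuscript_blocks if b.startswith("Author:")]


def process(state, page_blocks, manuscript_blocks):
    """One pass through steps 102-104."""
    if not state.layout_extracted:
        # Step 103: layout information not yet extracted, so extract it
        # and record that fact to avoid extracting it again.
        state.layout_info = extract_layout(page_blocks)
        state.layout_extracted = True
    else:
        # Step 104: layout information already extracted, so move on
        # to the manuscript information.
        state.manuscript_info = extract_manuscript(manuscript_blocks)
    return state
```

Calling `process` twice on the same state first fills `layout_info` and then, on the second call, fills `manuscript_info`, which mirrors how the judgment prevents repeated extraction of the same layout information.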
Based on the above embodiment, Fig. 2 shows a flowchart of a specific implementation of the information extraction method provided by the embodiment of the present invention. When certain preset layout text block information and preset manuscript text block information need to be extracted, the following procedure is carried out:
201: setting a regular expression matching rule for the preset layout text block information, a regular expression matching rule for the preset manuscript text block information, and feature information of the preset manuscript text block information. Both matching rules are expressed as regular expressions; the feature information of the preset manuscript text block information may include font information and position information. The preset layout text block information is extracted from the text block information through its regular expression matching rule, and the preset manuscript text block information is extracted from the text block information through its regular expression matching rule. To acquire the preset manuscript text block information more accurately, the matching range is first narrowed by means of the feature information of the preset manuscript text block information, and the preset manuscript text block information is then matched within that range.
202: extracting text block information from the layout file, wherein the text block information comprises: page text block information and manuscript text block information;
203: judging whether preset layout text block information in the text block information is extracted or not;
204: if the preset layout text block information is not extracted, extracting the preset layout text block information; the specific implementation process is as follows:
S11: if the preset layout text block information has not been extracted, acquiring the regular expression matching rule of the preset layout text block information, and extracting the preset layout text block information from the layout text block information according to that rule. The preset layout text block information can be an editor name, a proofreader name, a layout designer name, and the like in the layout information; its regular expression matching rule can be set according to the specific layout text block information that needs to be extracted.
S12: setting the extraction flag of the preset layout text block information to the extracted state.
It should be noted that, to ensure the accuracy of the extracted preset layout text block information, the following operations may be performed on it.
S13: verifying the preset layout text block information and giving a verification result. The specific verification process is as follows, assuming the preset layout text block information is an editor name in the layout text block information: the extracted editor name is matched against the names in a pre-stored editor name library. If the editor name exists in the library, the extracted preset layout text block information is considered correct, that is, the verification result is 100% correct; if the extracted editor name only partially matches a name in the library, or matches none, a correctness rate is given according to the matching state, that is, the verification result is 50% correct or 0% correct respectively.
S14: marking the verified preset layout text block information according to the verification result. For example: marking 100% correct preset layout text block information as white, 50% correct as yellow, and 0% correct as red.
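The verification and marking of S13 and S14 might look like the sketch below. The text does not define what a partial match is, so this example assumes that a partial match means one string contains the other; `verify_name`, `mark`, and `MARK_COLORS` are hypothetical names.

```python
# Colors used to mark each verification result, as in the S14 example.
MARK_COLORS = {100: "white", 50: "yellow", 0: "red"}


def verify_name(name, name_library):
    """Return 100, 50 or 0 depending on how well `name` matches the library."""
    if name in name_library:
        return 100
    # Assumption: a "partial match" means one string contains the other.
    if name and any(name in known or known in name for known in name_library):
        return 50
    return 0


def mark(name, name_library):
    """Return the name with its verification score and marking color."""
    score = verify_name(name, name_library)
    return name, score, MARK_COLORS[score]
```

For instance, with a library containing 张三 and 李四, the name 张三 scores 100 (white), 张三丰 scores 50 (yellow) because it contains 张三, and 王五 scores 0 (red).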
205: if the preset layout text block information is extracted, extracting the preset manuscript text block information; the specific implementation process can be as follows:
if the preset layout text block information is extracted, acquiring a regular expression matching rule of the preset manuscript text block information; and extracting the preset manuscript text block information from the layout text block information according to the regular expression matching rule of the preset manuscript text block information.
In order to extract the preset manuscript text block information more accurately, the process of extracting it in the embodiment of the invention can be realized as follows, assuming that the preset manuscript text block information to be extracted is an author name:
S21: when the feature information of the preset manuscript text block information comprises font information: if the preset layout text block information has been extracted, acquiring a preset manuscript text block information set according to the font information. For example, with the font information set to Heiti (boldface): if the preset layout text block information has been extracted, all text block information set in the Heiti font is extracted from the manuscript text block information and combined into a preset manuscript text block information set {T}.
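The font-based narrowing of S21 can be illustrated as a simple filter. This sketch assumes each text block carries its text and a font name; the `TextBlock` type, the `select_by_font` helper, and the font labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    text: str
    font: str  # font name attached to the block (assumed to be available)


def select_by_font(blocks, font_name):
    """Build the candidate set {T}: all manuscript blocks set in `font_name`."""
    return [b for b in blocks if b.font == font_name]
```

Filtering first by font keeps the later regular expression matching from running over body text, where name-like strings are far more likely to be false positives.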
In order to acquire the preset manuscript text block information still more accurately, the feature information in the embodiment of the present invention may further include position information, which further narrows the range from which the preset manuscript text block information is obtained. After the preset manuscript text block information set {T} is obtained, the following operations are carried out:
S22: when the feature information of the preset manuscript text block information further comprises position information, preprocessing the preset manuscript text block information set {T} and obtaining a preset manuscript text block information set {Ts} and a preset manuscript text block information set {Te} respectively. For example, the position information is set as: the position Ps, running from the beginning of the content of the preset manuscript text block information set to the first occurrence of the reference symbol; and/or the position Pe, running from the last occurrence of the reference symbol to the end of the content of the set.
The preprocessing of the preset manuscript text block information set {T} may specifically include handling unbalanced parentheses in the content T to be extracted, which inconsistent font descriptions may cause in the set {T}.
S23: extracting a subset {A} of the preset manuscript text block information from the preset manuscript text block information set {T} according to the position information. Specifically, the corresponding information a1 may first be extracted from the set {Ts} according to the position information Ps; if a1 is extracted, a1 is used as the subset {A}. If a1 is not extracted, the corresponding information a2 is extracted from the set {Te} according to the position information Pe, and a2 is used as the subset {A}.
S24: extracting the preset manuscript text block information from the subset of the preset manuscript text block information according to the set regular expression matching rules. Suppose the regular expression matching rules for the preset manuscript text block information have 4 matching levels: matching level 1 has 3 rules, matching level 2 has 3 rules, matching level 3 has 2 rules, and matching level 4 has 1 rule; the separator is a comma or a semicolon; and the rules of all matching levels together form a matching set. The step may specifically include:
acquiring the regular expression matching rules corresponding to each matching level from the matching set, in order of matching level; each rule is described as a regular expression. The procedure is as follows:
firstly, acquiring the 3 regular expression matching rules corresponding to matching level 1 from the matching set; the rules are as follows:
rule 1 may be: Λ (?/g;
the above regular expression searches the full text for "(", followed by non-newline characters zero or more times, followed by "reporter" or "reporter group" or "author" or "intern" or "correspondent" or "text/pickup" or "text/figure" or "illustration" or "cartoon" or "drawing" or "intern" or "/text" or "commentator" or "comment", followed by non-newline characters zero or more times, followed by ")".
Rule 2 may be: /\(\s*([\u4e00-\u9fa5]{2,5}\s+[\u4e00-\u9fa5]{2,5}\s*)+\)/g;
the above regular expression searches the full text for "(", followed by whitespace characters zero or more times, followed by one or more repetitions of: 2 to 5 Chinese characters, whitespace characters one or more times, 2 to 5 Chinese characters, and whitespace characters zero or more times; followed by ")".
Rule 3 may be: (reporter | reporter group | author | practice life | communicator | practice reporter | review person | cartography | caricature | illustration | draft) (| v | v [ \ u4e00- \ u9fa5] {2, 6} \ s (? (| n | take up | document | from | comprehensive report | document | take up | v | drawing | document and take up | photography report | drawing | extract integrate | take up | document | [ \ u4e00- \ u9fa5] {2, 5} special electricity | photography | document | map | report | writing | \ (present document | u4e00- \\ u9fa 2 [ \\ 387 3 | text >) | 3 h) | } document | writing | (present document 4e00- \\\\ u9fa 2) | 3 b3 | text >) | 3,387 3 | document | 3 h;
the regular expression "(reporter group | writer | intern | correspondent | intern | reviewer | cartographic | caricature | draft)" described above means matching "reporter" or "reporter group" or "author" or "intern" or "correspondent" or "intern" or "commenter" or "cartographic" or "illustration" or "draft";
the regular expression "(" \ s "|)" v \u4e00- \u9fa5] {2, 6} \ s "-" represents matching ": or" blank character "or"/"zero to infinite times, and matching 2 to 6 chinese characters, and matching" blank character "zero to infinite times;
the above regular expression "(?=" indicates that the suffix content begins;
the regular expression "($ n \ v \ | text \ | sent from | comprehensive report | text \ | v \ | draw | text and | photography report | v \ | draw | extract integrated | take \ | [ \ u4e00- \ u9fa5] {2, 5} special electric | photography | text \ | map | report | pick up and write \ | of this edition [ \\ u4e00- \ u9fa5] } electric \ | text [ \ u4e00- \ u9fa5] } is a suffix content, i.e., a string end or a" trailer return symbol "follows a matching position, or matches any of the following characters: "/take,"/text, "" issue, "" comprehensive report, "" text/take, "/draw," "text-by-text," "photographic report," "" "/draw," "sort," "extract integrate," "take," "photograph," "text/figure," "report," "pick up," or match "the newspaper" followed immediately by more than one Chinese character and finally followed immediately by "electricity" or match "the edition" followed immediately by more than one Chinese character;
the above "/g" indicates a global search, that is, every match in the full text is found.
Secondly, acquiring 3 regular expression matching rules corresponding to the matching level 2 from the matching set; the rule is as follows:
rule 1: v (|, | a | \ r | \ n | v-document-taking up | v-document-drawing | document and v-drawing | drawing | document v-document-figure | v-document arrangement | v-document exercise life | v-document \ s | v-document \ | [ u4e00- \ u9fa5] {2, 4} < s (? [ | document-drawing | document and v-drawing | document-v-document exercise life))/g;
the regular expression "(\\ or |, |? "or"! "or" \\ r "or" \ n "or"/draw "or" text/drawing "or" text/figure "or"/word arrangement "or"/trainee "or"/text ";
the regular expression "\s*[\u4e00-\u9fa5]{2,4}\s*" above means matching whitespace characters zero or more times, then 2 to 4 Chinese characters, then whitespace characters zero or more times;
the above regular expression "(?=" indicates that the suffix content begins;
the regular expression "((v. n.) | text/V. v; last immediately following ")" indicates that the suffix ends;
the above "/g" indicates a global search, that is, every match in the full text is found.
Rule 2: (reporter | reporter group | author | internist | correspondent | internist | commenting member) (\ s | v-v) + [ \ u4e00- \ u9fa5] {2, 4} (\ s + [ \ u4e00- \ u9fa5] {2, 6}) {1, } \ s (;
the above regular expression "(reporter | reporter group | author | intern | correspondent | intern | reviewer)" means: matching strings "reporter" or "reporter group" or "author" or "internist" or "correspondent" or "internist" or "commenter";
the regular expression "(\s*|\/)" above indicates that whitespace characters are matched zero or more times, or that "/" is matched; the following "+" means that "(\s*|\/)" is matched one or more times;
the regular expression "[\u4e00-\u9fa5]{2,4}" above indicates that 2 to 4 Chinese characters are matched;
in the regular expression "(\s+[\u4e00-\u9fa5]{2,6}){1,}" above, "(\s+[\u4e00-\u9fa5]{2,6})" indicates that whitespace characters are matched one or more times and then 2 to 6 Chinese characters are matched, and "{1,}" indicates that "\s+[\u4e00-\u9fa5]{2,6}" is matched one or more times;
the regular expression "\s*" above indicates that whitespace characters are matched zero or more times;
the above regular expression "(?=" indicates that the suffix content begins;
the regular expression "($ n \/take \/from | integrated report | v \/take \/draw | text and take | photography report | v \/draw | finish | take \/u 4e00- \/u 9fa5] {2, 5} private tv | photography | text graph | report | pick up |)" is the suffix content immediately above, indicating that the matching position is a character string ending or a carriage return symbol or a "/take \" or "from" or "integrated report \" or "text/take \/draw \" or "text and take \" or "photography report \" or "channel photography \/draw \" or "finish \" or "chinese take \" or 2 to 5 characters followed by "private tv \" or "photography \" or "text/draw \" or "report | or" pick up \ "or" report "; last immediately following ")" indicates that the suffix ends;
the above "/g" indicates a global search, that is, every match in the full text is found.
Rule 3: v. (| ● | □ |. circa) \ s;
the above regular expression "(. diamond. | ●. □. circa.). cndot." indicates that the string matches ". diamond." or "●" or "□" or ". circa.";
the regular expression "\s*.*" above indicates that whitespace characters are matched zero or more times and then non-newline characters are matched zero or more times;
the above regular expression "(?=" indicates that the suffix content begins;
the regular expression "($ r \\ n)" represents the suffix content, and the matching string ends or the return line break is immediately followed ")" represents the suffix end;
the above "/g" indicates a global search, that is, every match in the full text is found.
Thirdly, obtaining 2 regular expression matching rules corresponding to the matching level 3 from the matching set; the rule is as follows:
rule 1: Λ (\\ s [ \ u4e00- \ u9fa5] {2, 4} (\ s + [ \ u4e00- \ u9fa5] {2, 6}) {1, } \ s \ g;
the regular expression "\s*[\u4e00-\u9fa5]{2,4}" above means that whitespace characters are matched zero or more times and then 2 to 4 Chinese characters are matched;
the regular expression "(\s+[\u4e00-\u9fa5]{2,6})" above indicates that whitespace characters are matched one or more times and then 2 to 6 Chinese characters are matched;
the regular expression "{1,}" above indicates that "(\s+[\u4e00-\u9fa5]{2,6})" is matched one or more times;
the regular expression "\s*" above indicates that whitespace characters are matched zero or more times;
the above "/g" indicates a global search, that is, every match in the full text is found;
rule 2: re ═ Λ (\ s [ \ u4e00- \ u9fa5] {2, 4} \ s \ g;
the above regular expression matches whitespace characters zero or more times, then 2 to 4 Chinese characters, then whitespace characters zero or more times;
wherein "/g" indicates a global search, that is, every match in the full text is found;
finally, 1 regular expression matching rule corresponding to the matching level 4 is obtained from the matching set; the rule is as follows:
rule 1: v (\ s + | | \;
the above regular expression "(\\ s + | | \? "or". "or"! ";
the regular expression "[\u4e00-\u9fa5]{2,4}\s*" above means matching 2 to 4 Chinese characters and then whitespace characters zero or more times;
the above regular expression "(?=" indicates that the suffix content begins;
the remainder of the above regular expression is the suffix content; the final ")" indicates that the suffix ends;
where "/g" indicates a global search, that is, every match in the full text is found.
According to the obtained regular expression matching rules, content matching is performed on the content in the subset of the preset manuscript text block information, and a matching result is given. For example: the 3 regular expression matching rules of matching level 1 are matched against the content in the subset, extracting "author, Wang Yi", which is added to a set {B}; the 3 rules of matching level 2 are then matched against the content, and no information is extracted; next, the 2 rules of matching level 3 are matched against the content, extracting "correspondent, Zhao Er", which is added to the set {B}; finally, the 1 rule of matching level 4 is matched against the content, extracting "editor Zhang San", which is added to the set {B}. The set {B} is then {author, Wang Yi, correspondent, Zhao Er, editor Zhang San}.
Once the set {B} = {author, Wang Yi, correspondent, Zhao Er, editor Zhang San} is acquired, keyword filtering can be performed on the matching results according to the corresponding filtering rules: the author name "Wang Yi" is obtained and extracted into an author name set {B1}; likewise, the correspondent name "Zhao Er" is extracted into a correspondent name set {B2}, and the editor name "Zhang San" into an editor name set {B3}. Keyword filtering removes the keywords, such as "author", "editor", and "correspondent".
It should be noted that the results obtained by keyword filtering may contain several results separated by specific punctuation marks (e.g. commas or semicolons), such as {Wang Yi, Zhao Er, Zhang San}, so the result set must be extracted again: the character string is split using those punctuation marks as separators to obtain the individual results. For example, "Wang Yi" is added to the result set {A1}, "Zhao Er" to the result set {A2}, and "Zhang San" to the result set {A3}.
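The level-by-level matching described above can be sketched with Python's `re` module. The two patterns below are simplified illustrations written for this example, not the patent's actual expressions, and `LEVELS`, `CJK`, and `match_by_level` are hypothetical names.

```python
import re

# Character class for common Chinese characters, as used in the rules above.
CJK = r"[\u4e00-\u9fa5]"

# Illustrative rule tiers only (assumption): NOT the patent's actual rules.
LEVELS = [
    # Level 1: a role keyword (author / reporter) followed by a 2-4 character name.
    [re.compile(rf"(?:作者|记者)[:：\s]*{CJK}{{2,4}}")],
    # Level 2: two names separated by whitespace inside parentheses.
    [re.compile(rf"[(（]\s*{CJK}{{2,5}}\s+{CJK}{{2,5}}\s*[)）]")],
]


def match_by_level(text, levels):
    """Apply each level's rules in order and collect the matches into set B."""
    results = []
    for rules in levels:                  # levels are tried in ascending order
        for rule in rules:
            for m in rule.finditer(text):  # finditer plays the role of "/g"
                if m.group(0) not in results:
                    results.append(m.group(0))
    return results
```

Here `finditer` stands in for the global "/g" flag of the original notation, and a level that finds nothing simply contributes no entries, as in the level 2 case of the example above.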
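Keyword filtering and the subsequent splitting on separators could be sketched as below. The keyword list and the helper names `strip_keywords` and `split_names` are assumptions for illustration; the text associates a keyword replacement rule with each matching rule, which this sketch collapses into one shared list.

```python
import re

# Role keywords to remove (assumption): author, correspondent, editor.
ROLE_KEYWORDS = ("作者", "通讯员", "编辑")


def strip_keywords(match, keywords=ROLE_KEYWORDS):
    """Remove role keywords and surrounding colons/spaces from a match."""
    for kw in keywords:
        match = match.replace(kw, "")
    return match.strip(" :：")


def split_names(value):
    """Split on commas or semicolons (ASCII or full-width) into single names."""
    return [part.strip() for part in re.split(r"[,，;；]", value) if part.strip()]
```

A plain `replace` would also damage a personal name that happens to contain a keyword, so a production rule set would probably anchor the keyword to the start of the match.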
It should be noted that the matching level can be optimized according to experimental statistics.
The regular expression matching rules are all expressed as regular expressions and are formed by combining a number of keywords. The relevant parameters are described with the rules above, and the rules can also be configured for the specific use case. Each regular expression matching rule corresponds to a keyword replacement rule. Setting rules at multiple levels makes it possible to extract as many author names as possible, including the names of reporters, correspondents, photographers, news gatherers and editors, interns, text compilers, commentators, and so on.
S25: reprocessing the subset of the preset manuscript text block information. The specific implementation of this step can comprise: merging the result sets {A1}, {A2}, {A3}, ... into a result set {A}; then performing a second keyword filtering pass on the result set {A} to remove duplicates and catch omissions. Specifically, information items with identical content are removed from the result set {A}, and keyword filtering is applied to the result set {A} again.
S26: and extracting the preset manuscript text block information from the reprocessed subset of the preset manuscript text block information.
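The merge and de-duplication in S25 can be sketched as an order-preserving union. `merge_result_sets` is a hypothetical helper, and the second keyword filtering pass is omitted here for brevity.

```python
def merge_result_sets(*result_sets):
    """Merge {A1}, {A2}, ... into one list {A}, dropping repeated names."""
    merged = []
    seen = set()
    for result_set in result_sets:
        for name in result_set:
            cleaned = name.strip()
            # Keep only the first occurrence of each non-empty name.
            if cleaned and cleaned not in seen:
                seen.add(cleaned)
                merged.append(cleaned)
    return merged
```

Keeping first-occurrence order, rather than using a plain `set`, preserves the order in which names appeared in the manuscript, which is usually the byline order.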
It should be noted that the method further comprises:
S27: verifying the preset manuscript text block information and giving a verification result. Specifically, pre-stored dictionary information can be used to verify the accuracy of the extracted preset manuscript text block information, i.e., the author name set {A}, as follows:
Step 1: obtain each author A in turn and compare it with the established author name dictionary to check whether every author exists in it. If all of them exist, the accuracy of the author set {A} is marked as 100%. If only some of them match, or none match at all, the accuracy of the author set {A} is marked as 60% or 0, respectively.
Step 2: set up a Chinese surname dictionary with a coverage rate of 95%, and perform a second accuracy calculation on any author set whose accuracy is not 100%. Obtain the first character of the author string and compare it with the surname dictionary; if it is present, raise the accuracy. If it is not, obtain the first two characters of the author string and compare them with the surname dictionary; if they are present, raise the accuracy, and otherwise lower it.
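The two verification steps can be sketched as follows. The 100%, 60%, and 0 scores come from Step 1 above; the size of the adjustment in Step 2 is not stated in the embodiment, so the `step` value of 10 percentage points, the per-author adjustment, the clamping to the range 0 to 100, and both function names are assumptions made for illustration.

```python
def dictionary_accuracy(authors, author_dict):
    """Step 1: 100 if every author is in the dictionary, 60 if some, 0 if none."""
    hits = sum(1 for a in authors if a in author_dict)
    if authors and hits == len(authors):
        return 100
    return 60 if hits else 0

def surname_adjusted(authors, accuracy, surname_dict, step=10):
    """Step 2: adjust a non-100 score using the first one or two characters."""
    if accuracy == 100:
        return accuracy
    for author in authors:
        if author[:1] in surname_dict or author[:2] in surname_dict:
            accuracy += step     # surname found: raise the score (assumed amount)
        else:
            accuracy -= step     # no surname found: lower the score (assumed amount)
    return max(0, min(100, accuracy))
```

Checking the first two characters after the first one allows two-character surnames to be recognized when the single-character lookup fails.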
S28: marking the verified preset manuscript text block information according to the verification result.
As shown in fig. 3, an information extracting apparatus provided in an embodiment of the present invention includes:
a text block information extracting unit 301, configured to extract text block information from the layout file, where the text block information includes: layout text block information and manuscript text block information;
a judging unit 302, configured to judge whether preset layout text block information in the text block information is extracted;
a preset layout extracting unit 303, configured to extract the preset layout text block information if the preset layout text block information is not extracted;
a preset manuscript extracting unit 304, configured to extract preset manuscript text block information if the preset layout text block information has been extracted.
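The cooperation of units 301 to 304 can be sketched as a simple dispatch on an extraction flag. This is a minimal sketch under stated assumptions: the class name `InformationExtractor`, the boolean flag, and the two callback parameters are hypothetical stand-ins for the units, not the apparatus itself.

```python
class InformationExtractor:
    """Sketch of units 301-304; the extraction callbacks are placeholders."""

    def __init__(self, extract_layout, extract_manuscript):
        self.extract_layout = extract_layout          # stands in for unit 303
        self.extract_manuscript = extract_manuscript  # stands in for unit 304
        self.layout_extracted = False                 # flag checked by unit 302
        self.layout_info = None

    def process(self, text_blocks):
        # Judging unit 302: has the preset layout information been extracted?
        if not self.layout_extracted:
            self.layout_info = self.extract_layout(text_blocks)
            self.layout_extracted = True              # avoid repeated extraction
            return ("layout", self.layout_info)
        return ("manuscript", self.extract_manuscript(text_blocks))
```

Because the flag is set after the first call, later calls on the same layout go straight to manuscript extraction.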
It should be noted that the apparatus further comprises:
a setting unit, configured to set the regular expression matching rule of the preset layout text block information, the regular expression matching rule of the preset manuscript text block information, and the feature information of the preset manuscript text block information.
It should be noted that the preset layout extracting unit 303 includes:
a rule obtaining subunit, configured to obtain a regular expression matching rule of the preset layout text block information;
a preset layout extracting subunit, configured to extract the preset layout text block information from the layout text block information according to a regular expression matching rule of the preset layout text block information;
and a mark setting subunit, configured to set the extraction mark of the preset layout text block information to an extracted state.
It should be further noted that the preset layout extracting unit 303 further includes:
a checking subunit, configured to verify the preset layout text block information and give a verification result;
and an identification subunit, configured to mark the verified preset layout text block information according to the verification result.
It should be further noted that the preset manuscript extraction unit 304 is further configured to obtain a regular expression matching rule of the preset manuscript text block information, and extract the preset manuscript text block information from the layout text block information according to the regular expression matching rule of the preset manuscript text block information; or,
when the feature information of the preset manuscript text block information comprises font information and position information, the preset manuscript extracting unit 304 is further configured to: obtain a preset manuscript text block information set according to the font information of the preset manuscript text block information; preprocess the preset manuscript text block information set; extract a subset of the preset manuscript text block information from the preset manuscript text block information set according to the position information; and extract the preset manuscript text block information from the subset of the preset manuscript text block information according to the set regular expression matching rule of the preset manuscript text block information.
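The feature-based path can be sketched as a four-stage filter. Everything concrete here is an assumption made for illustration: the `TextBlock` structure, the use of a single vertical coordinate as the position information, whitespace normalization as the preprocessing step, and the function name `extract_by_features`.

```python
import re
from dataclasses import dataclass

@dataclass
class TextBlock:
    text: str
    font: str
    y: float  # vertical position on the page; units are an assumption

def extract_by_features(blocks, font, y_min, y_max, rule):
    # 1. Font information selects the candidate set.
    candidates = [b for b in blocks if b.font == font]
    # 2. Preprocessing (assumed here to be whitespace normalization only).
    for b in candidates:
        b.text = " ".join(b.text.split())
    # 3. Position information narrows the set to a subset.
    subset = [b for b in candidates if y_min <= b.y <= y_max]
    # 4. The regular expression matching rule extracts the final values.
    return [m.group(1) for b in subset for m in rule.finditer(b.text)]
```

Filtering by font and position before applying the regular expression keeps the rule from matching similar-looking text elsewhere in the layout.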
It should be further noted that the preset manuscript extracting unit 304 includes:
an information reprocessing subunit, configured to reprocess the subset of the preset manuscript text block information;
and a preset manuscript extracting subunit, configured to extract the preset manuscript text block information from the reprocessed subset of the preset manuscript text block information.
It should be further noted that the preset manuscript extracting unit 304 further includes:
a checking subunit, configured to verify the preset manuscript text block information and give a verification result;
and an identification subunit, configured to mark the verified preset manuscript text block information according to the verification result.
According to the information extraction method and device provided by the embodiments of the invention, judging whether the preset layout text block information in the text block information has been extracted prevents the same preset layout text block information from being extracted repeatedly. If the preset layout text block information has not been extracted, it is extracted, which realizes automatic extraction of the preset layout text block information; if it has been extracted, the preset manuscript text block information is extracted, which realizes automatic extraction of the preset manuscript text block information. Compared with the prior art, the embodiments of the invention not only extract the preset layout text block information and the preset manuscript text block information automatically, but can also compare the extracted information against pre-stored library information, which improves extraction accuracy and greatly reduces the workload of indexing personnel. In addition, when the preset manuscript text block information is extracted, the feature information narrows the range to be searched, which further improves the accuracy of extracting the preset manuscript text block information.
Through the above description of the embodiments, one of ordinary skill in the art can understand that all or part of the steps of the above method embodiments may be implemented by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes, for example, ROM/RAM, a magnetic disk, an optical disk, and the like.
The above description covers only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.

Claims (16)

1. An information extraction method, comprising:
extracting text block information from the layout file, wherein the text block information comprises: layout text block information and manuscript text block information;
judging whether preset layout text block information in the text block information is extracted or not;
if the preset layout text block information is not extracted, extracting the preset layout text block information;
and if the preset layout text block information is extracted, extracting the preset manuscript text block information.
2. The information extraction method according to claim 1, characterized by further comprising:
setting a regular expression matching rule of the preset layout text block information, a regular expression matching rule of the preset manuscript text block information, and feature information of the preset manuscript text block information.
3. The information extraction method according to claim 2, wherein the step of extracting the preset layout text block information includes:
acquiring a regular expression matching rule of the preset layout text block information;
extracting the preset layout text block information from the layout text block information according to the regular expression matching rule of the preset layout text block information;
and setting the extraction mark of the preset layout text block information to an extracted state.
4. The information extraction method according to claim 3, wherein the step of extracting the preset layout text block information further comprises:
verifying the preset layout text block information and giving a verification result;
and marking the verified preset layout text block information according to the verification result.
5. The information extraction method according to any one of claims 2 to 4, wherein the step of extracting the preset manuscript text block information comprises:
acquiring a regular expression matching rule of the preset manuscript text block information;
and extracting the preset manuscript text block information from the layout text block information according to the regular expression matching rule of the preset manuscript text block information.
6. The information extraction method according to any one of claims 2 to 4, wherein when the feature information of the preset manuscript text block information comprises font information, the step of extracting the preset manuscript text block information further comprises:
acquiring a preset manuscript text block information set according to the font information of the preset manuscript text block information.
7. The information extraction method according to claim 6, wherein when the feature information of the preset manuscript text block information further comprises position information, the step of extracting the preset manuscript text block information further comprises:
preprocessing the preset manuscript text block information set;
extracting a subset of the preset manuscript text block information from the preset manuscript text block information set according to the position information;
and extracting the preset manuscript text block information from the subset of the preset manuscript text block information according to the regular expression matching rule of the preset manuscript text block information.
8. The information extraction method according to claim 7, wherein the step of extracting the preset manuscript text block information from the subset of the preset manuscript text block information according to a regular expression matching rule of the preset manuscript text block information comprises:
carrying out information reprocessing on the subset of the preset manuscript text block information;
and extracting the preset manuscript text block information from the reprocessed subset of the preset manuscript text block information.
9. The information extraction method according to claim 8, wherein the step of extracting the preset manuscript text block information further comprises:
verifying the preset manuscript text block information and giving a verification result;
and marking the verified preset manuscript text block information according to the verification result.
10. An information extraction apparatus characterized by comprising:
a text block information extracting unit, configured to extract text block information from the layout file, where the text block information includes: layout text block information and manuscript text block information;
a judging unit, configured to judge whether the preset layout text block information in the text block information has been extracted;
a preset layout extracting unit, configured to extract the preset layout text block information if the preset layout text block information has not been extracted;
and a preset manuscript extracting unit, configured to extract the preset manuscript text block information if the preset layout text block information has been extracted.
11. The information extraction apparatus according to claim 10, characterized by further comprising:
a setting unit, configured to set the regular expression matching rule of the preset layout text block information, the regular expression matching rule of the preset manuscript text block information, and the feature information of the preset manuscript text block information.
12. The information extraction apparatus according to claim 11, wherein the preset layout extraction unit includes:
a rule obtaining subunit, configured to obtain a regular expression matching rule of the preset layout text block information;
a preset layout extracting subunit, configured to extract the preset layout text block information from the layout text block information according to a regular expression matching rule of the preset layout text block information;
and a mark setting subunit, configured to set the extraction mark of the preset layout text block information to an extracted state.
13. The information extraction apparatus according to claim 12, wherein the preset layout extraction unit further includes:
a checking subunit, configured to verify the preset layout text block information and give a verification result;
and an identification subunit, configured to mark the verified preset layout text block information according to the verification result.
14. The information extraction apparatus according to any one of claims 11 to 13, wherein
the preset manuscript extracting unit is further configured to acquire a regular expression matching rule of the preset manuscript text block information, and to extract the preset manuscript text block information from the layout text block information according to the regular expression matching rule of the preset manuscript text block information; or,
when the feature information of the preset manuscript text block information comprises font information and position information, the preset manuscript extracting unit is further configured to: acquire a preset manuscript text block information set according to the font information of the preset manuscript text block information; preprocess the preset manuscript text block information set; extract a subset of the preset manuscript text block information from the preset manuscript text block information set according to the position information; and extract the preset manuscript text block information from the subset of the preset manuscript text block information according to the set regular expression matching rule of the preset manuscript text block information.
15. The information extraction apparatus according to claim 14, wherein the preset manuscript extraction unit includes:
an information reprocessing subunit, configured to reprocess the subset of the preset manuscript text block information;
and a preset manuscript extracting subunit, configured to extract the preset manuscript text block information from the reprocessed subset of the preset manuscript text block information.
16. The information extraction apparatus according to claim 15, wherein the preset manuscript extraction unit further comprises:
a checking subunit, configured to verify the preset manuscript text block information and give a verification result;
and an identification subunit, configured to mark the verified preset manuscript text block information according to the verification result.
CN2009102430446A 2009-12-22 2009-12-22 Information extraction method and device Pending CN102103612A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009102430446A CN102103612A (en) 2009-12-22 2009-12-22 Information extraction method and device

Publications (1)

Publication Number Publication Date
CN102103612A true CN102103612A (en) 2011-06-22

Family

ID=44156389

Country Status (1)

Country Link
CN (1) CN102103612A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1912874A (en) * 2006-08-30 2007-02-14 北京大学 Method for abstracting document data information appeared in newspaper
CN101561802A (en) * 2008-04-18 2009-10-21 上海复旦光华信息科技股份有限公司 Web page structural data extraction method and system

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102982027A (en) * 2011-09-02 2013-03-20 北大方正集团有限公司 Method and device for abstracting contents in document
CN103425651A (en) * 2012-05-15 2013-12-04 北大方正集团有限公司 Method and equipment for detecting data integrity
CN103425651B (en) * 2012-05-15 2017-10-24 北大方正集团有限公司 A kind of method and apparatus of data integrity detection
CN102841888A (en) * 2012-09-14 2012-12-26 《中国学术期刊(光盘版)》电子杂志社 Rapid typesetting system and method
CN102841888B (en) * 2012-09-14 2015-10-14 《中国学术期刊(光盘版)》电子杂志社有限公司 A kind of composing system and method fast
CN104679875A (en) * 2015-03-10 2015-06-03 杭州凡闻科技有限公司 Method for classifying information data based on digital newspaper
CN104679875B (en) * 2015-03-10 2017-12-15 杭州凡闻科技有限公司 A kind of information data classification method based on digital newspaper
CN106933783A (en) * 2015-12-31 2017-07-07 远光软件股份有限公司 A kind of method and device on the intelligent extraction date from text
CN107633074A (en) * 2017-09-22 2018-01-26 咪咕文化科技有限公司 Information extraction method and device and storage medium
CN107633074B (en) * 2017-09-22 2020-06-09 咪咕文化科技有限公司 Information extraction method and device and storage medium
CN109299737A (en) * 2018-09-19 2019-02-01 语联网(武汉)信息技术有限公司 Choosing method, device and the electronic equipment of interpreter's gene
CN109299737B (en) * 2018-09-19 2021-10-26 语联网(武汉)信息技术有限公司 Translator gene selection method and device and electronic equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20110622