CN101354727A - Method and apparatus for establishing links between digital document catalog and text - Google Patents

Publication number
CN101354727A
Authority
CN
China
Prior art keywords
page
information
confidence
degree
entry
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2008102227847A
Other languages
Chinese (zh)
Other versions
CN101354727B (en
Inventor
高良才
褚一民
陶欣
汤帜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Founder Holdings Development Co ltd
Peking University
Founder Apabi Technology Ltd
Original Assignee
Peking University
Peking University Founder Group Co Ltd
Beijing Founder Apabi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University, Peking University Founder Group Co Ltd, Beijing Founder Apabi Technology Co Ltd filed Critical Peking University
Priority to CN2008102227847A priority Critical patent/CN101354727B/en
Publication of CN101354727A publication Critical patent/CN101354727A/en
Application granted granted Critical
Publication of CN101354727B publication Critical patent/CN101354727B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a method and a device for establishing links between the catalog (table of contents) of a digital document and its body text. The links are established automatically, which improves the efficiency of link creation. The method comprises the following steps: at least one catalog item is obtained from each stored catalog entry; according to the at least one catalog item, the logical page corresponding to each catalog entry is determined in the digital document; and a link is established between each catalog entry and its corresponding logical page. By establishing these links automatically, the scheme effectively improves the efficiency of link creation and thereby speeds up production of the digital document.

Description

A method and device for establishing links between a digital document catalog and the body text
Technical field
The present invention relates to the field of document processing, and in particular to a method and device for establishing links between a digital document catalog and the body text.
Background
The catalog of a document serves as an index to the document and makes retrieval and reading easier. In a digital document, readers generally expect that clicking a catalog entry jumps to the body text corresponding to that entry, which speeds up both content lookup and reading.
In a paper document, non-body content such as the copyright page, catalog pages, preface, foreword, postscript, appendices and references appears at the front of the document or between chapters. When pages are numbered, each such part generally has its own page-number sequence. For example, if the catalog occupies 5 pages, those pages are numbered 1 to 5 in their own sequence. The body text then begins on the 6th page of the document, yet in the body's own sequence that page is page 1, and the catalog also records it as page 1. That page number therefore does not represent the page's true logical page number.
When a paper document is converted to a digital document by optical character recognition (OCR), or when a digital document is generated directly by document processing software such as Adobe Acrobat or Founder's FeiTeng typesetting software, the digital document records the logical page of each page, that is, the position of each page within the digital document taken as a whole. Each catalog entry recorded in the catalog, however, is marked with the natural page number of its content, which has no direct correspondence to the logical pages of the digital document. In the prior art, links between the catalog pages and the body text of a digital document are generally created manually, which is inefficient, slow, and not very accurate.
Summary of the invention
In view of this, embodiments of the invention provide a method and device for establishing links between a digital document catalog and the body text, so that the links are established automatically and the efficiency of link creation is improved.
An embodiment of the invention provides a method for establishing links between a digital document catalog and the body text, wherein the digital document catalog comprises a plurality of catalog entries and each catalog entry comprises at least one catalog item, the method comprising:
obtaining at least one catalog item from each stored catalog entry and, according to the at least one catalog item, determining in the digital document the logical page corresponding to each catalog entry;
establishing a link between each catalog entry and its corresponding logical page.
An embodiment of the invention provides a device for establishing links between a digital document catalog and the body text, wherein the digital document catalog comprises a plurality of catalog entries and each catalog entry comprises at least one catalog item, the device comprising:
a logical page identification module, configured to obtain at least one catalog item from each stored catalog entry and, according to the at least one catalog item, determine in the digital document the logical page corresponding to each catalog entry;
a link establishing module, configured to establish a link between each catalog entry and its corresponding logical page.
With the method provided by the embodiment, at least one catalog item is obtained from the stored catalog entry information and matched against the pages of the digital document; the logical page corresponding to each catalog entry is determined from the matching result, and a link is then established between each catalog entry and that logical page. Establishing the links automatically in this way effectively improves the efficiency of link creation and, in turn, the production efficiency of the digital document.
Description of drawings
Fig. 1 is a flowchart of a method for automatically establishing links between a digital document catalog and the body text, provided by an embodiment of the invention;
Fig. 2 is a flowchart of a specific implementation of the method for establishing links between a digital document catalog and the body text, provided by an embodiment of the invention;
Fig. 3 is a flowchart of a method, provided by an embodiment of the invention, for determining the logical page corresponding to each catalog entry according to the page-number item in the catalog entry;
Fig. 4 is a schematic diagram of determining character coordinates, provided by an embodiment of the invention;
Fig. 5 is a flowchart of a method, provided by an embodiment of the invention, for determining the logical page corresponding to each catalog entry according to the title item in the catalog entry;
Fig. 6A is a flowchart of a specific worked implementation of the method for establishing links between a digital document catalog and the body text, provided by an embodiment of the invention;
Fig. 6B is a flowchart of a method, provided by an embodiment of the invention, for determining the logical page according to page-number item information and title item information;
Fig. 7 shows a catalog page of a digital document, provided by an embodiment of the invention;
Fig. 8 shows a body text page of a digital document, provided by an embodiment of the invention;
Fig. 9 is a structural diagram of a device for establishing links between a digital document catalog and the body text, provided by an embodiment of the invention.
Embodiment
To establish links between a digital document catalog and the body text automatically and to improve the efficiency of link creation, an embodiment of the invention provides, as shown in Fig. 1, a method for establishing such links, wherein the digital document catalog comprises a plurality of catalog entries and each catalog entry comprises at least one catalog item. The method comprises the following steps:
S101: obtain at least one catalog item from each stored catalog entry and, according to the at least one catalog item, determine in the digital document the logical page corresponding to each catalog entry.
The at least one catalog item obtained comprises page-number item information and/or title item information.
S102: establish a link between each catalog entry and its corresponding logical page.
The embodiments of the invention are described in detail below with reference to the accompanying drawings.
The digital document used in the embodiments can be read page by page. The characters of each page and the coordinates of each character on its page can be obtained, and the font information of the text, such as typeface and font size, can be identified.
As shown in Fig. 2, the method for establishing links between a digital document catalog and the body text in an embodiment of the invention comprises the following steps:
S201: read in the digital document and obtain the stored information of each catalog entry.
The stored catalog entries are derived from the recognized catalog of the digital document, with each line of the catalog treated as one catalog entry. A catalog entry comprises: a section-number item, which is the chapter or section number that the catalog line represents, for example "Chapter 2"; or a title item, which is the title that the catalog line represents, that is, the text after the section number and before the page number; or a page-number item, which is the natural page number on which the section begins, as given in the catalog line.
S202: determine the logical page corresponding to each catalog entry according to at least one catalog item in each catalog entry.
The at least one catalog item in each catalog entry comprises the page-number item information in the catalog entry, the title item information in the catalog entry, or a combination of both.
S203: establish a link between each catalog entry and its corresponding logical page.
Fig. 3 shows the method, provided by an embodiment of the invention, for determining the logical page corresponding to each catalog entry according to the page-number item information in the catalog entry. It comprises the following steps:
S301: according to the page number in the page-number item of each catalog entry, determine the candidate pages among which the logical page corresponding to that page number lies.
The candidate pages are determined from a preset relation between the page number and the candidate pages. Specifically, according to the page number in the page-number item, the total number of catalog pages of the digital document, and a set range threshold parameter, the candidate pages among which the logical page corresponding to the catalog entry lies are determined.
When the page number in the page-number item is n, the total number of catalog pages of the digital document is K, and the set range threshold parameter is D, each candidate page N for page number n satisfies: n+K-D≤N≤n+K+D. The value of D can be set flexibly as needed in practice; a suitable range threshold improves the efficiency of link creation while still meeting the accuracy requirement.
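The relation n+K-D≤N≤n+K+D can be sketched in a few lines of Python. This is a minimal illustration only: the function name `candidate_pages` and the optional clamping to the range 1..`total_pages` are assumptions added here and are not stated in the description.

```python
from typing import List, Optional


def candidate_pages(n: int, k: int, d: int,
                    total_pages: Optional[int] = None) -> List[int]:
    """Return every candidate logical page N with n + k - d <= N <= n + k + d.

    n: page number printed in the catalog entry
    k: total number of catalog pages
    d: range threshold parameter
    total_pages: optional document length; clamping to it is an assumption
    """
    low = max(1, n + k - d)          # assumed: logical pages start at 1
    high = n + k + d
    if total_pages is not None:
        high = min(high, total_pages)
    return list(range(low, high + 1))
```

With n = 20, K = 5 and D = 3, the values used in the worked example later in the description, the function yields logical pages 22 through 28.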
S302: extract valid information from each candidate page.
Specifically: according to the stored type-area information and the coordinates of each character in each candidate page, the characters lying outside the type area are identified, and digit characters are extracted from them. In other words, the characters in the header and footer outside the type area are identified, and the digit characters among them are extracted. The stored type-area information comprises the top, bottom, left and right boundary lines of the type area.
The coordinates of a character are determined from its minimum bounding rectangle and are expressed by the coordinates of two diagonally opposite vertices of that rectangle. As shown in Fig. 4, the coordinates of the character "order" can be expressed by vertices 1 and 3, or by vertices 2 and 4. If vertices 1 and 3 are used, the coordinates of the character are written (x_1, y_1, x_2, y_2), where x_1 is the abscissa of vertex 1 (its distance from the y axis), y_1 is the ordinate of vertex 1 (its distance from the x axis), x_2 is the abscissa of vertex 3 (its distance from the y axis), and y_2 is the ordinate of vertex 3 (its distance from the x axis).
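The extraction in S302 can be illustrated with a short Python sketch. The class and function names (`Char`, `TypeArea`, `margin_digits`) are hypothetical, and the sketch assumes that the ordinate grows downward, so that vertex 1 is the top-left corner and vertex 3 the bottom-right corner; this matches the numeric example later in the description but is not stated explicitly there.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Char:
    text: str
    x1: float  # abscissa of vertex 1 (assumed left edge)
    y1: float  # ordinate of vertex 1 (assumed top edge)
    x2: float  # abscissa of vertex 3 (assumed right edge)
    y2: float  # ordinate of vertex 3 (assumed bottom edge)


@dataclass(frozen=True)
class TypeArea:
    left: float
    top: float
    right: float
    bottom: float


def outside_type_area(ch: Char, area: TypeArea) -> bool:
    """True if any edge of the bounding box crosses a type-area boundary."""
    return (ch.y1 < area.top or ch.y2 > area.bottom
            or ch.x1 < area.left or ch.x2 > area.right)


def margin_digits(chars: List[Char], area: TypeArea) -> List[Char]:
    """Keep only digit characters lying outside the type area."""
    return [c for c in chars if c.text.isdigit() and outside_type_area(c, area)]


# Boundary values and the "7" box come from the worked example in the
# description; the body-text "3" is an invented character for contrast.
area = TypeArea(left=0, top=80.73, right=485, bottom=697.10)
chars = [
    Char("7", 421.05, 699.83, 425.76, 706.94),  # in the footer
    Char("3", 100.0, 300.0, 105.0, 310.0),      # inside the type area
]
footer_digits = margin_digits(chars, area)
```

Here only the "7" survives, because its bottom ordinate 706.94 exceeds the bottom boundary 697.10.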
S303: merge the extracted valid information.
For the extracted digit characters, judge whether the spacing between each pair of digit characters exceeds a set spacing threshold. If the spacing between two digit characters does not exceed the threshold, the two are merged into one digit string; otherwise they are treated as two independent digit strings.
The spacing can be judged from the coordinates of the two digit characters, determined as shown in Fig. 4. First, judge whether the two characters can be regarded as lying in the same line by comparing their ordinates: if the absolute difference between the corresponding ordinates of the two digit characters is less than a set first condition value, they are in the same line; otherwise they are in different lines. "Corresponding" means that if the ordinate of vertex 3 is used for the first digit character, the ordinate of vertex 3 must also be used for the second. Then judge whether the horizontal spacing of the two digit characters in the same line satisfies a set second condition value, for example by comparing their abscissas: if the absolute difference between the corresponding abscissas is less than the second condition value, the two characters can be merged into one digit string. Other coordinate-based methods of deciding whether two digit characters should be merged can also be used in practice and are not enumerated here.
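One possible reading of the merging rule in S303 is sketched below. The function name `merge_digits`, the tuple layout `(text, x1, y1, x2, y2)`, and the choice to measure the gap from the right edge of the left character to the left edge of the right character are assumptions; the description allows other coordinate-based criteria.

```python
def merge_digits(digits, row_tol, gap_tol):
    """Merge digit characters into digit strings.

    digits:  list of (text, x1, y1, x2, y2) bounding boxes
    row_tol: first condition value (max ordinate difference within a line)
    gap_tol: second condition value (max horizontal gap between neighbors)
    """
    # Group characters into lines by comparing the same ordinate (y1).
    rows = []
    for d in sorted(digits, key=lambda c: c[2]):
        if rows and abs(d[2] - rows[-1][0][2]) < row_tol:
            rows[-1].append(d)
        else:
            rows.append([d])

    merged = []
    for row in rows:
        row.sort(key=lambda c: c[1])            # left to right
        text, x1, y1, x2, y2 = row[0]
        for t, a1, b1, a2, b2 in row[1:]:
            if abs(a1 - x2) < gap_tol:          # close enough: extend string
                text, x2 = text + t, a2
                y1, y2 = min(y1, b1), max(y2, b2)
            else:                               # too far: start a new string
                merged.append((text, x1, y1, x2, y2))
                text, x1, y1, x2, y2 = t, a1, b1, a2, b2
        merged.append((text, x1, y1, x2, y2))
    return merged


# Coordinates from the worked example in the description; row_tol is invented.
result = merge_digits(
    [("7", 421.05, 699.83, 425.76, 706.94),
     ("1", 416.74, 699.83, 419.47, 706.94)],
    row_tol=1.0, gap_tol=2.37,
)
```

The two characters share the same ordinates and are 1.58 units apart, so they merge into the single string "17" spanning both bounding boxes.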
S304: match the page-number information of the page-number item in the catalog entry against the merged valid information, and determine the logical page corresponding to the catalog entry from the matching result.
Specifically: the page number in the page-number item of the catalog entry is compared numerically with each merged digit string and, according to whether each merged digit string equals the page number, the first catalog entry confidence of each candidate page for the catalog entry is determined. The candidate page with the highest first catalog entry confidence can be chosen as the logical page corresponding to the catalog entry.
In a specific implementation, the same initial confidence X is first assigned to every candidate page for the page number in the page-number item of the catalog entry. The page number is then matched against each merged digit string in each candidate page. Each time a digit string matching the page number is found, Y is added to the confidence of that candidate page; each time a digit string that does not match the page number is found, E is subtracted from it. The result is the first confidence of that candidate page for the catalog entry. For example, if merging yields 5 digit strings in a candidate page whose initial confidence is X, and 1 digit string matches the page number in the catalog entry while 4 do not, the confidence of the candidate page is X+Y-4E. X, Y and E are positive real numbers.
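The scoring rule X + (matches)·Y - (mismatches)·E can be written as a small function. The name `page_number_confidence` is hypothetical; X = 50 and E = 6 are the values used in the worked example later in the description, while Y = 30 is an assumed value chosen only for illustration.

```python
def page_number_confidence(page_number, digit_strings, x, y, e):
    """First catalog entry confidence of one candidate page.

    Starts from the initial confidence x, adds y for every merged digit
    string equal to the page number, and subtracts e for every other one.
    """
    score = x
    for s in digit_strings:
        if s.isdecimal() and int(s) == page_number:
            score += y
        else:
            score -= e
    return score


# One non-matching string "17" against page number 20: 50 - 6
single = page_number_confidence(20, ["17"], x=50, y=30, e=6)

# Five strings, one match and four mismatches: X + Y - 4E
five = page_number_confidence(20, ["20", "3", "7", "11", "99"], x=50, y=30, e=6)
```

The candidate page with the highest resulting score would then be taken as the logical page for the entry.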
In the embodiments of the invention, the logical page corresponding to each catalog entry can also be determined according to the title item in the catalog entry. Fig. 5 shows the method, provided by an embodiment of the invention, for determining the logical page corresponding to each catalog entry according to the title item in the catalog entry. It comprises the following steps:
S501: arrange all characters in each page into lines according to their coordinates.
Specifically: within each candidate page, all characters are sorted. First, judge whether each pair of characters lies in the same line. Taking horizontal writing of the digital document catalog as an example, this can be judged by whether the vertical spacing of the two characters exceeds a preset spacing parameter h, where h is a positive real number: if the vertical spacing of the two characters is not greater than h, they are placed in the same line; otherwise they are not. Then, within each line, the characters are sorted in order of increasing abscissa. As shown in Fig. 4, the minimum bounding rectangle of the sorted line is (x_m, y_m, x_n, y_n), and the minimum bounding rectangle of every character in the line is contained in it. Here x_m is the abscissa of character 1, the leftmost character in the line, and can be taken from its top-left or bottom-left vertex; y_m is the ordinate of character 3, the topmost character in the line, and can be taken from its top-left or top-right vertex; x_n is the abscissa of character 4, the rightmost character in the line, and can be taken from its top-right or bottom-right vertex; y_n is the ordinate of character 2, the bottommost character in the line, and can be taken from its bottom-left or bottom-right vertex.
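The line arrangement in S501 might be sketched as follows. The names `group_rows` and `row_bbox`, the tuple layout `(text, x1, y1, x2, y2)`, and the choice to measure vertical spacing between the top ordinates of successive characters are assumptions made for this illustration; the sample characters are invented.

```python
def group_rows(chars, h):
    """Arrange characters into lines.

    chars: list of (text, x1, y1, x2, y2) bounding boxes
    h:     preset spacing parameter (max vertical distance within a line)
    """
    rows = []
    for ch in sorted(chars, key=lambda c: c[2]):        # top to bottom
        if rows and abs(ch[2] - rows[-1][-1][2]) <= h:  # close to previous char
            rows[-1].append(ch)
        else:
            rows.append([ch])
    for row in rows:
        row.sort(key=lambda c: c[1])                    # increasing abscissa
    return rows


def row_bbox(row):
    """Minimum bounding rectangle (x_m, y_m, x_n, y_n) of a line."""
    return (min(c[1] for c in row), min(c[2] for c in row),
            max(c[3] for c in row), max(c[4] for c in row))


rows = group_rows(
    [("B", 20, 100.5, 28, 112), ("A", 10, 100, 18, 110), ("C", 10, 130, 18, 140)],
    h=2,
)
lines = ["".join(c[0] for c in row) for row in rows]
```

In this sample, "A" and "B" differ by 0.5 vertically and fall into one line, while "C" starts a second line.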
S502: match the characters of each line against the title in the title item of the stored catalog entry.
The matching process comprises: using the Longest Common Subsequence (LCS) algorithm to measure the similarity between two character strings, namely the title in the catalog entry and the characters of each line; and, in addition, determining from at least one preset feature the total confidence that the line corresponds to the title item information.
The at least one preset feature comprises: the position of the line within its page of the digital document; or the average character width of the line compared with the average character width of the body text; or whether the character string matched by the LCS algorithm shares its line with other text characters. The total confidence that each line corresponds to the title item information can be determined from these features together with the LCS algorithm.
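The string comparison in S502 can be illustrated with the standard dynamic-programming computation of the longest common subsequence. Normalizing the LCS length by the title length to get a score between 0 and 1 is an assumption of this sketch, as are the function names and sample strings; the description does not say how the LCS result and the layout features are combined into the total confidence, so that step is not shown.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def title_similarity(title: str, line: str) -> float:
    """Fraction of the title's characters recovered, in order, from the line."""
    if not title:
        return 0.0
    return lcs_length(title, line) / len(title)


# Invented sample strings: the line contains the whole title after a prefix.
score_full = title_similarity("Overview of Topic", "Section 1 Overview of Topic")
score_none = title_similarity("abc", "xyz")
```

A line that contains the full title in order scores 1.0 regardless of extra surrounding characters, which is one reason the description adds layout features on top of the LCS result.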
S503: determine the logical page corresponding to the catalog entry according to the result of matching each line of each page against the title.
A total confidence is obtained from the degree of matching between each line and the title. The highest total confidence among the lines of a page is taken as that page's second catalog entry confidence for the catalog entry, and the logical page corresponding to each catalog entry is determined from the second catalog entry confidence of every page.
Determining the logical page from the title in the catalog entry is highly reliable, but it also reduces the efficiency of link creation. Page-number information and title information can therefore be combined to determine the logical page corresponding to each catalog entry. Specifically: the candidate pages among which the logical page lies are determined from the page-number information; page-number matching is performed in each candidate page to determine its first catalog entry confidence for the catalog entry; title matching is performed in each candidate page to determine its second catalog entry confidence for the catalog entry; and, from the first and second catalog entry confidences of each candidate page and the weight coefficients set for page-number matching and title matching, the total confidence of each candidate page for the catalog entry is determined, and with it the logical page corresponding to each catalog entry.
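A minimal sketch of the combination step follows. The description says only that weight coefficients are set for the two kinds of matching; the linear weighted sum used here, the function name `best_logical_page`, the weights, and the title confidences are all assumptions. The first confidences are the values listed for pages 22 to 28 in the worked example later in the description.

```python
def best_logical_page(first_conf, second_conf, w_page, w_title):
    """Combine page-number and title confidences per candidate page.

    first_conf / second_conf: dicts mapping candidate page -> confidence
    w_page / w_title:         weight coefficients (assumed linear weights)
    Returns (best page, dict of total confidences).
    """
    total = {
        page: w_page * first_conf[page] + w_title * second_conf.get(page, 0.0)
        for page in first_conf
    }
    best = max(total, key=total.get)
    return best, total


first = dict(zip(range(22, 29), [44, 44, 44, 80, 44, 44, 44]))
second = {25: 90.0, 26: 40.0}          # invented title confidences
best, total = best_logical_page(first, second, w_page=0.5, w_title=0.5)
```

With these inputs, page 25 has both the highest page-number confidence and the highest title confidence, so it is selected.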
As shown in Fig. 6A, the method for establishing links between a digital document catalog and the body text is explained below using "Psychological Health Education", published by Inner Mongolia University Press in 2006, as an example. The method comprises the following steps:
Step 601: read in the digital document and obtain the stored catalog entry information.
This digital document has 236 pages and 59 catalog entries. The catalog entry for Section 1 of Chapter 2, shown in Fig. 7, is used as the example for describing in detail how the link between a catalog entry and the body text is established. In this catalog entry the section-number item is "Section 1", the title item is "Self-Consciousness Overview", and the page-number item is "20".
Step 602: determine the logical page corresponding to each catalog entry according to at least one catalog item taken from the stored catalog entry information.
Step 603: according to the logical page determined for each catalog entry, establish the link between each catalog entry and its corresponding logical page.
Fig. 6B shows the method, in an embodiment of the invention, for determining the logical page corresponding to each catalog entry according to the page number in the page-number item and the title in the title item of the catalog entry. The process comprises:
Step 602a: according to the page number in the page-number item of the catalog entry, determine the candidate pages among which the logical page corresponding to the catalog entry lies.
The page number in the page-number item of this catalog entry is 20. According to the preset relation n+K-D≤N≤n+K+D between the candidate pages and the page number, where the total number of catalog pages K of this digital document is 5 and the range threshold parameter D is 3, the candidate pages for this catalog entry are the 22nd to the 28th pages.
Step 602b: extract valid information from each candidate page.
The stored type-area information for this digital document is: top boundary ordinate 80.73, left boundary abscissa 0, right boundary abscissa 485, and bottom boundary ordinate 697.10. In each candidate page, the characters outside the type area are identified from the coordinates of each character, and digit characters are extracted from them. As shown in Fig. 4, the coordinates of each character are taken from vertex 1 and vertex 3 of its minimum bounding rectangle. A character is considered to lie outside the type area when the ordinate of its vertex 1 is less than 80.73, or the ordinate of its vertex 3 is greater than 697.10, or the abscissa of its vertex 1 is less than 0, or the abscissa of its vertex 3 is greater than 485.
Digit characters are extracted from the characters outside the type area, and the extracted digit characters are merged. For example, suppose the extracted digit characters are "7" and "1", where the coordinates of "7" are (421.05, 699.83, 425.76, 706.94) and the coordinates of "1" are (416.74, 699.83, 419.47, 706.94). The extracted digit characters are sorted by their coordinates. The abscissa of vertex 1 of "1" is less than the abscissa of vertex 1 of "7", the abscissa of vertex 3 of "1" is less than the abscissa of vertex 3 of "7", and the corresponding ordinates of "1" and "7" are identical. The two digit characters are therefore in the same line, with "1" to the left of "7".
The difference between the abscissa of vertex 1 of "7" and the abscissa of vertex 3 of "1" is 1.58, and the set spacing threshold is a value between 2.37 and 4.71. Since 1.58 is below the threshold, the two digit characters can be merged into one digit string. The merged digit string is "17", with coordinates (416.74, 699.83, 425.76, 706.94).
Step 602c: The extracted effective information is matched against the page number information in the saved catalog entry to determine, for each candidate page, the first catalog entry confidence with respect to that catalog entry.
The merged digit string is compared with the page number information in the catalog entry. The page number in this catalog entry is 20 and the merged digit string is 17, so the two do not match. The confidence of the candidate page is therefore reduced by E. In this embodiment the initial confidence X is 50 and E is 6, so the confidence of this candidate page is 44.
Applying this method to each candidate page among natural pages 22 to 28 gives, after page number matching, first catalog entry confidences of 44, 44, 44, 80, 44, 44 and 44 respectively.
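A rough sketch of this page number scoring is given below. The initial confidence 50 and the penalty 6 come from the example. The bonus of 30 on a match is an assumption chosen only because it reproduces the value 80 in the example, and the digit strings assigned to each natural page are hypothetical.

```python
def page_number_confidence(found: str, expected: str,
                           initial: float = 50.0,
                           penalty: float = 6.0,
                           bonus: float = 30.0) -> float:
    """Score one candidate page by comparing its extracted digit string
    with the page number stored in the catalog entry."""
    if found == expected:
        return initial + bonus   # assumed reward for a match
    return initial - penalty     # X - E, as in the worked example


# Hypothetical digit strings extracted from natural pages 22 to 28.
extracted = {22: "17", 23: "18", 24: "19", 25: "20", 26: "21", 27: "22", 28: "23"}
confidences = [page_number_confidence(extracted[p], "20") for p in range(22, 29)]
print(confidences)
```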
Step 602d: All characters on each page are arranged into rows according to their coordinates.
To ensure that all characters on each page are arranged by row, the vertical distance between the horizontal central axes of characters must satisfy a condition during arrangement. This distance is obtained by averaging the y-coordinates of the top and bottom vertices of each character and then taking the difference between the averages of the two characters. In this embodiment, whether two characters A and B can be placed in the same row is decided as follows: compute the mean of the two y-coordinates of A and the difference between the larger and smaller y-coordinates of A; compute the same two quantities for B; then test whether the absolute difference between the two means is less than the smaller of the two y-coordinate differences multiplied by a parameter, that is:
$$\left| \frac{Y_1(A) + Y_2(A)}{2} - \frac{Y_1(B) + Y_2(B)}{2} \right| < \mathrm{MIN}\left[\, Y_2(A) - Y_1(A),\; Y_2(B) - Y_1(B) \,\right] \times j$$
Here MIN denotes the smaller of its two arguments, j is a positive number less than 1, $Y_1(A)$ and $Y_2(A)$ are the smaller and larger y-coordinates of character A, and $Y_1(B)$ and $Y_2(B)$ are the smaller and larger y-coordinates of character B. If the condition holds, A and B are placed in the same row; otherwise they are placed in different rows. The same test is then applied to B and the next character C to decide whether B and C belong to the same row, and so on until all characters on each page have been arranged. After arrangement, each row corresponds to a minimum bounding rectangle, as shown in Figure 4.
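The row test and the successive pairwise grouping can be sketched as follows. The box layout (x_left, y_low, x_right, y_high), the value j = 0.5 and the function names are assumptions made for illustration; the characters are assumed to arrive already in reading order.

```python
from typing import List, Tuple

# (x_left, y_low, x_right, y_high); the order is assumed for illustration.
Box = Tuple[float, float, float, float]


def same_row(a: Box, b: Box, j: float = 0.5) -> bool:
    """True if the vertical midlines of a and b differ by less than
    j times the smaller of the two character heights."""
    mid_a = (a[1] + a[3]) / 2
    mid_b = (b[1] + b[3]) / 2
    min_height = min(a[3] - a[1], b[3] - b[1])
    return abs(mid_a - mid_b) < min_height * j


def group_rows(boxes: List[Box], j: float = 0.5) -> List[List[Box]]:
    """Compare each character with the previous one and start a new row
    whenever the same-row test fails."""
    rows: List[List[Box]] = []
    for box in boxes:
        if rows and same_row(rows[-1][-1], box, j):
            rows[-1].append(box)
        else:
            rows.append([box])
    return rows


a = (416.74, 699.83, 419.47, 706.94)
b = (421.05, 699.83, 425.76, 706.94)
c = (100.00, 680.00, 105.00, 687.00)
print(group_rows([a, b, c]))
```

Here a and b share a midline and land in one row, while c sits about 20 units lower and starts a new row.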
Step 602e: Each row of each candidate page is matched against the title information in the catalog entry to determine the second catalog entry confidence of each candidate page with respect to that catalog entry.
Step 602f: The total confidence of each candidate page with respect to each catalog entry is determined from its first catalog entry confidence and its second catalog entry confidence, and the logical page corresponding to each catalog entry is determined from this total confidence.
The total confidence of each candidate page with respect to each catalog entry is determined from the first catalog entry confidence dPageVeri, the second catalog entry confidence dTitleVeri, the weight coefficient dPageWeight of the first catalog entry confidence and the weight coefficient dTitleWeight of the second catalog entry confidence. The two weight coefficients are both positive and sum to 1; for example, dPageWeight is 0.4 and dTitleWeight is 0.6. Using this method, the total confidence of the candidate page whose natural page is page 25, shown in Figure 8, is 94. The candidate page with the highest total confidence is selected as the logical page corresponding to the catalog entry. Alternatively, a total confidence threshold can be set, and a candidate page whose total confidence exceeds the threshold is taken as the logical page corresponding to the catalog entry.
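One plausible reading of this step is a weighted sum, sketched below. The text names the four quantities but does not spell out the combining formula, so the weighted sum is an assumption. The title confidences are hypothetical values, and the sketch does not attempt to reproduce the figure of 94.

```python
from typing import Dict, Tuple


def total_confidence(d_page_veri: float, d_title_veri: float,
                     d_page_weight: float = 0.4,
                     d_title_weight: float = 0.6) -> float:
    """Weighted sum of the two catalog entry confidences (assumed formula)."""
    return d_page_weight * d_page_veri + d_title_weight * d_title_veri


def pick_logical_page(page_conf: Dict[int, float],
                      title_conf: Dict[int, float]) -> Tuple[int, Dict[int, float]]:
    """Return the candidate page with the highest total confidence."""
    totals = {p: total_confidence(page_conf[p], title_conf[p]) for p in page_conf}
    best = max(totals, key=totals.get)
    return best, totals


# First catalog entry confidences from the example for natural pages 22 to 28.
page_conf = {22: 44, 23: 44, 24: 44, 25: 80, 26: 44, 27: 44, 28: 44}
# Hypothetical second catalog entry confidences, for illustration only.
title_conf = {22: 10, 23: 20, 24: 15, 25: 90, 26: 5, 27: 0, 28: 30}
best, totals = pick_logical_page(page_conf, title_conf)
print(best, totals[best])
```

A threshold variant would instead keep every page whose total exceeds a chosen value.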
The process of determining the second catalog entry confidence of each candidate page with respect to a catalog entry comprises the following.
Figure 8 shows the content of natural page 25 after arrangement, as provided by the embodiment of the invention. The LCS algorithm is used to match the characters of every row in a candidate page against the title characters in the catalog entry, and the matching result determines the first directory item information confidence of each row with respect to the title directory item information. The second directory item information confidence is determined as follows. The position of the characters of each row in the candidate page is determined, and the second confidence of the row is derived from this position. The average character width of the row is compared with the average character width of the body text to determine the third confidence of the row. Based on the string successfully matched by the LCS algorithm, it is determined whether that string shares its row with other text characters, which gives the fourth confidence of the row. The second, third and fourth confidences, together with the weight coefficient of each condition, determine the second directory item information confidence of the row with respect to the title directory item information. The first and second directory item information confidences then determine the composite directory item information confidence of the row. Among the composite directory item information confidences of all rows on a page, the maximum value is taken as the second catalog entry confidence of that candidate page with respect to the catalog entry.
When the LCS algorithm matches the characters of a row in a candidate page against the title in the catalog entry, its input is two strings: the title of the catalog entry and the row to be matched. It returns the longest common substring of the two strings, from which the similarity of the two strings, and hence the first directory item information confidence of the row, can be determined. For example, matching the second row of page 25, "Section 1 Overview of Self-Consciousness", against the catalog entry title "Overview of Self-Consciousness" outputs "Overview of Self-Consciousness". The result is identical to the catalog entry title, so the first directory item information confidence of this row is 100. The higher the first directory item information confidence of a row, the more similar the characters of the row are to the title of the catalog entry.
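The title matching can be illustrated with a standard dynamic-programming longest common substring. Reading "LCS" as longest common substring (rather than subsequence) follows the wording above but remains an assumption, as does scoring similarity as the matched length over the title length, scaled to 100. The English strings stand in for the original title text.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for k in range(1, len(b) + 1):
            if a[i - 1] == b[k - 1]:
                # Extend the common run that ended at a[i-2], b[k-2].
                curr[k] = prev[k - 1] + 1
                if curr[k] > best_len:
                    best_len, best_end = curr[k], i
        prev = curr
    return a[best_end - best_len:best_end]


def title_similarity(row_text: str, title: str) -> float:
    """Assumed scoring: matched length as a percentage of the title length."""
    if not title:
        return 0.0
    return 100.0 * len(longest_common_substring(row_text, title)) / len(title)


row = "Section 1 Overview of Self-Consciousness"
title = "Overview of Self-Consciousness"
print(longest_common_substring(row, title), title_similarity(row, title))
```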
The position of the characters of each row is also evaluated. Specifically, a first central axis is determined from the coordinates of the row's characters, a second central axis is determined from the left and right boundaries of the type area, and the second confidence of the row is determined from the absolute value of the horizontal distance between the two axes. The smaller this absolute value, the higher the second confidence of the row.
When determining the position of the characters of a row, the coordinates of the row can also be used to decide whether the row lies approximately in the middle position defined by the left and right boundaries of the type area. If the row lies in the middle, its length is computed from its coordinates and checked against a preset length condition, for example that the length is less than 80% of the distance between the right and left boundaries of the type area, and the second confidence of the row is determined by whether the condition is met. Alternatively, if the row lies on the left side of the type area, its length is checked against a preset length condition, for example that the length is less than 70% of the distance between the right and left boundaries of the type area, to determine the second confidence of the row.
To determine the approximate position of a row from its coordinates, for example the row coordinates shown in Figure 4, a first difference between $X_m$ of the row and the left boundary of the type area can be compared with a second difference between the right boundary of the type area and $X_n$. When the two differences differ by more than a preset difference threshold, the row is judged to lie toward one side of the type area: toward the right when the first difference is larger than the second, and toward the left when the first difference is smaller than the second. Many other methods are possible in practice, but any method that judges the position of a row's characters from coordinate differences on the basis of the idea of this embodiment falls within the scope of protection of the invention.
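The position test can be sketched as below. It assumes that X_m and X_n are the left and right x-coordinates of the row's bounding rectangle (Figure 4 is not reproduced here), and that a row whose two margins differ by no more than the threshold counts as centered, which the text implies but does not state. The function names and sample numbers are hypothetical.

```python
def row_position(x_m: float, x_n: float, left: float, right: float,
                 diff_threshold: float) -> str:
    """Classify a row as 'left', 'center' or 'right' within the type area."""
    first = x_m - left     # margin between left boundary and row start
    second = right - x_n   # margin between row end and right boundary
    if abs(first - second) <= diff_threshold:
        return "center"    # assumed outcome when the margins are similar
    return "right" if first > second else "left"


def length_ok(x_m: float, x_n: float, left: float, right: float,
              max_ratio: float) -> bool:
    """Check the row length against a fraction of the type-area width,
    e.g. 0.8 for centered rows or 0.7 for left-aligned rows."""
    return (x_n - x_m) < max_ratio * (right - left)


print(row_position(250, 350, 100, 500, 20))   # equal margins
print(row_position(110, 300, 100, 500, 20))   # small left margin
print(length_ok(250, 350, 100, 500, 0.8))
```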
The average character width of each row is also compared with the average character width of the body text of the digital document. For example, the average character width of the second row in Figure 8 is 13.77 and the average character width of the body text of this digital document is 10.29. Since the average character width of the second row is greater than that of the body text, the third confidence of the row is determined accordingly. The larger the average character width of a row, the higher its third confidence.
In addition, based on the string successfully matched by the LCS algorithm, it is determined whether the row also contains other text characters. Specifically, text characters whose coordinates are directly adjacent to the matched string are treated as other text characters present in the row. Text characters that are only indirectly connected to the matched string are not. For example, in "Section 2 Overview of Self-Consciousness" the matched string is "Overview of Self-Consciousness"; what adjoins this string in coordinates is a space, and "Section 2" is only indirectly connected to it, so the row containing the matched string is considered to contain no other text characters. The fourth confidence of the row is determined by whether the row containing the matched string contains other text characters.
The composite directory item information confidence of each row is determined from the row's first directory item information confidence, its second directory item information confidence (itself determined from the second, third and fourth confidences) and the weight coefficient of each confidence, where each weight coefficient is a positive number. The highest composite directory item information confidence on a candidate page is taken as that page's second catalog entry confidence with respect to the catalog entry.
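The combination of the per-row confidences might look like the sketch below. The text only says that weight coefficients are used, so the weighted sums, every weight value and the dictionary keys are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def row_feature_confidence(position: float, width: float, alone: float,
                           weights: Tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    """Second directory item information confidence of a row, combining the
    second (position), third (width) and fourth (alone in row) confidences."""
    w2, w3, w4 = weights
    return w2 * position + w3 * width + w4 * alone


def composite_row_confidence(lcs_conf: float, feature_conf: float,
                             w_first: float = 0.6, w_second: float = 0.4) -> float:
    """Composite confidence of a row from its LCS and feature confidences."""
    return w_first * lcs_conf + w_second * feature_conf


def page_title_confidence(rows: List[Dict[str, float]]) -> float:
    """Second catalog entry confidence of a page: the best row on that page."""
    return max(
        composite_row_confidence(
            r["lcs"], row_feature_confidence(r["position"], r["width"], r["alone"]))
        for r in rows
    )


rows = [
    {"lcs": 20, "position": 0, "width": 0, "alone": 0},        # ordinary body row
    {"lcs": 100, "position": 100, "width": 100, "alone": 100},  # heading-like row
]
print(page_title_confidence(rows))
```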
The method for establishing links between a digital document catalog and the body text provided by the embodiment of the invention obtains at least one piece of directory item information from the saved catalog entry information, matches it against the pages of the digital document, and determines from the matching result the logical page corresponding to each catalog entry, thereby establishing the link between each catalog entry and its logical page. Establishing these links automatically effectively improves the efficiency of linking the catalog of a digital document to its body text and, in turn, the efficiency of producing digital documents.
Moreover, in a digital document the main reason the natural page does not correspond to the logical page given in a catalog entry is the content that precedes the body text, such as the copyright page, the catalog pages, the preface, the foreword and appendices. Therefore, when establishing links between the catalog and the body text, it suffices to identify the number of pages occupied by this front matter; from this page count and the page number information in the saved catalog entries, the logical page corresponding to each catalog entry can also be determined, and the link between each catalog entry and its logical page established. Those skilled in the art can implement this according to the method provided by the embodiment of the invention, so the details are not repeated here.
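The front-matter shortcut amounts to a fixed page offset, as in the sketch below. It assumes natural pages are numbered from 1 and that printed page 1 immediately follows the front matter. The offset of 5 is inferred from the earlier example, in which the entry with page number 20 matched natural page 25. The second entry title and both function names are hypothetical.

```python
from typing import Dict


def target_page(printed_page: int, front_matter_pages: int) -> int:
    """Map the page number stored in a catalog entry to the natural page
    of the digital document, given the length of the front matter."""
    return printed_page + front_matter_pages


def build_links(entries: Dict[str, int], front_matter_pages: int) -> Dict[str, int]:
    """Link every catalog entry title to its target natural page."""
    return {title: target_page(page, front_matter_pages)
            for title, page in entries.items()}


entries = {
    "Overview of Self-Consciousness": 20,
    "A later section (hypothetical)": 31,
}
print(build_links(entries, front_matter_pages=5))
```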
As shown in Figure 9, the embodiment of the invention provides a device for establishing links between a digital document catalog and the body text, wherein the digital document catalog comprises a plurality of catalog entries and each catalog entry comprises at least one piece of directory item information. The device comprises:
A logical page identification module 90, configured to obtain at least one piece of directory item information from each saved catalog entry and, according to the at least one piece of directory item information, determine in the digital document the logical page corresponding to each catalog entry;
A link establishment module 91, configured to establish the link between each catalog entry and its corresponding logical page.
The logical page identification module 90 comprises:
A first recognition unit 901, configured to determine in the digital document the logical page corresponding to each catalog entry when the obtained directory item information is page-number directory item information.
The first recognition unit 901 comprises:
A first candidate page determination subunit 9010, configured to determine, according to a preset rule and the page-number directory item information of each catalog entry, the candidate pages in which the logical page corresponding to each catalog entry may lie;
A first matching subunit 9011, configured to extract effective information from each candidate page and compare whether each piece of effective information is identical to the page-number directory item information in the catalog entry;
A first calculation subunit 9012, configured to determine, according to whether each piece of effective information is identical to the page-number directory item information, the first confidence of each candidate page with respect to the catalog entry;
A first logical page determination subunit 9013, configured to determine the logical page corresponding to each catalog entry according to the first confidence.
The logical page identification module 90 comprises:
A second recognition unit 902, configured to determine in the digital document the logical page corresponding to each catalog entry when the obtained directory item information is title directory item information.
The second recognition unit 902 comprises:
A first row confidence determination subunit 9020, configured to determine, according to the similarity between the title directory item information of each catalog entry and the characters of each row on each page of the digital document, the first confidence of each row on that page with respect to the title directory item information;
A second row confidence determination subunit 9021, configured to determine, according to at least one piece of feature information of the characters of each row on that page of the digital document, the second confidence of each row on that page with respect to the title directory item information;
A second calculation subunit 9022, configured to determine, according to the total confidence of each row on that page of the digital document with respect to the title directory item information, the second confidence of that page with respect to the catalog entry;
A second logical page determination subunit 9023, configured to determine the logical page corresponding to each piece of directory item information according to the second confidence.
The at least one piece of feature information used by the second row confidence determination subunit 9021 comprises: position information of the characters of each row in the digital document, or average character width information of the characters of each row in the digital document, or information on whether the characters in the digital document that exactly match the title directory item information share a row with other text characters.
The logical page identification module 90 comprises:
A third recognition unit 903, configured to determine in the digital document the logical page corresponding to each catalog entry when the at least one piece of directory item information is page-number directory item information and title directory item information.
The third recognition unit 903 comprises:
A second candidate page determination subunit 9030, configured to determine, according to the page-number directory item information of each catalog entry, the candidate pages in which the logical page corresponding to each catalog entry may lie;
A first confidence determination subunit 9031, configured to determine the first confidence of each candidate page with respect to each catalog entry;
A second confidence determination subunit 9032, configured to determine the second confidence of each candidate page with respect to each catalog entry;
A total confidence determination subunit 9033, configured to determine, according to the first confidence and the second confidence, the total confidence of each candidate page with respect to each catalog entry;
A third logical page determination subunit 9034, configured to determine the logical page corresponding to each catalog entry according to the total confidence.
The first confidence determination subunit 9031 comprises:
A first matching submodule, configured to extract effective information from each candidate page and compare whether each piece of effective information is identical to the page-number directory item information in the catalog entry;
A first calculation submodule, configured to determine, according to whether each piece of effective information is identical to the page-number directory item information, the first catalog entry confidence of each candidate page with respect to the catalog entry.
The second confidence determination subunit 9032 comprises:
A first row confidence determination submodule, configured to determine, according to the similarity between the title directory item information of each catalog entry and the characters of each row in each candidate page, the first directory item information confidence of each row in the candidate page with respect to the title directory item information;
A second row confidence determination submodule, configured to determine, according to at least one piece of feature information of the characters of each row in the candidate page, the second directory item information confidence of each row in the candidate page with respect to the title directory item information;
A second calculation submodule, configured to determine, according to the total confidence of each row in the candidate page with respect to the title directory item information, the second catalog entry confidence of the candidate page with respect to the catalog entry.
Obviously, those skilled in the art can make various changes and variations to the invention without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to include them as well.

Claims (15)

1. A method for establishing links between a digital document catalog and the body text, wherein the digital document catalog comprises a plurality of catalog entries and each catalog entry comprises at least one piece of directory item information, characterized in that the method comprises:
Obtaining at least one piece of directory item information from each saved catalog entry and, according to the at least one piece of directory item information, determining in the digital document the logical page corresponding to each catalog entry;
Establishing the link between each catalog entry and its corresponding logical page.
2. The method of claim 1, characterized in that obtaining the at least one piece of directory item information comprises:
Obtaining page-number directory item information and/or title directory item information.
3. The method of claim 2, characterized in that, when the obtained directory item information is page-number directory item information, determining in the digital document the logical page corresponding to each catalog entry comprises:
Determining, according to a preset rule and the page-number directory item information of each catalog entry, the candidate pages in which the logical page corresponding to each catalog entry may lie;
Extracting effective information from each candidate page and comparing whether each piece of effective information is identical to the page-number directory item information in the catalog entry;
Determining, according to the result of comparing each piece of effective information with the page-number directory item information, the first catalog entry confidence of each candidate page with respect to the catalog entry;
Determining the logical page corresponding to each catalog entry according to the first catalog entry confidence.
4. The method of claim 2, characterized in that, when the obtained directory item information is title directory item information, determining in the digital document the logical page corresponding to each catalog entry comprises:
Determining, according to the similarity between the title directory item information in each catalog entry and the characters of each row on each page of the digital document, the first directory item information confidence of each row on that page with respect to the title directory item information;
Determining, according to at least one piece of feature information of the characters of each row on that page of the digital document, the second directory item information confidence of each row on that page with respect to the title directory item information;
Determining, according to the first directory item information confidence and the second directory item information confidence of each row on that page with respect to the title directory item information, the composite directory item information confidence of each row on that page with respect to the title directory item information;
Determining, according to the composite directory item information confidence of each row on that page with respect to the title directory item information, the second catalog entry confidence of that page of the digital document with respect to the catalog entry;
Determining the logical page corresponding to each piece of directory item information according to the second catalog entry confidence.
5. The method of claim 4, characterized in that the at least one piece of feature information comprises:
Position information of the characters of each row in the digital document, or average character width information of the characters of each row in the digital document, or information on whether the characters in the digital document that exactly match the title directory item information share a row with other text characters.
6. The method of claim 2, characterized in that, when the at least one piece of directory item information is page-number directory item information and title directory item information, determining in the digital document the logical page corresponding to each catalog entry comprises:
Determining, according to the page-number directory item information in each catalog entry, the candidate pages in which the logical page corresponding to each catalog entry may lie;
Extracting effective information from each candidate page, comparing whether each piece of effective information is identical to the page-number directory item information in the catalog entry, and determining, according to the result of comparing each piece of effective information with the page-number directory item information, the first catalog entry confidence of each candidate page with respect to the catalog entry; and
Determining, according to the similarity between the title directory item information in each catalog entry and the characters of each row in each candidate page, the first directory item information confidence of each row in the candidate page with respect to the title directory item information; determining, according to at least one piece of feature information of the characters of each row in the candidate page, the second directory item information confidence of each row in the candidate page with respect to the title directory item information; determining, according to the first directory item information confidence and the second directory item information confidence of each row in the candidate page, the composite directory item information confidence of each row in the candidate page with respect to the title directory item information; and determining, according to the composite directory item information confidence of each row on that page of the digital document with respect to the title directory item information, the second catalog entry confidence of that page of the digital document with respect to the catalog entry;
Determining, according to the first catalog entry confidence and the second catalog entry confidence, the total confidence of each candidate page with respect to each catalog entry;
Determining the logical page corresponding to each catalog entry according to the total confidence.
7. A device for establishing links between a digital document catalog and the body text, wherein the digital document catalog comprises a plurality of catalog entries and each catalog entry comprises at least one piece of directory item information, characterized in that the device comprises:
A logical page identification module, configured to obtain at least one piece of directory item information from each saved catalog entry and, according to the at least one piece of directory item information, determine in the digital document the logical page corresponding to each catalog entry;
A link establishment module, configured to establish the link between each catalog entry and its corresponding logical page.
8, device as claimed in claim 7 is characterized in that, described logical page (LPAGE) identification module comprises:
First recognition unit is used for when at least one the directory entry information that obtains is page number directory entry information, determines each logical page (LPAGE) of each catalogue entry correspondence in digital document.
9. The device as claimed in claim 8, characterized in that the first recognition unit comprises:
A first candidate page determination subunit, configured to determine, according to a preset rule and the page number catalog item information of each catalog entry, the candidate pages in which the logical page corresponding to each catalog entry is located;
A first matching subunit, configured to extract effective information from each candidate page and compare whether each piece of effective information is identical to the page number catalog item information in the catalog entry;
A first calculation subunit, configured to determine, according to whether each piece of effective information is identical to the page number catalog item information, a first catalog entry confidence of each candidate page with respect to the catalog entry;
A first logical page determination subunit, configured to determine each logical page corresponding to each catalog entry according to the first catalog entry confidence.
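The page-number path described in this claim can be sketched as below. The claim leaves the "preset rule" and the notion of "effective information" open, so this sketch assumes a search window around an estimated offset and treats standalone numbers on the first and last line of a page as effective information; the function names and the match-ratio confidence are likewise assumptions.

```python
import re


def candidate_pages(printed_page, offset_guess, window, page_count):
    """Assumed preset rule: look near printed_page + offset_guess."""
    center = printed_page + offset_guess
    lo = max(0, center - window)
    hi = min(page_count - 1, center + window)
    return list(range(lo, hi + 1))


def effective_info(page_lines):
    """Assumed effective information: numbers on the page's edge lines."""
    edges = page_lines[:1] + (page_lines[-1:] if len(page_lines) > 1 else [])
    return [int(tok) for line in edges for tok in re.findall(r"\b\d+\b", line)]


def first_confidence(page_lines, printed_page):
    """Fraction of effective-information items equal to the printed page number."""
    infos = effective_info(page_lines)
    if not infos:
        return 0.0
    return sum(1 for value in infos if value == printed_page) / len(infos)


# Hypothetical four-page document; each page is a list of text lines.
pages = [
    ["Contents"],
    ["Chapter 1 Overview", "body text", "1"],
    ["more text about step 1", "2"],
    ["Chapter 2 Methods", "text of the chapter", "3"],
]
cands = candidate_pages(printed_page=3, offset_guess=0, window=1, page_count=len(pages))
scores = {p: first_confidence(pages[p], 3) for p in cands}
best = max(cands, key=lambda p: scores[p])
```

Restricting the comparison to candidate pages keeps the search cheap, and the ratio form lets a page with stray numbers in its header score lower than one whose only number is the page number.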
10. The device as claimed in claim 7, characterized in that the logical page identification module comprises:
A second recognition unit, configured to determine, when the obtained at least one piece of catalog item information is title catalog item information, each logical page in the digital document that corresponds to each catalog entry.
11. The device as claimed in claim 10, characterized in that the second recognition unit comprises:
A first line confidence determination subunit, configured to determine, according to the similarity between the title catalog item information of each catalog entry and each line of text in each page of the digital document, a first catalog item information confidence of each line of text in that page with respect to the title catalog item information;
A second line confidence determination subunit, configured to determine, according to at least one feature of each line of text in that page of the digital document, a second catalog item information confidence of each line of text in that page with respect to the title catalog item information;
A second calculation subunit, configured to determine, according to the total confidence of each line of text in that page with respect to the title catalog item information, a second catalog entry confidence of that page with respect to the catalog entry;
A second logical page determination subunit, configured to determine, according to the second catalog entry confidence, each logical page corresponding to each piece of catalog item information.
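The title path described in this claim can be sketched as below. The claim names neither the similarity measure nor the line features, so this sketch assumes `difflib.SequenceMatcher` for similarity, font size and line length as features, a weighted sum for the per-line total, and the maximum over lines for the page-level confidence. All names and weights are hypothetical.

```python
from difflib import SequenceMatcher


def similarity_confidence(title, line_text):
    """First line-level confidence: string similarity in [0, 1]."""
    return SequenceMatcher(None, title.lower(), line_text.lower()).ratio()


def feature_confidence(line, body_font_size):
    """Second line-level confidence from assumed layout features."""
    score = 0.0
    if line["font_size"] > body_font_size:  # headings tend to be larger
        score += 0.5
    if len(line["text"]) <= 60:  # headings tend to be short
        score += 0.5
    return score


def page_confidence(title, lines, body_font_size, w_sim=0.7):
    """Page-level confidence: best per-line total on the page."""
    best = 0.0
    for line in lines:
        total = (w_sim * similarity_confidence(title, line["text"])
                 + (1.0 - w_sim) * feature_confidence(line, body_font_size))
        best = max(best, total)
    return best


# Hypothetical pages: one carries the heading, one only mentions it in running text.
title = "Chapter 2 Methods"
page_a = [
    {"text": "Chapter 2 Methods", "font_size": 16},
    {"text": "This chapter describes the methods used in the study in detail.", "font_size": 10},
]
page_b = [
    {"text": "as discussed in chapter 2 methods were refined over several iterations of the work", "font_size": 10},
]
conf_a = page_confidence(title, page_a, body_font_size=10)
conf_b = page_confidence(title, page_b, body_font_size=10)
```

The feature term is what separates a real heading from a body line that merely quotes the title, which is presumably why the claim uses two line-level confidences instead of similarity alone.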
12. The device as claimed in claim 7, characterized in that the logical page identification module comprises:
A third recognition unit, configured to determine, when the at least one piece of catalog item information is page number catalog item information and title catalog item information, each logical page in the digital document that corresponds to each catalog entry.
13. The device as claimed in claim 12, characterized in that the third recognition unit comprises:
A second candidate page determination unit, configured to determine, according to the page number catalog item information of each catalog entry, the candidate pages in which the logical page corresponding to each catalog entry is located;
A first confidence determination subunit, configured to determine the first catalog entry confidence of each candidate page with respect to each catalog entry;
A second confidence determination subunit, configured to determine the second catalog entry confidence of each candidate page with respect to each catalog entry;
A total confidence determination subunit, configured to determine the total confidence of each candidate page with respect to each catalog entry according to the first catalog entry confidence and the second catalog entry confidence;
A third logical page determination subunit, configured to determine the logical page corresponding to each catalog entry according to the total confidence.
14. The device as claimed in claim 13, characterized in that the first confidence determination subunit comprises:
A first matching submodule, configured to extract effective information from each candidate page and compare whether each piece of effective information is identical to the page number catalog item information in the catalog entry;
A first calculation submodule, configured to determine, according to whether each piece of effective information is identical to the page number catalog item information, the first catalog entry confidence of each candidate page with respect to the catalog entry.
15. The device as claimed in claim 13, characterized in that the second confidence determination subunit comprises:
A first line confidence determination submodule, configured to determine, according to the similarity between the title catalog item information of each catalog entry and each line of text in each candidate page, a first catalog item information confidence of each line of text in that candidate page with respect to the title catalog item information;
A second line confidence determination submodule, configured to determine, according to at least one feature of each line of text in that candidate page, a second catalog item information confidence of each line of text in that candidate page with respect to the title catalog item information;
A second calculation submodule, configured to determine, according to the total confidence of each line of text in that candidate page with respect to the title catalog item information, the second catalog entry confidence of that candidate page with respect to the catalog entry.
CN2008102227847A 2008-09-24 2008-09-24 Method and apparatus for establishing links between digital document catalog and text Expired - Fee Related CN101354727B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008102227847A CN101354727B (en) 2008-09-24 2008-09-24 Method and apparatus for establishing links between digital document catalog and text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2008102227847A CN101354727B (en) 2008-09-24 2008-09-24 Method and apparatus for establishing links between digital document catalog and text

Publications (2)

Publication Number Publication Date
CN101354727A true CN101354727A (en) 2009-01-28
CN101354727B CN101354727B (en) 2011-06-29

Family

ID=40307534

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008102227847A Expired - Fee Related CN101354727B (en) 2008-09-24 2008-09-24 Method and apparatus for establishing links between digital document catalog and text

Country Status (1)

Country Link
CN (1) CN101354727B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102385474A (en) * 2010-09-06 2012-03-21 北大方正集团有限公司 Data output method, device and system
WO2012151887A1 (en) * 2011-07-26 2012-11-15 中兴通讯股份有限公司 Keyboard terminal and location method for electronic document thereof
WO2013083067A1 (en) * 2011-12-09 2013-06-13 北大方正集团有限公司 Method and device for acquiring structured information in layout file
CN103176956A (en) * 2011-12-21 2013-06-26 北大方正集团有限公司 Method and device for extracting file structure
CN103714101A (en) * 2012-10-04 2014-04-09 富士施乐株式会社 Information processing apparatus and information processing method
US9256352B2 (en) 2011-07-26 2016-02-09 Zte Corporation Touch screen terminal and method for locating electronic document thereof
TWI549003B (en) * 2014-08-18 2016-09-11 葆光資訊有限公司 Method for automatic sections division
CN107291682A (en) * 2016-03-30 2017-10-24 同方知网(北京)技术有限公司 It is a kind of to divide piece algorithm based on many electronic documents for redirecting processing and twin check
CN110032920A (en) * 2018-11-27 2019-07-19 阿里巴巴集团控股有限公司 Text region matching process, equipment and device
CN110516974A (en) * 2019-08-30 2019-11-29 贵州大学 Based on the matched case quality appraisal procedure of evidence
CN111144069A (en) * 2019-12-30 2020-05-12 北大方正集团有限公司 Table-based directory typesetting method and device and storage medium
CN112487759A (en) * 2019-08-23 2021-03-12 珠海金山办公软件有限公司 Document page number setting method and device, electronic equipment and storage medium
CN112559676A (en) * 2019-09-25 2021-03-26 北京新唐思创教育科技有限公司 Similar topic retrieval method and device and computer storage medium

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102385474A (en) * 2010-09-06 2012-03-21 北大方正集团有限公司 Data output method, device and system
CN102385474B (en) * 2010-09-06 2014-06-04 北大方正集团有限公司 Data output method, device and system
US9448984B2 (en) 2011-07-26 2016-09-20 Zte Corporation Keyboard type terminal and location method for electronic document therein
WO2012151887A1 (en) * 2011-07-26 2012-11-15 中兴通讯股份有限公司 Keyboard terminal and location method for electronic document thereof
CN102902679A (en) * 2011-07-26 2013-01-30 中兴通讯股份有限公司 Keyboard terminal and method for locating E-documents in keyboard terminal
US9256352B2 (en) 2011-07-26 2016-02-09 Zte Corporation Touch screen terminal and method for locating electronic document thereof
CN102902679B (en) * 2011-07-26 2017-05-24 中兴通讯股份有限公司 Keyboard terminal and method for locating E-documents in keyboard terminal
WO2013083067A1 (en) * 2011-12-09 2013-06-13 北大方正集团有限公司 Method and device for acquiring structured information in layout file
CN103164388A (en) * 2011-12-09 2013-06-19 北大方正集团有限公司 Method and device for obtaining structuring information in layout files
US9773009B2 (en) 2011-12-09 2017-09-26 Beijing Founder Apabi Technology Limited Methods and apparatus for obtaining structured information in fixed layout documents
CN103164388B (en) * 2011-12-09 2016-07-06 北大方正集团有限公司 In a kind of layout files structured message obtain method and device
CN103176956A (en) * 2011-12-21 2013-06-26 北大方正集团有限公司 Method and device for extracting file structure
US9418051B2 (en) 2011-12-21 2016-08-16 Peking University Founder Group Co., Ltd. Methods and devices for extracting document structure
CN103176956B (en) * 2011-12-21 2016-08-03 北大方正集团有限公司 For the method and apparatus extracting file structure
CN103714101A (en) * 2012-10-04 2014-04-09 富士施乐株式会社 Information processing apparatus and information processing method
TWI549003B (en) * 2014-08-18 2016-09-11 葆光資訊有限公司 Method for automatic sections division
CN107291682A (en) * 2016-03-30 2017-10-24 同方知网(北京)技术有限公司 It is a kind of to divide piece algorithm based on many electronic documents for redirecting processing and twin check
CN110032920A (en) * 2018-11-27 2019-07-19 阿里巴巴集团控股有限公司 Text region matching process, equipment and device
CN112487759A (en) * 2019-08-23 2021-03-12 珠海金山办公软件有限公司 Document page number setting method and device, electronic equipment and storage medium
CN110516974A (en) * 2019-08-30 2019-11-29 贵州大学 Based on the matched case quality appraisal procedure of evidence
CN112559676A (en) * 2019-09-25 2021-03-26 北京新唐思创教育科技有限公司 Similar topic retrieval method and device and computer storage medium
CN112559676B (en) * 2019-09-25 2022-05-17 北京新唐思创教育科技有限公司 Similar topic retrieval method and device and computer storage medium
CN111144069A (en) * 2019-12-30 2020-05-12 北大方正集团有限公司 Table-based directory typesetting method and device and storage medium

Also Published As

Publication number Publication date
CN101354727B (en) 2011-06-29

Similar Documents

Publication Publication Date Title
CN101354727B (en) Method and apparatus for establishing links between digital document catalog and text
JP3427692B2 (en) Character recognition method and character recognition device
CN108470021A (en) The localization method and device of table in PDF document
CN104598577B (en) A kind of extracting method of Web page text
US20090144277A1 (en) Electronic table of contents entry classification and labeling scheme
CN101329731A (en) Automatic recognition method of mathematical formula in image
CN101458680B (en) Method and apparatus capable of auto identifying digital document catalog
Bai et al. Keyword spotting in document images through word shape coding
CN102375807A (en) Method and device for proofing characters
CN110704570A (en) Continuous page layout document structured information extraction method
CN112508011A (en) OCR (optical character recognition) method and device based on neural network
CN103995904A (en) Recognition system for image file electronic data
CN107291682A (en) It is a kind of to divide piece algorithm based on many electronic documents for redirecting processing and twin check
CN109492177A (en) A kind of web page release method based on web page semantics structure
CN111539417B (en) Text recognition training optimization method based on deep neural network
CN113962201A (en) Document structuralization and extraction method for documents
JPS63182793A (en) Character segmenting system
US8744171B1 (en) Text script and orientation recognition
JP2005216203A (en) Table format data processing method and table format data processing apparatus
CN110688825A (en) Method for extracting information of table containing lines in layout document
CN102467664B (en) Method and device for assisting with optical character recognition
Rashtehroudi et al. PESTD: a large-scale Persian-English scene text dataset
CN113283231B (en) Method for acquiring signature bit, setting system, signature system and storage medium
CN116912867B (en) Teaching material structure extraction method and device combining automatic labeling and recall completion
Arias et al. Information extraction from telephone company drawings

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220621

Address after: 100871 No. 5, the Summer Palace Road, Beijing, Haidian District

Patentee after: Peking University

Patentee after: New founder holdings development Co.,Ltd.

Patentee after: FOUNDER APABI TECHNOLOGY Ltd.

Address before: 100871 No. 5, the Summer Palace Road, Beijing, Haidian District

Patentee before: Peking University

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: FOUNDER APABI TECHNOLOGY Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110629