Embodiment
To establish links between the catalog of a digital document and its body text automatically, and to improve the efficiency of establishing those links, an embodiment of the present invention provides, as shown in Figure 1, a method for establishing links between a digital document catalog and the body text. The digital document catalog comprises a plurality of catalog entries, each catalog entry comprises at least one item of information, and the method comprises the following steps:
S101: Obtain at least one item of information from each saved catalog entry and, according to that information, determine in the digital document the logical page corresponding to each catalog entry.
The at least one item of information obtained comprises page-number item information and/or title item information.
S102: Establish a link between each catalog entry and its corresponding logical page.
The embodiments of the invention are described in detail below with reference to the accompanying drawings.
The digital document used in the embodiments of the present invention can be read page by page. The characters on each page can be obtained, together with the coordinate information of each character on its page, and the font information of the text, such as font type and font size, can be identified.
Figure 2 shows the method for establishing links between a digital document catalog and the body text in an embodiment of the present invention, which comprises the following steps:
S201: Read in the digital document and obtain the information of each saved catalog entry.
The saved catalog entries are derived from the recognized catalog of the digital document, with each line of the catalog treated as one catalog entry. A catalog entry comprises: a chapter-number item, which is the chapter or section number represented by the catalog line, for example "Chapter 2"; or a title item, which is the heading represented by the catalog line, that is, the text after the chapter number and before the page number; or a page-number item, which is the natural page on which the chapter or section of the catalog line is located.
S202: Determine the logical page corresponding to each catalog entry according to at least one item of information in that catalog entry.
The at least one item of information in each catalog entry comprises the page-number item information, the title item information, or a combination of both.
S203: Establish a link between each catalog entry and its corresponding logical page.
Figure 3 shows a method, provided by an embodiment of the present invention, for determining the logical page corresponding to each catalog entry according to the page-number item information in the catalog entry. The method comprises the following steps:
S301: According to the page number in the page-number item of each catalog entry, determine the candidate pages in which the logical page corresponding to that page number may lie.
The candidate pages are determined from a preset relation between candidate pages and the page number. Specifically, the candidate pages for a catalog entry are determined from the page number in its page-number item, the total number of catalog pages in the digital document, and a preset range threshold parameter.
When the page number in the page-number item is n, the total number of catalog pages in the digital document is K, and the preset range threshold parameter is D, a candidate page N for the logical page corresponding to page number n satisfies n+K-D ≤ N ≤ n+K+D. The range threshold parameter D can be set flexibly as needed in practice; a suitable value improves the efficiency of link establishment while still meeting the accuracy requirement.
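The candidate range can be illustrated with a minimal Python sketch. The function name `candidate_pages` and the clamping of the lower bound to page 1 are assumptions made for illustration; they are not part of the described method.

```python
def candidate_pages(n: int, k: int, d: int) -> list[int]:
    """Return the natural pages N with n + k - d <= N <= n + k + d.

    n: page number in the page-number item
    k: total number of catalog pages in the document
    d: range threshold parameter
    """
    # Clamping to page 1 is an illustrative safeguard, not stated in the text.
    low = max(1, n + k - d)
    high = n + k + d
    return list(range(low, high + 1))
```

With n = 20, K = 5, and D = 3, the sketch yields pages 22 through 28.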
S302: Extract effective information from each candidate page.
Specifically, according to the saved type-area information and the coordinates of each character in each candidate page, determine the characters located outside the type area and extract the numeric characters from them. In other words, determine the characters in the header and footer outside the type area and extract the numeric characters among them. The saved type-area information comprises the upper, lower, left, and right boundary lines of the type area.
The coordinates of each character are determined from the minimum bounding rectangle of the character and are expressed by the coordinates of two diagonally opposite vertices of that rectangle. As shown in Figure 4, the coordinates of the character "order" can be expressed by the coordinates of vertices 1 and 3, or by those of vertices 2 and 4. For example, when vertices 1 and 3 are used, the coordinates of the character are expressed as $(x_1, y_1, x_2, y_2)$, where $x_1$ is the abscissa of vertex 1, that is, the distance from vertex 1 to the y axis; $y_1$ is the ordinate of vertex 1, that is, the distance from vertex 1 to the x axis; $x_2$ is the abscissa of vertex 3, that is, the distance from vertex 3 to the y axis; and $y_2$ is the ordinate of vertex 3, that is, the distance from vertex 3 to the x axis.
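The extraction in S302 can be sketched as follows. This is a minimal illustration that assumes vertex 1 is the upper-left vertex, vertex 3 is the lower-right vertex, and ordinates grow downward; the names `Char`, `TypeArea`, `is_outside`, and `margin_digits` are hypothetical.

```python
from typing import NamedTuple


class Char(NamedTuple):
    text: str
    x1: float  # abscissa of vertex 1 (assumed upper-left)
    y1: float  # ordinate of vertex 1
    x2: float  # abscissa of vertex 3 (assumed lower-right)
    y2: float  # ordinate of vertex 3


class TypeArea(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def is_outside(c: Char, area: TypeArea) -> bool:
    """True when any edge of the character box crosses a type-area boundary."""
    return (c.y1 < area.top or c.y2 > area.bottom
            or c.x1 < area.left or c.x2 > area.right)


def margin_digits(chars: list[Char], area: TypeArea) -> list[Char]:
    """Keep only numeric characters found in the header/footer region."""
    return [c for c in chars if c.text.isdigit() and is_outside(c, area)]
```

A digit printed below the lower boundary line is kept, while digits and letters inside the type area are ignored.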
S303: Merge the extracted effective information.
For the extracted numeric characters, judge whether the spacing between each pair of numeric characters exceeds a preset spacing threshold. When the spacing between two numeric characters does not exceed the threshold, merge them into one digit string; otherwise, treat them as two independent digit strings.
Whether the spacing between two numeric characters exceeds the preset spacing threshold can be judged from their coordinates, which are determined as shown in Figure 4. First, judge whether the two characters can be regarded as being on the same line by comparing their ordinates: when the absolute value of the difference between the corresponding ordinates of the two numeric characters is less than a preset first condition value, the two characters are on the same line; otherwise they are on different lines. "Corresponding" means that when the ordinate of vertex 3 of one numeric character is used, the ordinate of vertex 3 of the other must also be used. Then judge whether the horizontal spacing between the two characters on the same line satisfies a preset second condition value, for example by comparing their abscissas: when the absolute value of the difference between the abscissas of the two characters is less than the second condition value, the two characters can be merged into one digit string. Other coordinate-based methods can of course be used in practice to decide whether two numeric characters are merged into one digit string; they are not enumerated here.
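One possible reading of the merging step is sketched below. It assumes each digit is a tuple `(text, x1, y1, x2, y2)`, compares vertex-3 ordinates for the same-line test, and measures horizontal spacing from the right edge of the string built so far to the left edge of the next digit. The function name and the parameters `row_tol` (first condition value) and `gap_tol` (second condition value) are illustrative.

```python
def merge_digits(digits, row_tol, gap_tol):
    """Merge neighbouring digits on the same line into digit strings.

    digits: iterable of (text, x1, y1, x2, y2)
    Returns a list of (text, x1, y1, x2, y2) for the merged strings.
    """
    strings = []  # each item is a mutable [text, x1, y1, x2, y2]
    # Process digits left to right so strings grow in reading order.
    for text, x1, y1, x2, y2 in sorted(digits, key=lambda d: d[1]):
        for s in strings:
            same_row = abs(s[4] - y2) < row_tol  # vertex-3 ordinates
            close = abs(x1 - s[3]) < gap_tol     # gap to string's right edge
            if same_row and close:
                s[0] += text
                s[2] = min(s[2], y1)
                s[3] = max(s[3], x2)
                s[4] = max(s[4], y2)
                break
        else:
            # No existing string is close enough: start a new one.
            strings.append([text, x1, y1, x2, y2])
    return [tuple(s) for s in strings]
```

The bounding box of a merged string is the union of the boxes of its digits, so it can be compared or displayed like a single character.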
S304: Match the page number in the page-number item of the catalog entry against the merged effective information, and determine the logical page corresponding to the catalog entry according to the matching result.
Specifically, compare the page number in the page-number item of the catalog entry with each merged digit string and, according to whether each merged digit string is identical to the page number, determine the first catalog-entry confidence of each candidate page for the catalog entry. The candidate page with the highest first catalog-entry confidence can be chosen as the logical page corresponding to the catalog entry.
In a specific implementation, first assign the same initial confidence X to every candidate page for the page number in the page-number item of the catalog entry. Then match that page number against each merged digit string in each candidate page. Each time a digit string matching the page number is found, add Y to the confidence of the candidate page; each time a digit string not matching the page number is found, subtract E from it. The result is the first confidence of the candidate page for the catalog entry. For example, if merging yields 5 digit strings on a candidate page whose initial confidence is X, and one digit string matches the page number in the catalog entry while the other 4 do not, the confidence of the candidate page is X+Y-4E. Here X, Y, and E are positive real numbers.
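The scoring rule can be sketched in a few lines of Python. The function name is hypothetical, and comparing the digit strings as text (rather than as integers) is an assumption made for this illustration.

```python
def page_number_confidence(page_number, digit_strings, x, y, e):
    """First confidence of one candidate page for one catalog entry.

    Starts from the initial confidence x, adds y for every digit string
    equal to the page number, and subtracts e for every other string.
    """
    target = str(page_number)
    score = x
    for s in digit_strings:
        score += y if s == target else -e
    return score
```

For a page with one matching string and four non-matching strings the sketch returns X+Y-4E, in line with the example above.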
In the embodiments of the present invention, the logical page corresponding to each catalog entry can also be determined according to the title item in the catalog entry. Figure 5 shows such a method provided by an embodiment of the present invention, which comprises the following steps:
S501: Arrange all the characters on each page into lines according to their coordinates.
Specifically, sort all the characters on each candidate page. First judge whether each pair of characters is on the same line. Taking a horizontal writing direction as an example, this can be judged by whether the vertical spacing between the two characters exceeds a preset spacing parameter h, where h is a positive real number: when the vertical spacing between two characters is not greater than h, the two characters are placed on the same line; otherwise they are not. Then, within each line, sort the characters in order of increasing abscissa. As shown in Figure 4, the minimum bounding rectangle of a line obtained after sorting is $(x_m, y_m, x_n, y_n)$, and it contains the minimum bounding rectangles of all the characters in the line. Here $x_m$ is the abscissa of the leftmost character 1 in the line, which can be the abscissa of the upper-left or lower-left vertex of that character; $y_m$ is the ordinate of the topmost character 3 in the line, which can be the ordinate of the upper-left or upper-right vertex of that character; $x_n$ is the abscissa of the rightmost character 4 in the line, which can be the abscissa of the upper-right or lower-right vertex of that character; and $y_n$ is the ordinate of the bottommost character 2 in the line, which can be the ordinate of the lower-left or lower-right vertex of that character.
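A minimal sketch of S501 follows. It assumes characters are tuples `(text, x1, y1, x2, y2)`, that ordinates grow downward, and that the vertical spacing is measured between the top ordinates of consecutive characters; `group_rows` and `row_box` are hypothetical names.

```python
def group_rows(chars, h):
    """Split characters into lines, then order each line left to right."""
    rows = []
    # Visit characters from top to bottom (assumes y grows downward).
    for c in sorted(chars, key=lambda c: c[2]):
        if rows and abs(c[2] - rows[-1][-1][2]) <= h:
            rows[-1].append(c)   # vertical spacing within h: same line
        else:
            rows.append([c])     # otherwise start a new line
    return [sorted(r, key=lambda c: c[1]) for r in rows]


def row_box(row):
    """Minimum bounding rectangle (x_m, y_m, x_n, y_n) of one line."""
    return (min(c[1] for c in row), min(c[2] for c in row),
            max(c[3] for c in row), max(c[4] for c in row))
```

The rectangle returned by `row_box` contains the bounding rectangle of every character in the line, which matches the property stated for $(x_m, y_m, x_n, y_n)$.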
S502: Match the characters of each line against the heading in the title item of the saved catalog entry.
The matching process comprises: matching the similarity between character strings using the LCS (Longest Common Subsequence) algorithm, where the character strings are the title in the catalog entry and the characters of each line; and determining, according to at least one preset feature, the total confidence that the line corresponds to the title item.
The at least one preset feature comprises: the position of the line's characters on the page of the digital document; or the average character width of the line compared with the average character width of the body text; or whether the character string matched by the LCS algorithm shares its line with other text characters. The total confidence that each line corresponds to the title item can be determined from the at least one feature above together with the LCS algorithm.
S503: Determine the logical page corresponding to the catalog entry according to the result of matching each line on each page against the heading.
A total confidence is obtained from the degree of matching between each line and the heading. The highest total confidence among the lines of a page is taken as that page's second catalog-entry confidence for the catalog entry, and the logical page corresponding to each catalog entry is determined from the second catalog-entry confidence of each page.
Determining the logical page of each catalog entry from the heading in the catalog entry is highly reliable, but it can also reduce the efficiency of establishing links between the catalog and the body text. The page number and the heading can therefore be combined to determine the logical page corresponding to each catalog entry. Specifically: determine, according to the page number, the candidate pages in which the logical page corresponding to that page number may lie; match the page number in each candidate page to determine the first catalog-entry confidence of each candidate page for the catalog entry; match the heading in each candidate page to determine the second catalog-entry confidence of each candidate page for the catalog entry; then, from the first and second catalog-entry confidences of each candidate page and the preset weight coefficients for page-number matching and heading matching, determine the total confidence of each candidate page for the catalog entry, and thereby determine the logical page corresponding to each catalog entry.
As shown in Figure 6A, taking "Psychological Health Education", published by Inner Mongolia University Press in 2006, as an example, the method for establishing links between a digital document catalog and the body text in an embodiment of the present invention is described below. It comprises the following steps:
Step 601: Read in the digital document and obtain the saved catalog entry information.
The digital document has 236 pages and 59 catalog entries. The catalog entry for Section 1 of Chapter 2, shown in Figure 7, is used as an example to describe in detail how the link between the catalog entry and the body text is established. In this catalog entry the chapter-number item is "Section 1", the title item is "Overview of Self-Consciousness", and the page-number item is "20".
Step 602: Determine the logical page corresponding to each catalog entry according to at least one item of information in the saved catalog entries.
Step 603: According to the logical page determined for each catalog entry, establish the link between each catalog entry and its corresponding logical page.
Figure 6B shows a method, in an embodiment of the present invention, for determining the logical page corresponding to each catalog entry according to both the page number in the page-number item and the heading in the title item of the catalog entry. The process comprises:
Step 602a: According to the page number in the page-number item of the catalog entry, determine the candidate pages in which the logical page corresponding to each catalog entry may lie.
The page number in the page-number item of this catalog entry is 20. According to the preset relation n+K-D ≤ N ≤ n+K+D between candidate pages and the page number, where the total number of catalog pages K of this digital document is 5 and the range threshold parameter D is 3, the candidate pages for this catalog entry are pages 22 to 28.
Step 602b: Extract effective information from each candidate page.
In this embodiment, the saved type-area information of the digital document is as follows: the ordinate of the upper boundary line is 80.73, the abscissa of the left boundary line is 0, the abscissa of the right boundary line is 485, and the ordinate of the lower boundary line is 697.10. In each candidate page, the characters outside the type area are determined from the coordinates of each character, and the numeric characters are extracted from them. In the calculation, as shown in Figure 4, the coordinates of each character are given by vertex 1 and vertex 3 of its minimum bounding rectangle. When the ordinate of vertex 1 is less than 80.73, or the ordinate of vertex 3 is greater than 697.10, or the abscissa of vertex 1 is less than 0, or the abscissa of vertex 3 is greater than 485, the character is considered to be outside the type area.
Numeric characters are extracted from the characters outside the type area, and the extracted numeric characters are merged. For example, the extracted numeric characters are "7" and "1", where the coordinates of "7" are (421.05, 699.83, 425.76, 706.94) and the coordinates of "1" are (416.74, 699.83, 419.47, 706.94). The extracted numeric characters are sorted by their coordinates. The abscissa of vertex 1 of "1" is less than the abscissa of vertex 1 of "7", the abscissa of vertex 3 of "1" is less than the abscissa of vertex 3 of "7", and the corresponding ordinates of "1" and "7" are identical, so the two numeric characters are on the same line and "1" is to the left of "7".
The difference between the abscissa of vertex 1 of "7" and the abscissa of vertex 3 of "1" is 1.58. The preset spacing threshold is a value between 2.37 and 4.71, so the spacing does not exceed the threshold and the two numeric characters can be merged into one digit string. The merged digit string is "17", and its coordinates are (416.74, 699.83, 425.76, 706.94).
Step 602c: Match the extracted effective information against the page number in the saved catalog entry, and determine the first catalog-entry confidence of each candidate page for the catalog entry.
The merged digit string is compared with the page number in the catalog entry. The page number in this catalog entry is 20 and the merged digit string is 17, so the two do not match. E is therefore subtracted from the confidence of the candidate page. In this embodiment the initial confidence X is 50 and E is 6, so the confidence of this candidate page is 44.
Using the above method, the first catalog-entry confidences obtained after page-number matching for the candidate pages with natural page numbers 22 to 28 are 44, 44, 44, 80, 44, 44, and 44, respectively.
Step 602d: Arrange all the characters on each page into lines according to their coordinates.
To ensure that all the characters on each page are arranged by line, the vertical distance between the horizontal central axes of characters must satisfy a condition during arrangement. This vertical distance can be determined by computing, for each character, the mean of the ordinates of its top and bottom vertices, and then taking the difference between the means of the two characters. In this embodiment, whether two characters A and B can be placed on the same line is judged as follows: compute the mean of the two ordinates of character A and the difference between its larger and smaller ordinates; compute the same two quantities for character B; then judge whether the difference between the mean ordinates of A and B is less than the product of a parameter and the smaller of the two characters' ordinate differences, that is, judge whether:
$$\left|\frac{Y_1(A)+Y_2(A)}{2}-\frac{Y_1(B)+Y_2(B)}{2}\right| < j\cdot \mathrm{MIN}\big(Y_2(A)-Y_1(A),\; Y_2(B)-Y_1(B)\big)$$
Here MIN denotes the smaller of its two arguments, j is a positive real number less than 1, $Y_1(A)$ is the smaller ordinate of character A, $Y_2(A)$ is the larger ordinate of character A, $Y_1(B)$ is the smaller ordinate of character B, and $Y_2(B)$ is the larger ordinate of character B. When the condition holds, A and B are placed on the same line; otherwise they are placed on different lines. Whether the ordinates of characters B and C satisfy the condition is judged next, to decide whether B and C are placed on the same line, and so on. All the characters on each page are arranged in this way. After arrangement, each line corresponds to a minimum bounding rectangle, as shown in Figure 4.
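The same-line condition can be expressed directly in Python. In this sketch each character is reduced to its pair of ordinates `(y1, y2)` with `y1 < y2`; the function name `same_row` is hypothetical.

```python
def same_row(a, b, j):
    """Return True when characters a and b may share a line.

    a, b: (smaller ordinate, larger ordinate) of each character
    j:    positive real number less than 1
    """
    mid_a = (a[0] + a[1]) / 2        # horizontal central axis of A
    mid_b = (b[0] + b[1]) / 2        # horizontal central axis of B
    height_a = a[1] - a[0]
    height_b = b[1] - b[0]
    return abs(mid_a - mid_b) < j * min(height_a, height_b)
```

Because the limit scales with the smaller character height, a small character is only placed on the same line as a large one when their central axes are close relative to the small character.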
Step 602e: Match each line of each candidate page against the heading in the catalog entry, and determine the second catalog-entry confidence of each candidate page for the catalog entry.
Step 602f: From the first catalog-entry confidence and the second catalog-entry confidence of each candidate page for each catalog entry, determine the total confidence of each candidate page for each catalog entry, and determine the logical page corresponding to each catalog entry according to the total confidence.
The total confidence of each candidate page for each catalog entry is determined from the first catalog-entry confidence dPageVeri and the second catalog-entry confidence dTitleVeri of the candidate page, together with the weight coefficient dPageWeight of the first catalog-entry confidence and the weight coefficient dTitleWeight of the second catalog-entry confidence. The two weight coefficients sum to 1 and are both positive real numbers; for example, dPageWeight is 0.4 and dTitleWeight is 0.6. With this method, the total confidence obtained for the candidate page shown in Figure 8, natural page 25, is 94. The candidate page with the highest total confidence is selected as the logical page corresponding to the catalog entry. Alternatively, a total-confidence threshold can be set, and a candidate page whose total confidence exceeds the threshold is taken as the logical page corresponding to the catalog entry.
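The text states only that the two weight coefficients sum to 1; the sketch below assumes the total confidence is their weighted sum, which is one natural reading rather than a confirmed formula. The function names and the sample confidences are hypothetical.

```python
def total_confidence(page_conf, title_conf, page_weight=0.4, title_weight=0.6):
    """Weighted sum of the first and second catalog-entry confidences.

    The weighted-sum form is an assumption; the weights must sum to 1.
    """
    return page_weight * page_conf + title_weight * title_conf


def pick_logical_page(candidates):
    """Return the candidate page with the highest total confidence.

    candidates: dict mapping natural page -> (page_conf, title_conf)
    """
    return max(candidates, key=lambda p: total_confidence(*candidates[p]))
```

A threshold variant would instead keep every page whose `total_confidence` exceeds a preset value.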
The process of determining the second catalog-entry confidence of each candidate page for the catalog entry comprises:
Figure 8 shows the content of natural page 25 after arrangement, as provided by the embodiment of the present invention. The LCS algorithm is used to match the characters of each line of a candidate page against the title characters in the catalog entry, and the first item-information confidence of each line for the title item is determined from the matching result. The second item-information confidence is determined as follows: determine the position of the characters of each line on the candidate page and derive a second confidence of the line from that position; compare the average character width of each line with the average character width of the body text and derive a third confidence of the line; and, for the character string successfully matched by the LCS algorithm, determine whether it shares its line with other text characters and derive a fourth confidence of the line. The second item-information confidence of each line for the title item is then determined from the second, third, and fourth confidences and the weight coefficient of each condition. The composite item-information confidence of each line is determined from the first and second item-information confidences. Finally, the largest composite item-information confidence among the lines of a page is taken as that candidate page's second catalog-entry confidence for the catalog entry.
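The way the per-line confidences roll up into a page-level value can be sketched as follows. The text names weight coefficients but gives neither their values nor the exact combining formula, so the weighted sums, the default weights, and all function names here are assumptions for illustration only.

```python
def feature_confidence(position, width, alone, weights=(0.4, 0.3, 0.3)):
    """Second item-information confidence of a line.

    Combines the second (position), third (width) and fourth (alone)
    confidences; the weights are hypothetical and should sum to 1.
    """
    w_pos, w_width, w_alone = weights
    return w_pos * position + w_width * width + w_alone * alone


def line_confidence(lcs_conf, feature_conf, lcs_weight=0.5):
    """Composite confidence of a line (assumed weighted average)."""
    return lcs_weight * lcs_conf + (1 - lcs_weight) * feature_conf


def page_title_confidence(lines):
    """Second catalog-entry confidence of a page: best line wins.

    lines: non-empty list of (lcs_conf, feature_conf) pairs
    """
    return max(line_confidence(l, f) for l, f in lines)
```

Taking the maximum over lines reflects the statement that the largest composite value on a page is used as that page's confidence.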
The LCS algorithm takes two character strings as input, here the title of the catalog entry and the line to be matched, and returns their longest common part. From the returned string, the similarity of the two strings can be determined, and with it the first item-information confidence of the line. For example, matching the characters of the second line of page 25, "Section 1 Overview of Self-Consciousness", against the catalog entry title "Overview of Self-Consciousness" outputs "Overview of Self-Consciousness". The result is identical to the catalog entry title, so the first item-information confidence of this line is 100. The higher the first item-information confidence of a line, the more similar its characters are to the title of the catalog entry.
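A minimal sketch of the matching step is given below. It implements the longest common subsequence by dynamic programming, following the expansion of LCS given in the text; a longest-common-substring variant would behave differently on non-contiguous matches. Scaling the similarity to 0 to 100 by the title length is an assumption chosen so that a fully recovered title scores 100, and the function names are hypothetical.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b."""
    m, n = len(a), len(b)
    # table[i][j] = LCS length of a[:i] and b[:j]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # Walk back through the table to recover the subsequence itself.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


def title_similarity(line: str, title: str) -> float:
    """Similarity on a 0-100 scale (assumed): share of the title recovered."""
    if not title:
        return 0.0
    return 100.0 * len(lcs(line, title)) / len(title)
```

When the whole title appears in the line, the returned subsequence equals the title and the similarity is 100.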
The position of each line's characters is also judged. Specifically, a first central axis is determined from the coordinates of the line, and a second central axis is determined from the left and right boundary lines of the type area; the absolute value of the horizontal distance between the two axes determines the second confidence of the line. The smaller this absolute value, the higher the second confidence.
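The text specifies only that a smaller axis offset gives a higher second confidence. The sketch below uses a linear fall-off from a maximum value, which is an assumed mapping chosen for illustration; `centering_confidence` and `max_conf` are hypothetical names, and the type area is assumed to have positive width.

```python
def centering_confidence(row_left, row_right, area_left, area_right,
                         max_conf=100.0):
    """Second confidence from the distance between the two central axes.

    Falls linearly (an assumption) from max_conf when the axes coincide
    to 0 when the line axis is half a type-area width away.
    """
    row_axis = (row_left + row_right) / 2
    area_axis = (area_left + area_right) / 2
    half_width = (area_right - area_left) / 2
    offset = abs(row_axis - area_axis)
    return max(0.0, max_conf * (1 - offset / half_width))
```

Any other monotonically decreasing mapping of the offset would satisfy the stated relation equally well.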
When determining the position of each line's characters, it is also possible to determine from the coordinates of the line whether the line lies approximately at the middle position defined by the left and right boundary lines of the type area. When the line lies at this middle position, whether its length, determined from its coordinates, satisfies a preset length condition is checked, for example that the length is less than 80% of the difference between the right and left boundary lines of the type area, and the second confidence of the line is determined according to whether the condition is satisfied. Alternatively, when the line lies on the left side of the type area, whether its length satisfies a preset length condition is checked, for example that the length is less than 70% of the difference between the right and left boundary lines of the type area, and the second confidence of the line is determined accordingly.
When the approximate position of the characters of each line is determined from the coordinates of the line, for example the line coordinates shown in Figure 4, a first difference between X_m in the coordinates of the line and the left boundary line of the type page can be compared with a second difference between the right boundary line of the type page and X_n. When the difference between the first difference and the second difference is greater than a set difference threshold, the characters of the line are judged to lie toward one side of the type page: when the first difference is larger than the second difference, the characters are judged to lie at the right end of the type page, and when the first difference is smaller than the second difference, they are judged to lie at the left end. Of course, many other methods may be used in an actual position judgment process, but any method that judges the position of the characters of a line from coordinate differences based on the idea of the embodiment of the present invention falls within the protection scope of the present invention.
At the same time, the average character width of each line is compared with the average character width of the body text of the digital document. For example, in Figure 8 the average character width of the second line is 13.77, while the average character width of the body text in the digital document is 10.29. Since the average character width of the second line is greater than that of the body text, the third confidence corresponding to this line is determined accordingly. The greater the average character width of a line, the higher its third confidence.
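A minimal sketch of the width comparison follows. The text states only that a wider line receives a higher third confidence; the linear scale that reaches 100 when the line is twice as wide as the body text, and the helper names, are assumptions.

```python
def average_char_width(char_boxes: list[tuple[float, float]]) -> float:
    """Mean width of characters given as (left, right) x-coordinates."""
    return sum(right - left for left, right in char_boxes) / len(char_boxes)


def width_confidence(line_avg: float, body_avg: float) -> float:
    """Assumed scoring: 0 at body-text width, 100 at twice that width."""
    if body_avg <= 0:
        return 0.0
    ratio = line_avg / body_avg
    return max(0.0, min(100.0, 100.0 * (ratio - 1)))
```

For the values in the text (13.77 against 10.29), this sketch yields a positive score, reflecting that the line is set in larger type than the body text.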
In addition, based on the character string successfully matched by the LCS algorithm, it is judged whether the line also contains other alphabetic characters. Specifically, alphabetic characters whose coordinates are directly adjacent to the successfully matched character string are regarded as other alphabetic characters present in the line. When alphabetic characters have only an indirect connection with the matched string, the line is considered to contain no other alphabetic characters. For example, in "Section 2 Overview of Self-Consciousness", the successfully matched string is "Overview of Self-Consciousness"; what adjoins this string in coordinates is a space, so "Section 2" has only an indirect connection with the string, and the line containing the matched string can be considered to contain no other alphabetic characters. The fourth confidence of the line is determined according to whether the line containing the successfully matched string contains other alphabetic characters.
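The adjacency test can be sketched on plain strings. This is a simplification: the text works with character coordinates, whereas the sketch assumes that a gap in coordinates shows up as a whitespace character in the extracted line. The function names and the 100-or-0 scoring are likewise hypothetical.

```python
def has_adjacent_text(line: str, matched: str) -> bool:
    """True if a non-space character directly touches the matched string."""
    start = line.find(matched)
    if not matched or start < 0:
        raise ValueError("matched string must be a non-empty part of the line")
    end = start + len(matched)
    # Treat the line boundaries like spaces: nothing is adjacent there.
    before = line[start - 1] if start > 0 else " "
    after = line[end] if end < len(line) else " "
    return not before.isspace() or not after.isspace()


def fourth_confidence(line: str, matched: str) -> float:
    """Hypothetical scoring: 100 if no other text is directly adjacent."""
    return 0.0 if has_adjacent_text(line, matched) else 100.0
```

In the example from the text, "Section 2" is separated from the matched title by a space, so the sketch reports no directly adjacent characters.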
The composite directory entry information confidence of each line is determined from the first directory entry information confidence of the line, the second directory entry information confidence determined by the second confidence, the third confidence and the fourth confidence, and the weight coefficient corresponding to each confidence. The highest composite directory entry information confidence among the lines of a candidate page is taken as the second catalogue entry confidence of that candidate page with respect to the catalogue entry. The weight coefficient corresponding to each confidence is a positive real number.
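The combination step can be sketched as a weighted average followed by a maximum over the lines of a page. The text specifies positive real weights and that the highest composite value is taken per candidate page; the choice of a normalised weighted average, and the function names, are assumptions.

```python
def composite_confidence(confidences: list[float],
                         weights: list[float]) -> float:
    """Weighted average of one line's confidences (weights must be positive)."""
    if len(confidences) != len(weights):
        raise ValueError("one weight is required per confidence")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive real numbers")
    return sum(c * w for c, w in zip(confidences, weights)) / sum(weights)


def page_title_confidence(lines: list[list[float]],
                          weights: list[float]) -> float:
    """Second catalogue entry confidence: best composite value on the page."""
    return max(composite_confidence(line, weights) for line in lines)
```

For instance, a line with confidences 100, 50, 0 and 100 under weights 2, 1, 1 and 1 has a composite value of 70 in this sketch.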
The method for establishing links between the catalogue and the body text of a digital document provided by the embodiment of the present invention obtains at least one piece of directory entry information from the saved catalogue entry information, matches it against the pages of the digital document, determines the logical page corresponding to each catalogue entry according to the matching result, and thereby establishes the link between each catalogue entry and its logical page. Establishing the links automatically in this way effectively improves the efficiency with which links between the catalogue and the body text of a digital document are established, and in turn improves the efficiency of producing the digital document.
Moreover, in a digital document, the main reason the natural page recorded in a catalogue entry does not correspond to the actual logical page is the content that precedes the body text, such as the copyright page, the catalogue pages, the foreword, the preface and appendices. Therefore, in the process of establishing links between the catalogue and the body text of a digital document, it suffices to identify the number of pages occupied by this content before the body text; from this number of pages and the page number information in the saved catalogue entries of the digital document, the logical page corresponding to each catalogue entry can also be determined, and the link between each catalogue entry and each logical page can thereby be established. It is believed that those skilled in the art can carry out the specific implementation according to the method provided by the embodiment of the present invention, so the details are not repeated here.
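This offset-based alternative reduces to simple arithmetic. The sketch assumes that the body text begins at natural page 1 immediately after the front matter and that logical pages are counted from 1; both assumptions, and the function name, are illustrative only.

```python
def logical_page_for(natural_page: int, front_matter_pages: int) -> int:
    """Map a printed (natural) page number to a logical page of the file.

    Assumes the body text starts at natural page 1 right after the front
    matter, and that logical pages are numbered from 1.
    """
    if natural_page < 1 or front_matter_pages < 0:
        raise ValueError("page numbers must be positive")
    return natural_page + front_matter_pages
```

Under these assumptions, with 12 pages of front matter, a catalogue entry pointing at natural page 25 is linked to logical page 37.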
As shown in Figure 9, an embodiment of the present invention provides a device for establishing links between the catalogue and the body text of a digital document, wherein the digital document catalogue comprises a plurality of catalogue entries and each catalogue entry comprises at least one piece of directory entry information. The device comprises:
A logical page identification module 90, configured to obtain at least one piece of directory entry information from each saved catalogue entry and, according to the at least one piece of directory entry information, determine in the digital document each logical page corresponding to each catalogue entry;
A link establishment module 91, configured to establish the link between each catalogue entry and each corresponding logical page.
The logical page identification module 90 comprises:
A first recognition unit 901, configured to determine in the digital document each logical page corresponding to each catalogue entry when the at least one piece of directory entry information obtained is page number directory entry information.
The first recognition unit 901 comprises:
A first candidate page determination subunit 9010, configured to determine, according to a preset rule and the page number directory entry information of each catalogue entry, the candidate pages in which the logical page corresponding to each catalogue entry is located;
A first matching subunit 9011, configured to extract effective information from each candidate page and compare whether each piece of effective information is identical to the page number directory entry information in the catalogue entry;
A first calculation subunit 9012, configured to determine the first confidence of each candidate page with respect to the catalogue entry according to whether each piece of effective information is identical to the page number directory entry information;
A logical page first determination subunit 9013, configured to determine each logical page corresponding to each catalogue entry according to the first confidence.
The logical page identification module 90 comprises:
A second recognition unit 902, configured to determine in the digital document each logical page corresponding to each catalogue entry when the at least one piece of directory entry information obtained is heading directory entry information.
The second recognition unit 902 comprises:
A line confidence first determination subunit 9020, configured to determine, according to the similarity between the heading directory entry information of each catalogue entry and each line of characters in each page of the digital document, the first confidence of each line of characters in that page with respect to the heading directory entry information;
A line confidence second determination subunit 9021, configured to determine, according to at least one piece of characteristic information of each line of characters in that page of the digital document, the second confidence of each line of characters in that page with respect to the heading directory entry information;
A second calculation subunit 9022, configured to determine the second confidence of that page of the digital document with respect to the catalogue entry according to the total confidence of each line of characters in that page with respect to the heading directory entry information;
A logical page second determination subunit 9023, configured to determine, according to the second confidence, each logical page corresponding to each piece of directory entry information.
The at least one piece of characteristic information used by the line confidence second determination subunit 9021 comprises: position information of each line of characters in the digital document, or average character width information of each line of characters in the digital document, or whether the characters in the digital document that exactly match the heading directory entry information are on the same line as other alphabetic characters.
The logical page identification module 90 comprises:
A third recognition unit 903, configured to determine in the digital document each logical page corresponding to each catalogue entry when the at least one piece of directory entry information is page number directory entry information and heading directory entry information.
The third recognition unit 903 comprises:
A second candidate page determination subunit 9030, configured to determine, according to the page number directory entry information of each catalogue entry, the candidate pages in which the logical page corresponding to each catalogue entry is located;
A first confidence determination subunit 9031, configured to determine the first confidence of each candidate page with respect to each catalogue entry;
A second confidence determination subunit 9032, configured to determine the second confidence of each candidate page with respect to each catalogue entry;
A total confidence determination subunit 9033, configured to determine the total confidence of each candidate page with respect to each catalogue entry according to the first confidence and the second confidence;
A logical page third determination subunit 9034, configured to determine the logical page corresponding to each catalogue entry according to the total confidence.
The first confidence determination subunit 9031 comprises:
A first matching submodule, configured to extract effective information from each candidate page and compare whether each piece of effective information is identical to the page number directory entry information in the catalogue entry;
A first calculation submodule, configured to determine the first catalogue entry confidence of each candidate page with respect to the catalogue entry according to whether each piece of effective information is identical to the page number directory entry information.
The second confidence determination subunit 9032 comprises:
A line confidence first determination submodule, configured to determine, according to the similarity between the heading directory entry information of each catalogue entry and each line of characters in each candidate page, the first directory entry information confidence of each line of characters in that candidate page with respect to the heading directory entry information;
A line confidence second determination submodule, configured to determine, according to at least one piece of characteristic information of each line of characters in that candidate page, the second directory entry information confidence of each line of characters in that candidate page with respect to the heading directory entry information;
A second calculation submodule, configured to determine the second catalogue entry confidence of that candidate page with respect to the catalogue entry according to the total confidence of each line of characters in that candidate page with respect to the heading directory entry information.
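The selection performed by the third recognition unit 903 can be sketched as follows. The text states that a total confidence is derived from the first and second confidences of each candidate page and that the logical page is determined from it; the weighted sum, the default weights and the rule of choosing the candidate with the highest total are assumptions, as are all names.

```python
def select_logical_page(candidates: dict[int, tuple[float, float]],
                        page_weight: float = 1.0,
                        title_weight: float = 1.0) -> int:
    """Pick the candidate page with the highest total confidence.

    candidates maps a candidate page number to a pair
    (first confidence from the page number, second confidence from the title).
    """
    def total(scores: tuple[float, float]) -> float:
        first, second = scores
        # Assumed combination: weighted sum of the two confidences.
        return page_weight * first + title_weight * second

    return max(candidates, key=lambda page: total(candidates[page]))
```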
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.