Detailed Description of the Embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Referring to Fig. 1A, a flowchart of a recessive table extraction method provided by one embodiment of the present invention is shown. The recessive table extraction method may include the following steps:
Step 101: parse the target document to obtain each character in the target document and the coordinate corresponding to each character.
Optionally, the target document is a PDF document or an image.
The target document is parsed page by page in page-number order. Every page of the target document is traversed to obtain each character on each page and the coordinate corresponding to each character.
Step 102: according to the coordinate corresponding to each character, determine characters whose distance satisfies a preset proximity condition to be characters in the same recessive table, and group the characters of the same recessive table into the same character set.
In this embodiment, the coordinate corresponding to a character is the center coordinate of that character.
Optionally, the preset proximity condition is that the difference between abscissas is less than a first preset threshold, or that the difference between ordinates is less than a second preset threshold.
On each page obtained by parsing the target document, the distances between the coordinates corresponding to the characters are compared in turn. Characters whose abscissas differ by less than the first preset threshold, or whose ordinates differ by less than the second preset threshold, are determined to be characters in the same recessive table and are grouped into the same character set.
It should be noted that a necessary precondition for judging whether the distance between the coordinates of two characters satisfies the preset proximity condition is that the two characters are nearest neighbors of each other in the lateral or longitudinal direction.
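The grouping in step 102 can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment: the function name `group_characters`, the tuple layout `(char, x, y)`, and the tolerances `line_tol` and `col_tol` (used to decide whether two characters lie on the same line or in the same column) are assumptions introduced here. Comparing all pairs against the thresholds links nearest neighbors first, and a union-find structure merges the links into character sets.

```python
from itertools import combinations


def group_characters(chars, dx_max, dy_max, line_tol=0.5, col_tol=0.5):
    """Group characters into character sets by proximity.

    chars: list of (char, x, y) tuples holding center coordinates.
    dx_max / dy_max: first and second preset thresholds.
    line_tol / col_tol: hypothetical tolerances for "same line" / "same column".
    """
    parent = list(range(len(chars)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(chars)), 2):
        _, xi, yi = chars[i]
        _, xj, yj = chars[j]
        dx, dy = abs(xi - xj), abs(yi - yj)
        lateral = dy <= line_tol and dx < dx_max       # neighbors on one line
        longitudinal = dx <= col_tol and dy < dy_max   # neighbors in one column
        if lateral or longitudinal:
            parent[find(i)] = find(j)

    groups = {}
    for index, (char, _, _) in enumerate(chars):
        groups.setdefault(find(index), []).append(char)
    return list(groups.values())
```

With thresholds of 1.5, characters spaced one unit apart on a line or in a column end up in one character set, while a character eight units away forms its own set.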
Referring to Fig. 1B, a schematic diagram of grouping the characters of the same recessive table into the same character set, provided by one embodiment of the present invention, is shown. On target page 10, obtained by parsing the target document, the distances between the coordinates corresponding to the characters are compared in turn. Take the character set "ABCDEFGHIJKLMN" as an example. The difference between the abscissas of each of the following character pairs is less than the first preset threshold: "A" and "B", "B" and "C", "C" and "D", "D" and "E", "E" and "F", "F" and "G", "G" and "H", and "H" and "I". The difference between the ordinates of each of the following character pairs is less than the second preset threshold: "A" and "F", "B" and "G", "C" and "H", and "D" and "I".
As another example, consider the character set "ABCDEFGHI" and the character set "1234567890". Because the difference between the abscissas of character "E" and character "1" is greater than the first preset threshold, "ABCDEFGHI" and "1234567890" are two different character sets. Similarly, for the character set "ABCDEFGHI" and the character set "DDDDDD", because the difference between the ordinates of character "F" and character "D" is greater than the second preset threshold, "ABCDEFGHI" and "DDDDDD" are two different character sets.
Optionally, the first preset threshold is related to the character width, and the second preset threshold is related to the character height.
Optionally, the first preset threshold and the second preset threshold are default values, which are preset by the system or set manually.
It should be noted that the first preset threshold and the second preset threshold may be the same or different.
Step 103: according to the coordinates of the characters in each character set, determine the cell range corresponding to each character set.
In one possible implementation, the cell range corresponding to a character set is in effect the range spanned by the vertex coordinates of a rectangular cell that contains all characters in the character set. Therefore, so that the cell range contains all characters in the character set, it can be determined from the minimum abscissa, the maximum abscissa, the minimum ordinate, and the maximum ordinate of the characters in the character set.
Fig. 1C is a flowchart of a method for determining the cell range corresponding to each character set, provided by one embodiment of the present invention. As shown in Fig. 1C, step 103 may be replaced by steps 103a to 103f.
Step 103a: for each character set, determine the minimum abscissa and the maximum abscissa among the abscissas of the characters in the character set, and determine the minimum ordinate and the maximum ordinate among the ordinates of the characters in the character set.
Among the abscissas of the characters in a character set, the minimum abscissa and the maximum abscissa can be regarded as the abscissas of the two longitudinal boundaries of the cell range corresponding to the character set. Among the ordinates of the characters in the character set, the minimum ordinate and the maximum ordinate can be regarded as the ordinates of the two lateral boundaries of that cell range. The boundaries determined by the minimum and maximum abscissas enclose every character whose abscissa lies within that abscissa range, and the boundaries determined by the minimum and maximum ordinates enclose every character whose ordinate lies within that ordinate range. Therefore, a cell range containing all characters in the character set can be determined from the minimum abscissa, the maximum abscissa, the minimum ordinate, and the maximum ordinate of the characters in the character set.
Step 103b: take the value obtained by subtracting a first predetermined value from the minimum abscissa, together with the value obtained by subtracting a second predetermined value from the minimum ordinate, as the first vertex coordinate of the cell range corresponding to the character set.
Step 103c: take the value obtained by subtracting the first predetermined value from the minimum abscissa, together with the value obtained by adding the second predetermined value to the maximum ordinate, as the second vertex coordinate of the cell range corresponding to the character set.
Step 103d: take the value obtained by adding the first predetermined value to the maximum abscissa, together with the value obtained by subtracting the second predetermined value from the minimum ordinate, as the third vertex coordinate of the cell range corresponding to the character set.
Step 103e: take the value obtained by adding the first predetermined value to the maximum abscissa, together with the value obtained by adding the second predetermined value to the maximum ordinate, as the fourth vertex coordinate of the cell range corresponding to the character set.
Step 103f: determine the cell range corresponding to the character set according to the first vertex coordinate, the second vertex coordinate, the third vertex coordinate, and the fourth vertex coordinate.
Because the coordinate corresponding to a character is the center coordinate of that character, and the minimum abscissa, maximum abscissa, minimum ordinate, and maximum ordinate that determine the cell range are obtained from the coordinates of the characters in the character set, a cell range determined directly from the character coordinates has borders that inevitably fail to fully enclose some of the characters in the character set. Fig. 1D is a schematic diagram of determining the cell range corresponding to a character set, provided by one embodiment of the present invention. As shown in Fig. 1D, cell range 20, determined from the minimum abscissa x1 and the maximum abscissa x2 among the abscissas of the characters in the character set and from the minimum ordinate y1 and the maximum ordinate y2 among their ordinates, cannot fully enclose some of the characters in the character set.
To prevent the determined cell range from failing to fully enclose some of the characters in the character set, the font size, or the height and width, of the characters needs to be taken into account when the cell range is determined.
Optionally, the first predetermined value is related to the character width, and the second predetermined value is related to the character height. For example, the first predetermined value is the character width and the second predetermined value is the character height; or the first predetermined value is half the character width and the second predetermined value is half the character height. The width and height of each character are obtained by the parsing in step 101.
Optionally, the first predetermined value and the second predetermined value are default values, which are preset by the system or set manually.
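Steps 103a to 103f can be sketched as follows, assuming the option in which the predetermined values are half the character width and half the character height. The function name `padded_cell_range`, the tuple layout `(x, y, width, height)`, and the choice of the largest width and height in the set as the reference size are assumptions made for this sketch.

```python
def padded_cell_range(chars):
    """Return the four vertex coordinates of the cell range of a character set.

    chars: list of (x, y, width, height) tuples, where (x, y) is the
    center coordinate of a character (hypothetical layout).
    """
    xs = [x for x, _, _, _ in chars]
    ys = [y for _, y, _, _ in chars]
    # Assumption: pad by half of the largest width/height in the set.
    pad_x = max(w for _, _, w, _ in chars) / 2
    pad_y = max(h for _, _, _, h in chars) / 2

    x1, x2 = min(xs) - pad_x, max(xs) + pad_x   # steps 103b-103e, abscissas
    y1, y2 = min(ys) - pad_y, max(ys) + pad_y   # steps 103b-103e, ordinates
    # First, second, third, and fourth vertex coordinates (step 103f).
    return [(x1, y1), (x1, y2), (x2, y1), (x2, y2)]
```

Padding by half a character moves each border from the character centers to the character edges, which is the minimum needed for the outermost characters to fit inside the cell range.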
Referring also to Fig. 1D, the value x1' obtained by subtracting the first predetermined value from the minimum abscissa x1, together with the value y1' obtained by subtracting the second predetermined value from the minimum ordinate y1, is taken as the first vertex coordinate (x1', y1') of the cell range corresponding to the character set. The value x1', together with the value y2' obtained by adding the second predetermined value to the maximum ordinate y2, is taken as the second vertex coordinate (x1', y2'). The value x2' obtained by adding the first predetermined value to the maximum abscissa x2, together with the value y1', is taken as the third vertex coordinate (x2', y1'). The value x2', together with the value y2', is taken as the fourth vertex coordinate (x2', y2').
Compared with cell range 20, which is determined from the minimum abscissa x1 and the maximum abscissa x2 among the abscissas of the characters in the character set and from the minimum ordinate y1 and the maximum ordinate y2 among their ordinates, cell range 21, which is determined from the first vertex coordinate (x1', y1'), the second vertex coordinate (x1', y2'), the third vertex coordinate (x2', y1'), and the fourth vertex coordinate (x2', y2'), can fully enclose every character of the corresponding character set "XXXXXXXXXXXXXXXX".
Step 104: generate a dominant table according to the characters contained in each character set, the coordinate corresponding to each character, and the cell range corresponding to each character set.
After the four edges of the cell corresponding to a character set are determined and generated according to the first vertex coordinate, the second vertex coordinate, the third vertex coordinate, and the fourth vertex coordinate, each character of the character set is filled into the cell range corresponding to the character set according to the coordinate of that character, thereby generating the dominant table.
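The description does not specify how the characters are ordered when they are filled into a cell. One possible sketch is to rebuild reading order from the center coordinates. The function name `cell_text`, the tolerance `line_tol`, and the assumption that ordinates increase upward (as in default PDF user space) are all introduced here for illustration.

```python
def cell_text(chars, line_tol=0.5):
    """Assemble the text of one cell from its characters.

    chars: list of (char, x, y) center coordinates.
    Assumption: larger y means higher on the page; characters whose
    ordinates differ by at most line_tol belong to the same line.
    """
    lines = []  # each entry: [reference_y, [(x, char), ...]]
    for char, x, y in sorted(chars, key=lambda c: -c[2]):  # top to bottom
        if lines and abs(lines[-1][0] - y) <= line_tol:
            lines[-1][1].append((x, char))
        else:
            lines.append([y, [(x, char)]])
    # Within each line, order characters from left to right.
    return "\n".join(
        "".join(char for _, char in sorted(row)) for _, row in lines
    )
```

The resulting string can then be stored together with the four vertex coordinates of the cell range to form one cell of the generated table.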
In summary, in the recessive table extraction method provided by this embodiment of the present invention, character sets and the cell ranges corresponding to the character sets are determined according to the coordinate corresponding to each character in the target document, and a dominant table is generated. Because the cell range can be determined more accurately from the coordinates of the characters in a character set, the method solves the problem that existing PDF extraction techniques lack a corresponding way of handling table data in PDF documents. It achieves the effect of determining the cell ranges of a recessive table from the coordinates of the characters of the recessive table in the target document, and of generating a dominant table according to the determined cell ranges.
One character set corresponds to one cell. Because a table contains more than one cell, after the cell range corresponding to each character set is determined, the coordinates of the adjacent edges of adjacent cell ranges belonging to the same table need to be adjusted so that the adjacent edges coincide. Referring to Fig. 2A, a flowchart of a recessive table extraction method provided by another embodiment of the present invention is shown. The recessive table extraction method may include the following steps:
Step 201: parse the target document to obtain each character in the target document and the coordinate corresponding to each character.
Step 202: according to the coordinate corresponding to each character, group characters whose distance satisfies the preset proximity condition into the same character set.
Step 203: according to the coordinates of the characters in each character set, determine the cell range corresponding to each character set.
Step 204: determine character sets whose corresponding cell ranges are less than a third preset threshold apart to be character sets in the same recessive table, and group the character sets of the same recessive table into the same set group.
Optionally, the distance between two cell ranges is the distance between the center coordinates of the two cell ranges.
Optionally, the distance between two cell ranges is the distance between adjacent vertex coordinates among the vertex coordinates of the two cell ranges.
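The two optional distance definitions can be sketched as follows. The tuple layout `(min_x, min_y, max_x, max_y)` and the function names are assumptions. In particular, reading "adjacent vertex coordinates" as the closest pair of vertices, one from each cell range, is an interpretation made for this sketch.

```python
import math


def center_distance(a, b):
    """Distance between the centers of two cell ranges.

    a, b: (min_x, min_y, max_x, max_y) tuples (hypothetical layout).
    """
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return math.hypot(ax - bx, ay - by)


def _vertices(r):
    # The four corner points of a cell range.
    return [(r[0], r[1]), (r[0], r[3]), (r[2], r[1]), (r[2], r[3])]


def nearest_vertex_distance(a, b):
    """Smallest distance between a vertex of a and a vertex of b."""
    return min(
        math.hypot(px - qx, py - qy)
        for px, py in _vertices(a)
        for qx, qy in _vertices(b)
    )
```

The center distance grows with cell width, so a threshold tied to character size tends to pair more naturally with the vertex-based distance when cells hold many characters.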
Referring also to Fig. 1B, for example, the distance between cell range 11, corresponding to the character set "ABCDEFGHI", and cell range 12, corresponding to the character set "1234567890", is less than the third preset threshold. Therefore "ABCDEFGHI" and "1234567890" are determined to be character sets in the same recessive table and are grouped into the same set group. Likewise, the distance between cell range 11, corresponding to "ABCDEFGHI", and cell range 13, corresponding to the character set "DDDDDD", is less than the third preset threshold, so "ABCDEFGHI" and "DDDDDD" are grouped into the same set group. By contrast, the distance between cell range 13, corresponding to "DDDDDD", and cell range 14, corresponding to the character set "FFFFFFF", is greater than the third preset threshold, so "DDDDDD" and "FFFFFFF" are determined not to be character sets in the same recessive table, that is, they do not belong to the same set group.
It should be noted that when two cell ranges are each less than the third preset threshold away from the same third cell range, the character sets corresponding to the two cell ranges are determined to be character sets in the same recessive table, and the character sets of the same recessive table are grouped into the same set group. Referring also to Fig. 1B, consider cell range 11, corresponding to the character set "ABCDEFGHI", cell range 12, corresponding to the character set "1234567890", and cell range 13, corresponding to the character set "DDDDDD". The distance between cell range 11 and cell range 12 is less than the third preset threshold, and the distance between cell range 11 and cell range 13 is less than the third preset threshold. Therefore "ABCDEFGHI", "1234567890", and "DDDDDD" are determined to be character sets in the same recessive table and are grouped into the same set group.
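Because two cell ranges that are each close to a common third cell range end up in the same set group, the grouping behaves like finding connected components. A minimal sketch follows. The function name `group_cell_ranges` and the pluggable `distance` callable are assumptions, and any of the optional distance definitions can be passed in.

```python
def group_cell_ranges(ranges, threshold, distance):
    """Group cell ranges into set groups by transitive proximity.

    ranges: list of cell ranges in any representation that the
    caller-supplied distance(a, b) function understands.
    Returns a list of groups, each a sorted list of indices into ranges.
    """
    unvisited = set(range(len(ranges)))
    groups = []
    while unvisited:
        start = min(unvisited)
        unvisited.remove(start)
        stack, component = [start], [start]
        while stack:
            i = stack.pop()
            # Every unvisited range closer than the third preset threshold
            # joins the current set group.
            near = [j for j in unvisited
                    if distance(ranges[i], ranges[j]) < threshold]
            for j in near:
                unvisited.remove(j)
            stack.extend(near)
            component.extend(near)
        groups.append(sorted(component))
    return groups
```

For illustration, with one-dimensional positions 0, 1, 2, and 10 and a threshold of 1.5, the first three positions form one set group even though positions 0 and 2 are farther apart than the threshold.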
Optionally, the third preset threshold is related to the width or height of the characters.
Optionally, the third preset threshold is a default value, which is preset by the system or set manually.
In one possible implementation, because cells belonging to the same table usually have the same cell format, before judging from the distance between the cell ranges of two cells whether the two cells belong to the same table, it is first necessary to judge whether the cell formats of the two cells are the same. Fig. 2B is a flowchart of a method for grouping the character sets of the same recessive table into the same set group, provided by one embodiment of the present invention. As shown in Fig. 2B, step 204 may be replaced by:
Step 204a: according to the coordinates of the characters contained in each character set, determine the cell format corresponding to each character set.
It should be noted that the cell format mentioned in this embodiment includes, but is not limited to, alignment, font size, line spacing, and character background color.
Step 204b: determine character sets whose corresponding cell ranges are less than the third preset threshold apart and whose corresponding cell formats are the same to be character sets in the same recessive table, and group the character sets of the same recessive table into the same set group.
Referring also to Fig. 1B, take alignment as the cell format, and consider cell range 11, corresponding to the character set "ABCDEFGHI", cell range 12, corresponding to the character set "1234567890", and cell range 13, corresponding to the character set "DDDDDD". The distance between cell range 11 and cell range 13 is less than the third preset threshold, and their alignments are the same (cell range 11 and cell range 13 are both right-aligned). Therefore "ABCDEFGHI" and "DDDDDD" are determined to be character sets in the same recessive table and are grouped into the same set group. Although the distance between cell range 11 and cell range 12 is also less than the third preset threshold, their alignments differ (cell range 11 is right-aligned, whereas cell range 12 is left-aligned). Therefore "ABCDEFGHI" and "1234567890" are determined not to be character sets in the same recessive table, that is, they do not belong to the same set group.
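The combined test of step 204b can be sketched as a single predicate. The dictionary keys `range` and `format`, the function name, and the string values used for alignment are hypothetical. How the format is derived from the character coordinates in step 204a is not detailed in the description and is left out here.

```python
def same_recessive_table(a, b, threshold, distance):
    """Step 204b check for two character sets.

    a, b: dicts with hypothetical keys "range" (the cell range) and
    "format" (for example, the alignment of the cell).
    distance: caller-supplied function returning the distance between
    two cell ranges.
    """
    close_enough = distance(a["range"], b["range"]) < threshold
    return close_enough and a["format"] == b["format"]
```

Mirroring the Fig. 1B example, two right-aligned cells within the threshold pass the check, while a right-aligned and a left-aligned cell at the same distance do not.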
Step 205: for each set group, among the cell ranges corresponding to the character sets contained in the set group, adjust the coordinates of the adjacent vertices of adjacent cell ranges so that the adjusted adjacent vertices coincide.
In the same table, the adjacent edges of adjacent cells usually coincide, the cells in the same column have the same width, and the cells in the same row have the same height. Because the cell range corresponding to a character set is determined from the coordinates of the characters in the character set, the widths and heights of the cell ranges corresponding to character sets with different numbers of characters may differ. Therefore, before the dominant table is generated, the coordinates of the adjacent vertices of adjacent cells in the same table need to be adjusted so that the adjusted adjacent vertices coincide, thereby optimizing the display of the table.
First, determine adjacent cell ranges:
Within the same set group, if the distance between the minimum abscissa among the vertex coordinates of cell range A and the maximum abscissa among the vertex coordinates of cell range B is smaller than the distance between the minimum abscissa among the vertex coordinates of every other cell range in the set group and the maximum abscissa among the vertex coordinates of cell range B, cell range A is determined to be a laterally adjacent cell range of cell range B.
Within the same set group, if the distance between the maximum abscissa among the vertex coordinates of cell range A and the minimum abscissa among the vertex coordinates of cell range B is smaller than the distance between the maximum abscissa among the vertex coordinates of every other cell range in the set group and the minimum abscissa among the vertex coordinates of cell range B, cell range A is determined to be a laterally adjacent cell range of cell range B.
Within the same set group, if the distance between the minimum ordinate among the vertex coordinates of cell range A and the maximum ordinate among the vertex coordinates of cell range B is smaller than the distance between the minimum ordinate among the vertex coordinates of every other cell range in the set group and the maximum ordinate among the vertex coordinates of cell range B, cell range A is determined to be a longitudinally adjacent cell range of cell range B.
Within the same set group, if the distance between the maximum ordinate among the vertex coordinates of cell range A and the minimum ordinate among the vertex coordinates of cell range B is smaller than the distance between the maximum ordinate among the vertex coordinates of every other cell range in the set group and the minimum ordinate among the vertex coordinates of cell range B, cell range A is determined to be a longitudinally adjacent cell range of cell range B.
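The four adjacency rules share one pattern: among the other cell ranges of the set group, pick the one whose relevant edge is closest to the opposite edge of cell range B. A minimal sketch follows, in which the tuple layout `(min_x, min_y, max_x, max_y)`, the index constants, and the function name `nearest_by_edge` are assumptions.

```python
# Index of each edge in a (min_x, min_y, max_x, max_y) tuple (assumed layout).
MIN_X, MIN_Y, MAX_X, MAX_Y = range(4)


def nearest_by_edge(b, others, edge_of_other, edge_of_b):
    """Return the cell range in others that is adjacent to b.

    The adjacent cell range A is the one whose edge_of_other coordinate
    is closest to the edge_of_b coordinate of b, for example
    (MIN_X, MAX_X) for the laterally adjacent cell range on the right.
    """
    return min(others, key=lambda a: abs(a[edge_of_other] - b[edge_of_b]))
```

Passing `(MIN_X, MAX_X)`, `(MAX_X, MIN_X)`, `(MIN_Y, MAX_Y)`, or `(MAX_Y, MIN_Y)` selects one of the four rules. If several cell ranges are equally close, `min` returns the first one encountered, so a caller may want an additional tie-break.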
Second, separately adjust the vertex coordinates of each row of cell ranges and of each column of cell ranges in the table (that is, adjust the vertex coordinates of cell ranges that are laterally adjacent to one another, and adjust the vertex coordinates of cell ranges that are longitudinally adjacent to one another):
Among the vertex coordinates of cell ranges that are laterally adjacent to one another, determine the maximum ordinate and the minimum ordinate; replace the maximum ordinate of each of these cell ranges with the determined maximum ordinate, and replace the minimum ordinate of each of these cell ranges with the determined minimum ordinate. Among the vertex coordinates of cell ranges that are longitudinally adjacent to one another, determine the maximum abscissa and the minimum abscissa; replace the maximum abscissa of each of these cell ranges with the determined maximum abscissa, and replace the minimum abscissa of each of these cell ranges with the determined minimum abscissa.
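The per-row and per-column adjustment can be sketched as follows, assuming each cell range is a `(min_x, min_y, max_x, max_y)` tuple and that the rows and columns have already been collected from the adjacency relations. The function names are introduced here for illustration.

```python
def align_row(row):
    """Give every cell range in a row the same minimum and maximum ordinate.

    row: list of (min_x, min_y, max_x, max_y) tuples that are laterally
    adjacent to one another.
    """
    low = min(r[1] for r in row)
    high = max(r[3] for r in row)
    return [(r[0], low, r[2], high) for r in row]


def align_column(column):
    """Give every cell range in a column the same minimum and maximum abscissa.

    column: list of (min_x, min_y, max_x, max_y) tuples that are
    longitudinally adjacent to one another.
    """
    left = min(r[0] for r in column)
    right = max(r[2] for r in column)
    return [(left, r[1], right, r[3]) for r in column]
```

After both functions are applied, every row has a uniform height and every column a uniform width, while gaps between neighboring rows and columns still remain.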
Finally, adjust the vertex coordinates of adjacent rows of cell ranges and of adjacent columns of cell ranges in the table:
Within the same set group, when cell range A is a laterally adjacent cell range of cell range B, if the minimum abscissa among the vertex coordinates of cell range A is nearest to the maximum abscissa among the vertex coordinates of cell range B, the two values are averaged to obtain an average abscissa. The average abscissa replaces the minimum abscissa in the vertex coordinates of all cell ranges in the column containing cell range A, and the maximum abscissa in the vertex coordinates of all cell ranges in the column containing cell range B.
Within the same set group, when cell range A is a laterally adjacent cell range of cell range B, if the maximum abscissa among the vertex coordinates of cell range A is nearest to the minimum abscissa among the vertex coordinates of cell range B, the two values are averaged to obtain an average abscissa. The average abscissa replaces the maximum abscissa in the vertex coordinates of all cell ranges in the column containing cell range A, and the minimum abscissa in the vertex coordinates of all cell ranges in the column containing cell range B.
Within the same set group, when cell range A is a longitudinally adjacent cell range of cell range B, if the minimum ordinate among the vertex coordinates of cell range A is nearest to the maximum ordinate among the vertex coordinates of cell range B, the two values are averaged to obtain an average ordinate. The average ordinate replaces the minimum ordinate in the vertex coordinates of all cell ranges in the row containing cell range A, and the maximum ordinate in the vertex coordinates of all cell ranges in the row containing cell range B.
Within the same set group, when cell range A is a longitudinally adjacent cell range of cell range B, if the maximum ordinate among the vertex coordinates of cell range A is nearest to the minimum ordinate among the vertex coordinates of cell range B, the two values are averaged to obtain an average ordinate. The average ordinate replaces the maximum ordinate in the vertex coordinates of all cell ranges in the row containing cell range A, and the minimum ordinate in the vertex coordinates of all cell ranges in the row containing cell range B.
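The edge-merging rule can be sketched for two adjacent columns. The row case is symmetric, with ordinates in place of abscissas. The sketch assumes `(min_x, min_y, max_x, max_y)` tuples and columns that have already been aligned, so that all cell ranges in a column share the same abscissas. The function name `merge_column_edge` is hypothetical.

```python
def merge_column_edge(left_column, right_column):
    """Make the shared edge of two adjacent columns coincide.

    left_column / right_column: lists of (min_x, min_y, max_x, max_y)
    tuples; within each column all cell ranges already share the same
    minimum and maximum abscissa.
    """
    # Average the facing edges: maximum abscissa on the left,
    # minimum abscissa on the right.
    average_x = (left_column[0][2] + right_column[0][0]) / 2
    new_left = [(r[0], r[1], average_x, r[3]) for r in left_column]
    new_right = [(average_x, r[1], r[2], r[3]) for r in right_column]
    return new_left, new_right
```

Using the average splits the gap evenly between the two columns, so neither column grows at the expense of the other.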
Referring to Fig. 2C, a schematic diagram of adjusting the coordinates of the adjacent vertices of adjacent cell ranges so that the adjusted adjacent vertices coincide, provided by one embodiment of the present invention, is shown. The character sets corresponding to cell range 30, cell range 31, cell range 32, and cell range 33 belong to the same set group.
Step 1: determine adjacent cell ranges:
The distance between the maximum abscissa among the vertex coordinates of cell range 30 (that is, the abscissa of vertex 30a and vertex 30b) and the minimum abscissa among the vertex coordinates of cell range 31 (that is, the abscissa of vertex 31a and vertex 31b) is smaller than the distances between the maximum abscissas among the vertex coordinates of cell range 32 and cell range 33, respectively, and the minimum abscissa among the vertex coordinates of cell range 31. Therefore cell range 30 is determined to be a laterally adjacent cell range of cell range 31.
The distance between the minimum ordinate among the vertex coordinates of cell range 30 (that is, the ordinate of vertex 30c and vertex 30b) and the maximum ordinate among the vertex coordinates of cell range 32 (that is, the ordinate of vertex 32a and vertex 32b) is smaller than the distances between the minimum ordinates among the vertex coordinates of cell range 31 and cell range 33, respectively, and the maximum ordinate among the vertex coordinates of cell range 32. Therefore cell range 30 is determined to be a longitudinally adjacent cell range of cell range 32.
Similarly, cell range 32 is a laterally adjacent cell range of cell range 33, and cell range 31 is a longitudinally adjacent cell range of cell range 33.
Step 2: separately adjust the vertex coordinates of each row of cell ranges and of each column of cell ranges in the table:
Among the vertex coordinates of cell range 30 and cell range 31, which are laterally adjacent to each other, the maximum ordinate (that is, the ordinate of vertex 30a) and the minimum ordinate (that is, the ordinate of vertex 30c, vertex 30b, and vertex 31b) are determined. The determined maximum ordinate replaces the maximum ordinate of cell range 31, which is laterally adjacent to cell range 30 (that is, the ordinate of 31a). Because the minimum ordinates in the vertex coordinates of cell range 30 and cell range 31 are the same, no replacement is performed for the minimum ordinate.
Among the vertex coordinates of cell range 30 and cell range 32, which are longitudinally adjacent to each other, the maximum abscissa (that is, the abscissa of vertex 30a and vertex 32b) and the minimum abscissa (that is, the abscissa of vertex 32a) are determined. The determined minimum abscissa replaces the minimum abscissa of cell range 30, which is longitudinally adjacent to cell range 32 (that is, the abscissa of 30c). Because the maximum abscissas in the vertex coordinates of cell range 30 and cell range 32 are the same, no replacement is performed for the maximum abscissa.
Similarly, the vertex coordinates of cell range 32 and cell range 33, which are laterally adjacent to each other, are adjusted, and the vertex coordinates of cell range 31 and cell range 33, which are longitudinally adjacent to each other, are adjusted.
In the adjusted table, the cell ranges in each row share the same minimum ordinate and the same maximum ordinate, and the cell ranges in each column share the same minimum abscissa and the same maximum abscissa.
Step 3: adjust the vertex coordinates of adjacent rows of cell ranges and of adjacent columns of cell ranges in the table:
Because the maximum abscissa among the vertex coordinates of cell range 30 is nearest to the minimum abscissa among the vertex coordinates of cell range 31, coordinate 30a, where the maximum abscissa of cell range 30 is located, and coordinate 31a, where the minimum abscissa of cell range 31 is located, are determined to be adjacent coordinates. Likewise, coordinate 30b, where the maximum abscissa of cell range 30 is located, and coordinate 31b, where the minimum abscissa of cell range 31 is located, are determined to be adjacent coordinates. The abscissa of coordinate 30a of cell range 30 and the abscissa of coordinate 31a of cell range 31 are averaged to obtain an average abscissa, or the abscissa of coordinate 30b of cell range 30 and the abscissa of coordinate 31b of cell range 31 are averaged to obtain the average abscissa. The obtained average abscissa replaces the maximum abscissa in the vertex coordinates of all cell ranges in the column containing cell range 30 (cell range 30 and cell range 32), that is, the abscissas of coordinate 30a, coordinate 30b, coordinate 32b, and coordinate 32c. It also replaces the minimum abscissa in the vertex coordinates of all cell ranges in the column containing cell range 31 (cell range 31 and cell range 33), that is, the abscissas of coordinate 31a, coordinate 31b, coordinate 33a, and coordinate 33b.
Because the minimum ordinate among the vertex coordinates of cell range 30 is closest to the maximum ordinate among the vertex coordinates of cell range 32, coordinate 30c (which carries the minimum ordinate of cell range 30) and coordinate 32a (which carries the maximum ordinate of cell range 32) are determined to be adjacent coordinates, and likewise coordinate 30b of cell range 30 and coordinate 32b of cell range 32 are determined to be adjacent coordinates. The ordinate of coordinate 30c and the ordinate of coordinate 32a are averaged to obtain an average ordinate; alternatively, the ordinate of coordinate 30b and the ordinate of coordinate 32b are averaged to obtain the average ordinate. The average ordinate then replaces the minimum ordinate in the vertex coordinates of every cell range in the row containing cell range 30 (cell range 30 and cell range 31), that is, the ordinates of coordinates 30c, 30b and 31b, and the maximum ordinate in the vertex coordinates of every cell range in the row containing cell range 32 (cell range 32 and cell range 33), that is, the ordinates of coordinates 32a, 32b and 33a.
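The averaging in Step 3 can be sketched in Python as shown below. This is a minimal illustration, not the embodiment's implementation: the Cell class, the grid layout (grid[r][c], top row first, larger ordinate higher on the page) and the function name are hypothetical, and the sketch assumes that each row and column has already been aligned as described in Step 2, so that reading the edge of the first cell in a row or column is sufficient.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    # Hypothetical axis-aligned cell range; larger y means higher on the page.
    x_min: float
    x_max: float
    y_min: float
    y_max: float


def merge_shared_edges(grid):
    """Make adjacent columns and adjacent rows share one edge by averaging.

    Assumes every row already has a common y_min/y_max and every column a
    common x_min/x_max. Cells are modified in place and the grid is returned.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    # Adjacent columns: average the right edge of column c and the left edge of c + 1.
    for c in range(n_cols - 1):
        shared_x = (grid[0][c].x_max + grid[0][c + 1].x_min) / 2
        for r in range(n_rows):
            grid[r][c].x_max = shared_x
            grid[r][c + 1].x_min = shared_x
    # Adjacent rows: average the bottom edge of row r and the top edge of row r + 1.
    for r in range(n_rows - 1):
        shared_y = (grid[r][0].y_min + grid[r + 1][0].y_max) / 2
        for c in range(n_cols):
            grid[r][c].y_min = shared_y
            grid[r + 1][c].y_max = shared_y
    return grid
```

After this pass the adjacent vertices of neighbouring cell ranges coincide, so the four borders of every cell can be drawn without gaps or overlaps.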
Step 206: for each set group that includes multiple character sets, generate an explicit table according to the characters included in each character set in the set group, the coordinates corresponding to those characters, and the cell range corresponding to each character set in the set group.
After the four borders of each cell are determined and generated according to the adjusted coordinates of each cell range, each character is inserted into the cell range corresponding to its character set, according to the characters included in each character set in the set group and the coordinates corresponding to those characters, thereby generating the explicit table.
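A minimal Python sketch of the character-insertion part of Step 206 is given below. It is illustrative only: the function name and the input layout (grid_sets[r][c] holding the character set for the cell in row r, column c, as (character, x, y) tuples) are assumptions, as is the reading order (top to bottom by descending ordinate, then left to right by ascending abscissa). The sketch further assumes that characters on one text line share the same ordinate; drawing the cell borders is not shown.

```python
def build_explicit_table(grid_sets):
    """Turn a grid of character sets into a grid of cell strings.

    grid_sets[r][c] is a list of (char, x, y) tuples belonging to the cell in
    row r, column c. Characters are ordered by descending y, then ascending x.
    """
    table = []
    for row in grid_sets:
        cells = []
        for char_set in row:
            ordered = sorted(char_set, key=lambda item: (-item[2], item[1]))
            cells.append("".join(char for char, _, _ in ordered))
        table.append(cells)
    return table


# Usage: two rows and two columns of character sets in arbitrary input order.
example = [
    [[("b", 2, 10), ("a", 1, 10)], [("1", 20, 10)]],
    [[("d", 2, 0), ("c", 1, 0)], [("2", 20, 0)]],
]
rows = build_explicit_table(example)
```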
It should be noted that steps 201 to 203 are similar to steps 101 to 103, so this embodiment does not describe steps 201 to 203 again.
In summary, the implicit table extraction method provided by this embodiment of the present invention determines the character sets and the cell range corresponding to each character set according to the coordinates of the characters in the target document, and generates an explicit table. Because the cell ranges are determined from the coordinates of the characters in each character set, they can be determined more accurately. This solves the problem that existing PDF extraction techniques lack a way to handle the table data of a PDF document, and achieves the effect of determining the cell ranges of an implicit table from the coordinates of its characters in the target document and generating an explicit table from the determined cell ranges.
The following are device embodiments of the present invention. For details not described in the device embodiments, refer to the corresponding method embodiments above.
Referring to Fig. 3, a block diagram of the implicit table extraction device provided in one embodiment of the present invention is shown.
The device includes: a parsing module 301, a first division module 302, a first determining module 303 and a generation module 304.
Parsing module 301, configured to parse the target document and obtain each character in the target document and the coordinate corresponding to each character;
First division module 302, configured to determine, according to the coordinate corresponding to each character, characters whose distance satisfies a preset proximity condition to be characters in the same implicit table, and to divide the characters in the same implicit table into the same character set;
First determining module 303, configured to determine the cell range corresponding to each character set according to the coordinates corresponding to the characters in that character set;
Generation module 304, configured to generate an explicit table according to the characters included in each character set, the coordinate corresponding to each character and the cell range corresponding to each character set.
In a possible implementation, the first division module 302 is further configured to:
determine characters whose abscissas differ by less than a first preset threshold, or whose ordinates differ by less than a second preset threshold, to be characters in the same implicit table, and divide the characters in the same implicit table into the same character set.
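The proximity test applied by the first division module can be sketched in Python as follows. The condition in is_close is taken literally from the text (abscissa difference below the first threshold, or ordinate difference below the second). Everything else is an assumption made for illustration: the function names, the (character, x, y) tuple layout, and in particular the use of a union-find structure so that characters linked through a chain of close neighbours end up in one character set.

```python
def is_close(a, b, x_threshold, y_threshold):
    """Proximity condition: |dx| below the first threshold or |dy| below the second."""
    return abs(a[0] - b[0]) < x_threshold or abs(a[1] - b[1]) < y_threshold


def group_characters(chars, x_threshold, y_threshold):
    """Divide (char, x, y) tuples into character sets of mutually linked characters."""
    parent = list(range(len(chars)))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(chars)):
        for j in range(i + 1, len(chars)):
            if is_close(chars[i][1:], chars[j][1:], x_threshold, y_threshold):
                parent[find(j)] = find(i)

    groups = {}
    for i, item in enumerate(chars):
        groups.setdefault(find(i), []).append(item)
    return list(groups.values())


# Usage: two pairs of neighbouring characters that are far from each other.
sets = group_characters(
    [("a", 0, 0), ("b", 1, 0), ("c", 100, 50), ("d", 101, 50)], 5, 5
)
```

The pairwise comparison is quadratic in the number of characters, which keeps the sketch short; a page-sized input would normally be handled with a spatial index instead.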
In a possible implementation, the first determining module 303 includes:
First determining unit, configured to, for each character set, determine the minimum abscissa and the maximum abscissa among the abscissas of the characters in the character set, and determine the minimum ordinate and the maximum ordinate among the ordinates of the characters in the character set;
Second determining unit, configured to determine the value obtained by subtracting a first predetermined value from the minimum abscissa, together with the value obtained by subtracting a second predetermined value from the minimum ordinate, as the first vertex coordinate of the cell range corresponding to the character set;
Third determining unit, configured to determine the value obtained by subtracting the first predetermined value from the minimum abscissa, together with the value obtained by adding the second predetermined value to the maximum ordinate, as the second vertex coordinate of the cell range corresponding to the character set;
Fourth determining unit, configured to determine the value obtained by adding the first predetermined value to the maximum abscissa, together with the value obtained by subtracting the second predetermined value from the minimum ordinate, as the third vertex coordinate of the cell range corresponding to the character set;
Fifth determining unit, configured to determine the value obtained by adding the first predetermined value to the maximum abscissa, together with the value obtained by adding the second predetermined value to the maximum ordinate, as the fourth vertex coordinate of the cell range corresponding to the character set;
Sixth determining unit, configured to determine the cell range corresponding to the character set according to the first vertex coordinate, the second vertex coordinate, the third vertex coordinate and the fourth vertex coordinate.
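The six determining units together compute a padded bounding box around a character set. A minimal Python sketch under assumed names (cell_range_vertices, pad_x for the first predetermined value, pad_y for the second predetermined value, and (character, x, y) tuples) might look like this:

```python
def cell_range_vertices(char_set, pad_x, pad_y):
    """Return the four vertex coordinates of the cell range for one character set.

    char_set is a list of (char, x, y) tuples. pad_x and pad_y stand for the
    first and second predetermined values. Vertices are returned in the order
    first, second, third, fourth as listed in the text.
    """
    xs = [x for _, x, _ in char_set]
    ys = [y for _, _, y in char_set]
    left = min(xs) - pad_x     # minimum abscissa minus first predetermined value
    right = max(xs) + pad_x    # maximum abscissa plus first predetermined value
    bottom = min(ys) - pad_y   # minimum ordinate minus second predetermined value
    top = max(ys) + pad_y      # maximum ordinate plus second predetermined value
    return [(left, bottom), (left, top), (right, bottom), (right, top)]


# Usage: three characters padded by 2 horizontally and 3 vertically.
vertices = cell_range_vertices([("a", 10, 20), ("b", 14, 20), ("c", 10, 26)], 2, 3)
```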
In a possible implementation, the device further includes:
Second division module, configured to, after the cell range corresponding to each character set is determined, determine character sets whose corresponding cell ranges are separated by a distance less than a third preset threshold to be character sets in the same implicit table, and to divide the character sets in the same implicit table into the same set group;
The generation module is further configured to:
for each set group that includes multiple character sets, generate an explicit table according to the characters included in each character set in the set group, the coordinates corresponding to those characters, and the cell range corresponding to each character set in the set group.
In a possible implementation, the device further includes:
Second determining module, configured to determine the cell format corresponding to each character set according to the coordinates of the characters included in that character set;
The second division module is further configured to:
determine character sets whose corresponding cell ranges are separated by a distance less than the third preset threshold and whose corresponding cell formats are the same to be character sets in the same implicit table, and divide the character sets in the same implicit table into the same set group.
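The grouping performed by the second division module can be sketched in Python as below. This section does not define how the distance between two cell ranges is measured or what a cell format contains, so the sketch makes loud assumptions: the distance is taken to be the Euclidean gap between two axis-aligned rectangles (zero when they touch or overlap), a cell format is treated as any value comparable with ==, and grouping is transitive. All function and parameter names are hypothetical.

```python
import math


def rect_gap(a, b):
    """Euclidean gap between rectangles given as (x_min, y_min, x_max, y_max)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)


def group_character_sets(ranges, formats, threshold):
    """Return set groups as lists of indices into ranges.

    Two character sets are linked when their cell ranges are closer than
    threshold (the third preset threshold) and their formats are equal.
    """
    parent = list(range(len(ranges)))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(ranges)):
        for j in range(i + 1, len(ranges)):
            if formats[i] == formats[j] and rect_gap(ranges[i], ranges[j]) < threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(ranges)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

Passing the same value for every entry of formats reduces this to the distance-only variant described for the second division module without the second determining module.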
In a possible implementation, the device further includes:
Adjustment module, configured to, before the explicit table is generated according to the characters included in each character set, the coordinate corresponding to each character and the cell range corresponding to each character set, adjust, for each set group, the coordinates of the adjacent vertices of adjacent cell ranges among the cell ranges corresponding to the character sets included in the set group, so that the adjacent vertices coincide after adjustment.
In summary, the implicit table extraction device provided by this embodiment of the present invention determines the character sets and the cell range corresponding to each character set according to the coordinates of the characters in the target document, and generates an explicit table. Because the cell ranges are determined from the coordinates of the characters in each character set, they can be determined more accurately. This solves the problem that existing PDF extraction techniques lack a way to handle the table data of a PDF document, and achieves the effect of determining the cell ranges of an implicit table from the coordinates of its characters in the target document and generating an explicit table from the determined cell ranges.
It should be noted that when the implicit table extraction device provided in the above embodiment extracts a table, the division into the functional modules described above is only an example. In practical applications, the functions may be assigned to different functional modules as needed; that is, the internal structure of the electronic device may be divided into different functional modules to complete all or part of the functions described above. In addition, the implicit table extraction device provided in the above embodiment and the implicit table extraction method embodiments belong to the same concept; for the specific implementation process, refer to the method embodiments, which is not repeated here.
Fig. 4 is a block diagram of a terminal according to an exemplary embodiment. The terminal 400 may be implemented as the terminal 140 in Fig. 1. For example, the terminal 400 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, and so on.
Referring to Fig. 4, the terminal 400 may include one or more of the following components: a processing component 402, a memory 404, a power supply component 406, a multimedia component 408, an audio component 410, an input/output (I/O) interface 412, a sensor component 414 and a communication component 416.
The processing component 402 generally controls the overall operation of the terminal 400, such as operations associated with display, telephone calls, data communication, camera operation and recording. The processing component 402 may include one or more processors 418 to execute instructions, so as to complete all or part of the steps of the method described above. In addition, the processing component 402 may include one or more modules that facilitate interaction between the processing component 402 and other components. For example, the processing component 402 may include a multimedia module to facilitate interaction between the multimedia component 408 and the processing component 402.
The memory 404 is configured to store various types of data to support operation of the terminal 400. Examples of such data include instructions for any application or method operated on the terminal 400, contact data, phone book data, messages, pictures, videos, and so on. The memory 404 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk or an optical disc.
The power supply component 406 supplies power to the various components of the terminal 400. The power supply component 406 may include a power management system, one or more power supplies, and other components associated with generating, managing and distributing power for the terminal 400.
The multimedia component 408 includes a screen that provides an output interface between the terminal 400 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides and gestures on the touch panel. A touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 408 includes a front camera and/or a rear camera. When the terminal 400 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or may have focusing and optical zoom capability.
The audio component 410 is configured to output and/or input audio signals. For example, the audio component 410 includes a microphone (MIC); when the terminal 400 is in an operating mode, such as a call mode, a recording mode or a speech recognition mode, the microphone is configured to receive external audio signals. The received audio signals may be further stored in the memory 404 or sent via the communication component 416. In some embodiments, the audio component 410 also includes a loudspeaker for outputting audio signals.
The I/O interface 412 provides an interface between the processing component 402 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and so on. These buttons may include, but are not limited to, a home button, a volume button, a start button and a lock button.
The sensor component 414 includes one or more sensors for providing status assessments of various aspects of the terminal 400. For example, the sensor component 414 can detect the open/closed state of the terminal 400 and the relative positioning of components, such as the display and the keypad of the terminal 400. The sensor component 414 can also detect a change in position of the terminal 400 or of a component of the terminal 400, the presence or absence of user contact with the terminal 400, the orientation or acceleration/deceleration of the terminal 400, and a change in temperature of the terminal 400. The sensor component 414 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 414 may also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 414 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
The communication component 416 is configured to facilitate wired or wireless communication between the terminal 400 and other devices. The terminal 400 can access a wireless network based on a communication standard, such as Wi-Fi, 2G or 3G, or a combination thereof. In one exemplary embodiment, the communication component 416 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 416 also includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
In an exemplary embodiment, the terminal 400 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, for performing the implicit table extraction method provided by the method embodiments above.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, for example the memory 404 including instructions, where the instructions can be executed by the processor 418 of the terminal 400 to complete the implicit table extraction method described above. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and so on.
It should be understood that, as used herein, unless the context clearly supports an exception, the singular forms "a", "an" and "the" are intended to include the plural forms as well. It should further be understood that "and/or" as used herein refers to any and all possible combinations of one or more of the associated listed items.
Other embodiments of the application will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed herein. The application is intended to cover any variations, uses or adaptations of the application that follow the general principles of the application and include common knowledge or conventional technical means in the art that are not disclosed in the application. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the application indicated by the following claims.
It should be understood that the application is not limited to the precise structure described above and shown in the drawings, and that various modifications and changes can be made without departing from its scope. The scope of the application is limited only by the appended claims.