CN107622041A - recessive table extracting method and device - Google Patents

Publication number
CN107622041A
Authority
CN
China
Prior art keywords
character
range
cells
character set
coordinate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710839286.6A
Other languages
Chinese (zh)
Other versions
CN107622041B (en)
Inventor
于闪闪
张青
程剑华
蒋宏飞
晋耀红
杨凯程
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Science and Technology (Beijing) Co., Ltd.
Original Assignee
Beijing Shenzhou Taiyue Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shenzhou Taiyue Software Co Ltd
Priority to CN201710839286.6A
Publication of CN107622041A
Application granted
Publication of CN107622041B
Legal status: Active
Anticipated expiration

Landscapes

  • Image Processing (AREA)
  • Character Input (AREA)

Abstract

The invention discloses a recessive table extraction method and device, belonging to the technical field of data processing. The method includes: according to the coordinates corresponding to each character, determining characters whose distance meets a preset proximity condition to be characters in the same recessive table, and dividing the characters in the same recessive table into the same character set; determining the cell range corresponding to each character set according to the coordinates of the characters in that character set; and generating a dominant table according to the characters included in each character set, the coordinates corresponding to each character, and the cell range corresponding to each character set. This solves the problem that existing PDF extraction techniques lack a corresponding way to extract table data from PDF documents, and achieves the effect of determining the cell ranges of a recessive table from the coordinates of the characters in the recessive table of the target document and generating a dominant table from the determined cell ranges.

Description

Recessive table extracting method and device
Technical field
The present invention relates to the technical field of data processing, and in particular to a recessive table extraction method and device.
Background
With the rapid development of computer and Internet technology, the Portable Document Format (PDF) is used more and more widely.
Because PDF was originally designed only for displaying and printing documents, it has no facility for communicating or interacting with other computer programs. Therefore, the data contained in a PDF document can be used by other computer programs only after it has been extracted by a corresponding PDF extraction technique.
A PDF document mainly consists of data such as images, tables and characters. Existing PDF extraction techniques can extract the character data in a PDF document fairly accurately, but lack a corresponding way to extract its table data.
Summary of the invention
To solve the problem that existing PDF extraction techniques lack a corresponding way to extract table data from PDF documents, embodiments of the present invention provide a recessive table extraction method and device. The technical solution is as follows:
In a first aspect, a recessive table extraction method is provided, the method including:
parsing a target document to obtain each character in the target document and the coordinates corresponding to each character;
according to the coordinates corresponding to each character, determining characters whose distance meets a preset proximity condition to be characters in the same recessive table, and dividing the characters in the same recessive table into the same character set;
determining the cell range corresponding to each character set according to the coordinates of the characters in that character set;
generating a dominant table according to the characters included in each character set, the coordinates corresponding to each character, and the cell range corresponding to each character set.
In a second aspect, a recessive table extraction device is provided, the device including:
a parsing module, configured to parse a target document to obtain each character in the target document and the coordinates corresponding to each character;
a first division module, configured to determine, according to the coordinates corresponding to each character, characters whose distance meets a preset proximity condition to be characters in the same recessive table, and to divide the characters in the same recessive table into the same character set;
a first determining module, configured to determine the cell range corresponding to each character set according to the coordinates of the characters in that character set;
a generation module, configured to generate a dominant table according to the characters included in each character set, the coordinates corresponding to each character, and the cell range corresponding to each character set.
The beneficial effects of the technical solution provided by the embodiments of the present invention are as follows:
The character sets and their corresponding cell ranges are determined from the coordinates corresponding to each character in the target document, and a dominant table is generated. Because the cell range can be determined fairly accurately from the coordinates of the characters in a character set, this solves the problem that existing PDF extraction techniques lack a corresponding way to extract table data from PDF documents, and achieves the effect of determining the cell ranges of a recessive table from the coordinates of the characters in the recessive table of the target document and generating a dominant table from the determined cell ranges.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1A is a flowchart of the recessive table extraction method provided by one embodiment of the present invention;
Fig. 1B is a schematic diagram, provided by one embodiment of the present invention, of dividing the characters in the same recessive table into the same character set;
Fig. 1C is a flowchart, provided by one embodiment of the present invention, of the method for determining the cell range corresponding to each character set;
Fig. 1D is a schematic diagram, provided by one embodiment of the present invention, of determining the cell range corresponding to a character set;
Fig. 2A is a flowchart of the recessive table extraction method provided by another embodiment of the present invention;
Fig. 2B is a flowchart, provided by one embodiment of the present invention, of the method for dividing the character sets in the same recessive table into the same set group;
Fig. 2C is a schematic diagram, provided by one embodiment of the present invention, of adjusting the coordinates of the adjacent vertices of adjacent cell ranges so that the adjusted adjacent vertices coincide;
Fig. 3 is a block diagram of the recessive table extraction device provided in one embodiment of the present invention;
Fig. 4 is a block diagram of a terminal according to an exemplary embodiment.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Refer to Fig. 1A, which shows a flowchart of the recessive table extraction method provided by one embodiment of the present invention. The recessive table extraction method may include the following steps:
Step 101: parse the target document to obtain each character in the target document and the coordinates corresponding to each character.
Optionally, the target document is a PDF document or a picture.
The target document is parsed page by page in page-number order, traversing every page, to obtain each character on each page of the target document and the coordinates corresponding to each character.
Step 102: according to the coordinates corresponding to each character, determine characters whose distance meets the preset proximity condition to be characters in the same recessive table, and divide the characters in the same recessive table into the same character set.
The coordinates corresponding to a character in this embodiment are the coordinates of the character's center.
Optionally, the preset proximity condition is that the difference between the abscissas is less than a first preset threshold, or the difference between the ordinates is less than a second preset threshold.
On each target page obtained by parsing the target document, the distances between the coordinates corresponding to the characters are compared in turn. Characters whose abscissa difference is less than the first preset threshold, or whose ordinate difference is less than the second preset threshold, are determined to be characters in the same recessive table, and the characters in the same recessive table are divided into the same character set.
It should be noted that a necessary precondition for judging whether the distance between the coordinates of two characters meets the preset proximity condition is that, in the horizontal or vertical direction, each of the two characters is one of the characters at the shortest coordinate distance from the other.
Refer to Fig. 1B, which shows a schematic diagram, provided by one embodiment of the present invention, of dividing the characters in the same recessive table into the same character set. On the target page 10 obtained by parsing the target document, the distances between the coordinates corresponding to the characters are compared in turn. Taking the character set "ABCDEDGHIJKLMN" as an example: the differences between the abscissas of characters "A" and "B", "B" and "C", "C" and "D", "D" and "E", "E" and "F", "F" and "G", "G" and "H", and "H" and "I" are each less than the first preset threshold; the differences between the ordinates of characters "A" and "F", "B" and "G", "C" and "H", and "D" and "I" are each less than the second preset threshold.
For example, for character set "ABCDEFGHI" and character set "1234567890", because the difference between the abscissas of character "E" and character "1" is greater than the first preset threshold, character set "ABCDEFGHI" and character set "1234567890" are two different character sets. As another example, for character set "ABCDEFGHI" and character set "DDDDDD", because the difference between the ordinates of character "F" and character "D" is greater than the second preset threshold, character set "ABCDEFGHI" and character set "DDDDDD" are two different character sets.
Optionally, the first preset threshold is related to the character width, and the second preset threshold is related to the character height.
Optionally, the first preset threshold and the second preset threshold are default values, preset by the system or set manually.
It should be noted that the first preset threshold and the second preset threshold may be the same or different.
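The grouping in step 102 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `group_characters`, the `(char, x, y)` tuple layout, and the `align_tol` tolerance used to decide that two characters lie on the same row or column are all assumptions introduced here. Linking every pair within the threshold and taking the transitive closure yields the same sets as linking only nearest neighbors.

```python
from collections import defaultdict

def group_characters(chars, t1, t2, align_tol=1.0):
    """Group (char, x, y) tuples into character sets.

    t1 / t2 play the role of the first / second preset thresholds.
    align_tol is a hypothetical tolerance (not in the source) for
    treating two centers as lying on the same row or column.
    """
    parent = list(range(len(chars)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (_, xi, yi) in enumerate(chars):
        for j in range(i + 1, len(chars)):
            _, xj, yj = chars[j]
            dx, dy = abs(xi - xj), abs(yi - yj)
            same_row = dy <= align_tol and dx < t1   # horizontal neighbors
            same_col = dx <= align_tol and dy < t2   # vertical neighbors
            if same_row or same_col:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i, ch in enumerate(chars):
        groups[find(i)].append(ch)
    return list(groups.values())

chars = [("A", 0, 0), ("B", 10, 0), ("C", 20, 0), ("F", 0, 12),
         ("1", 100, 0), ("2", 110, 0)]
sets = group_characters(chars, t1=15, t2=15)
texts = sorted("".join(c for c, _, _ in s) for s in sets)
```

Here "A", "B", "C" and "F" end up in one character set, while "1" and "2" form another because their abscissas are far from "C".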
Step 103: determine the cell range corresponding to each character set according to the coordinates of the characters in that character set.
In one possible implementation, the cell range corresponding to a character set is in effect defined by the vertex coordinates of a rectangular cell that contains all the characters in the character set. Therefore, so that the cell range contains all the characters in the character set, it can be determined from the minimum abscissa, maximum abscissa, minimum ordinate and maximum ordinate of the characters in the character set.
Fig. 1C is a flowchart, provided by one embodiment of the present invention, of the method for determining the cell range corresponding to each character set. As shown in Fig. 1C, step 103 may be replaced by steps 103a to 103f.
Step 103a: for each character set, determine the minimum abscissa and the maximum abscissa among the abscissas of the characters in the character set, and determine the minimum ordinate and the maximum ordinate among the ordinates of the characters in the character set.
Among the abscissas of the characters in a character set, the minimum and maximum abscissas can be regarded as the abscissas of the two vertical boundaries of the cell range corresponding to the character set; among the ordinates of the characters, the minimum and maximum ordinates can be regarded as the ordinates of the two horizontal boundaries of that cell range. The boundaries determined by the minimum and maximum abscissas enclose the characters whose abscissas lie within that abscissa range, and the boundaries determined by the minimum and maximum ordinates enclose the characters whose ordinates lie within that ordinate range. Therefore, a cell range containing all the characters in the character set can be determined from the minimum abscissa, maximum abscissa, minimum ordinate and maximum ordinate of the characters in the character set.
Step 103b: take the value obtained by subtracting a first predetermined value from the minimum abscissa, together with the value obtained by subtracting a second predetermined value from the minimum ordinate, as the first vertex coordinates of the cell range corresponding to the character set.
Step 103c: take the value obtained by subtracting the first predetermined value from the minimum abscissa, together with the value obtained by adding the second predetermined value to the maximum ordinate, as the second vertex coordinates of the cell range corresponding to the character set.
Step 103d: take the value obtained by adding the first predetermined value to the maximum abscissa, together with the value obtained by subtracting the second predetermined value from the minimum ordinate, as the third vertex coordinates of the cell range corresponding to the character set.
Step 103e: take the value obtained by adding the first predetermined value to the maximum abscissa, together with the value obtained by adding the second predetermined value to the maximum ordinate, as the fourth vertex coordinates of the cell range corresponding to the character set.
Step 103f: determine the cell range corresponding to the character set according to the first, second, third and fourth vertex coordinates.
The coordinates corresponding to a character are the coordinates of its center, and the minimum abscissa, maximum abscissa, minimum ordinate and maximum ordinate that define the cell range are obtained from the coordinates of the characters in the character set. If the cell range is determined directly from the character coordinates, the boundary of the corresponding cell will therefore inevitably fail to fully enclose some of the characters in the character set. Fig. 1D is a schematic diagram, provided by one embodiment of the present invention, of determining the cell range corresponding to a character set. As shown in Fig. 1D, the cell range 20 determined from the minimum abscissa x1 and maximum abscissa x2 among the abscissas of the characters, and the minimum ordinate y1 and maximum ordinate y2 among their ordinates, cannot fully enclose some of the characters in the character set.
To avoid a cell range that fails to fully enclose some of the characters in the character set, the font size, or the height and width, of the characters needs to be taken into account when determining the cell range.
Optionally, the first predetermined value is related to the character width, and the second predetermined value is related to the character height. For example, the first predetermined value is the character width and the second predetermined value is the character height; or the first predetermined value is half the character width and the second predetermined value is half the character height. The width and height of each character are obtained by parsing in step 101.
Optionally, the first predetermined value and the second predetermined value are default values, preset by the system or set manually.
Referring also to Fig. 1D, the value x1' obtained by subtracting the first predetermined value from the minimum abscissa x1, together with the value y1' obtained by subtracting the second predetermined value from the minimum ordinate y1, is taken as the first vertex coordinates (x1', y1') of the cell range corresponding to the character set. The value x1', together with the value y2' obtained by adding the second predetermined value to the maximum ordinate y2, is taken as the second vertex coordinates (x1', y2'). The value x2' obtained by adding the first predetermined value to the maximum abscissa x2, together with the value y1', is taken as the third vertex coordinates (x2', y1'). The value x2', together with the value y2', is taken as the fourth vertex coordinates (x2', y2').
Compared with the cell range 20 determined from the minimum abscissa x1, maximum abscissa x2, minimum ordinate y1 and maximum ordinate y2 of the characters in the character set, the cell range 21 determined from the first vertex coordinates (x1', y1'), the second vertex coordinates (x1', y2'), the third vertex coordinates (x2', y1') and the fourth vertex coordinates (x2', y2') fully encloses every character of the character set "XXXXXXXXXXXXXXXX".
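Steps 103a to 103f amount to padding the bounding box of the character centers. The following is a minimal sketch under stated assumptions: the name `cell_range`, the `(char, x, y)` tuple layout, and the parameter names `pad_x` and `pad_y` (standing in for the first and second predetermined values) are illustrative and do not come from the source.

```python
def cell_range(chars, pad_x, pad_y):
    """Return the four vertices of the cell range for one character set.

    chars: list of (char, x, y) center coordinates.
    pad_x / pad_y: the first / second predetermined values, e.g. half the
    character width and half the character height.
    """
    xs = [x for _, x, _ in chars]
    ys = [y for _, _, y in chars]
    x1, x2 = min(xs) - pad_x, max(xs) + pad_x   # x1', x2'
    y1, y2 = min(ys) - pad_y, max(ys) + pad_y   # y1', y2'
    # First to fourth vertex, in the order of steps 103b to 103e.
    return [(x1, y1), (x1, y2), (x2, y1), (x2, y2)]

vertices = cell_range([("A", 10, 20), ("B", 30, 20), ("C", 10, 40)],
                      pad_x=5, pad_y=6)
```

With a padding of at least half the character width and height, the rectangle covers the full glyph boxes rather than only their centers.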
Step 104: generate a dominant table according to the characters included in each character set, the coordinates corresponding to each character, and the cell range corresponding to each character set.
After the four sides of the cell corresponding to a character set are determined and generated from the first, second, third and fourth vertex coordinates, each character of the character set is filled into the cell range corresponding to the character set according to the character's coordinates, generating the dominant table.
In summary, the recessive table extraction method provided by this embodiment of the present invention determines the character sets and their corresponding cell ranges from the coordinates corresponding to each character in the target document and generates a dominant table. Because the cell range can be determined fairly accurately from the coordinates of the characters in a character set, this solves the problem that existing PDF extraction techniques lack a corresponding way to extract table data from PDF documents, and achieves the effect of determining the cell ranges of a recessive table from the coordinates of the characters in the recessive table of the target document and generating a dominant table from the determined cell ranges.
One character set corresponds to one cell. Because a table contains more than one cell, after the cell range corresponding to each character set is determined, the coordinates of the adjacent sides of adjacent cell ranges belonging to the same table need to be adjusted so that those adjacent sides coincide. Refer to Fig. 2A, which shows a flowchart of the recessive table extraction method provided by another embodiment of the present invention. The recessive table extraction method may include the following steps:
Step 201: parse the target document to obtain each character in the target document and the coordinates corresponding to each character.
Step 202: according to the coordinates corresponding to each character, divide characters whose distance meets the preset proximity condition into the same character set.
Step 203: determine the cell range corresponding to each character set according to the coordinates of the characters in that character set.
Step 204: determine character sets whose corresponding cell ranges are less than a third preset threshold apart to be character sets in the same recessive table, and divide the character sets in the same recessive table into the same set group.
Optionally, the distance between two cell ranges is the distance between the center coordinates of the two cell ranges.
Optionally, the distance between two cell ranges is the distance between adjacent vertex coordinates among the vertex coordinates corresponding to the two cell ranges.
Referring also to Fig. 1B, for example, the distance between cell range 11 corresponding to character set "ABCDEFGHI" and cell range 12 corresponding to character set "1234567890" is less than the third preset threshold, so character set "ABCDEFGHI" and character set "1234567890" are determined to be character sets in the same recessive table and are divided into the same set group. Likewise, the distance between cell range 11 corresponding to character set "ABCDEFGHI" and cell range 13 corresponding to character set "DDDDDD" is less than the third preset threshold, so character set "ABCDEFGHI" and character set "DDDDDD" are divided into the same set group. As another example, the distance between cell range 13 corresponding to character set "DDDDDD" and cell range 14 corresponding to character set "FFFFFFF" is greater than the third preset threshold, so character set "DDDDDD" and character set "FFFFFFF" are judged not to be character sets in the same recessive table, that is, they do not belong to the same set group.
It should be noted that when the distances between each of two cell ranges and one and the same cell range are both less than the third preset threshold, the character sets corresponding to those two cell ranges are also determined to be character sets in the same recessive table, and the character sets in the same recessive table are divided into the same set group. Referring also to Fig. 1B, for example, consider cell range 11 corresponding to character set "ABCDEFGHI", cell range 12 corresponding to character set "1234567890" and cell range 13 corresponding to character set "DDDDDD". The distance between cell range 11 and cell range 12 is less than the third preset threshold, and the distance between cell range 11 and cell range 13 is less than the third preset threshold, so character set "ABCDEFGHI", character set "1234567890" and character set "DDDDDD" are determined to be character sets in the same recessive table and are divided into the same set group.
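Because two cell ranges that are each close to a common third range also join the same set group, step 204 behaves like finding connected components. The sketch below assumes the center-to-center definition of distance, represents each cell range as a hypothetical `(x_min, y_min, x_max, y_max)` tuple, and uses the illustrative names `center` and `group_ranges`; none of these appear in the source.

```python
import math

def center(r):
    """Center of a cell range given as (x_min, y_min, x_max, y_max)."""
    return ((r[0] + r[2]) / 2, (r[1] + r[3]) / 2)

def group_ranges(ranges, t3):
    """Return set groups as sorted lists of indices into `ranges`.

    Two ranges are linked when their centers are less than t3 apart
    (t3 stands in for the third preset threshold); groups are the
    connected components of that link relation.
    """
    centers = [center(r) for r in ranges]
    seen, groups = set(), []
    for start in range(len(ranges)):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:                      # depth-first traversal
            i = stack.pop()
            group.append(i)
            for j in range(len(ranges)):
                if j not in seen and math.dist(centers[i], centers[j]) < t3:
                    seen.add(j)
                    stack.append(j)
        groups.append(sorted(group))
    return groups

ranges = [(0, 0, 10, 10), (12, 0, 22, 10), (0, 12, 10, 22),
          (100, 100, 110, 110)]
groups = group_ranges(ranges, t3=15)
```

In this example ranges 1 and 2 are about 17 units apart, more than the threshold, yet they share a set group because both are within 12 units of range 0.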
Optionally, the third preset threshold is related to the character width or height.
Optionally, the third preset threshold is a default value, preset by the system or set manually.
In one possible implementation, because the cells belonging to the same table usually have the same cell format, before judging from the distance between the cell ranges of two cells whether the two cells belong to the same table, it is first necessary to judge whether their cell formats are the same. Fig. 2B is a flowchart, provided by one embodiment of the present invention, of the method for dividing the character sets in the same recessive table into the same set group. As shown in Fig. 2B, step 204 may be replaced by:
Step 204a: determine the cell format corresponding to each character set according to the coordinates of the characters included in each character set.
It should be noted that the cell formats mentioned in this embodiment include, but are not limited to, alignment, character font size, character line spacing and character background color.
Step 204b: determine character sets whose corresponding cell ranges are less than the third preset threshold apart and whose corresponding cell formats are the same to be character sets in the same recessive table, and divide the character sets in the same recessive table into the same set group.
Referring also to Fig. 1B, take alignment as the cell format, and consider cell range 11 corresponding to character set "ABCDEFGHI", cell range 12 corresponding to character set "1234567890" and cell range 13 corresponding to character set "DDDDDD". The distance between cell range 11 and cell range 13 is less than the third preset threshold, and their alignment is the same (both cell range 11 and cell range 13 are right-aligned), so character set "ABCDEFGHI" and character set "DDDDDD" are determined to be character sets in the same recessive table and are divided into the same set group. Although the distance between cell range 11 and cell range 12 is less than the third preset threshold, their alignment differs (cell range 11 is right-aligned and cell range 12 is left-aligned), so character set "ABCDEFGHI" and character set "1234567890" are not character sets in the same recessive table, that is, they do not belong to the same set group.
Step 205: for each set group, among the cell ranges corresponding to the character sets included in the set group, adjust the coordinates of the adjacent vertices of adjacent cell ranges so that the adjusted adjacent vertices coincide.
Usually, the adjacent sides of adjacent cells in the same table coincide, cells in the same column have the same width, and cells in the same row have the same height. Because the cell range corresponding to a character set is determined from the coordinates of the characters in the set, the widths and heights of the cell ranges corresponding to character sets with different numbers of characters may differ. Therefore, before the dominant table is generated, the coordinates of the adjacent vertices of adjacent cells in the same table need to be adjusted so that the adjusted adjacent vertices coincide, thereby improving the display of the table.
First, the adjacent cell ranges are determined:
Within the same set group, if the distance between the minimum abscissa among the vertex coordinates of cell range A and the maximum abscissa among the vertex coordinates of cell range B is smaller than the distance between the minimum abscissa of any other cell range in the set group and the maximum abscissa of cell range B, cell range A is determined to be a horizontally adjacent cell range of cell range B.
Within the same set group, if the distance between the maximum abscissa among the vertex coordinates of cell range A and the minimum abscissa among the vertex coordinates of cell range B is smaller than the distance between the maximum abscissa of any other cell range in the set group and the minimum abscissa of cell range B, cell range A is determined to be a horizontally adjacent cell range of cell range B.
Within the same set group, if the distance between the minimum ordinate among the vertex coordinates of cell range A and the maximum ordinate among the vertex coordinates of cell range B is smaller than the distance between the minimum ordinate of any other cell range in the set group and the maximum ordinate of cell range B, cell range A is determined to be a vertically adjacent cell range of cell range B.
Within the same set group, if the distance between the maximum ordinate among the vertex coordinates of cell range A and the minimum ordinate among the vertex coordinates of cell range B is smaller than the distance between the maximum ordinate of any other cell range in the set group and the minimum ordinate of cell range B, cell range A is determined to be a vertically adjacent cell range of cell range B.
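The four adjacency rules above share one pattern: for a given cell range B, pick the cell range A whose relevant edge lies closest to the facing edge of B. A minimal Python sketch of that pattern follows. The `Cell` type, the `nearest` helper, and the coordinates are hypothetical. The sketch applies the rule literally and compares only one coordinate; a practical implementation might additionally need to check that the two cell ranges overlap in the other direction.

```python
from typing import NamedTuple

class Cell(NamedTuple):
    name: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

def nearest(b, group, a_edge, b_edge):
    """Return the cell range A in `group` whose `a_edge` is closest to `b_edge` of B."""
    others = [a for a in group if a is not b]
    return min(others, key=lambda a: abs(getattr(a, a_edge) - getattr(b, b_edge)))

# Invented 2 x 2 layout; the ordinate grows upward.
r30 = Cell("30", 10, 60, 48, 80)
r31 = Cell("31", 52, 60, 90, 82)
r32 = Cell("32", 8, 30, 46, 50)
r33 = Cell("33", 54, 28, 90, 49)
group = [r30, r31, r32, r33]

# Maximum abscissa of A against minimum abscissa of B: horizontal neighbour of 31.
left_of_31 = nearest(r31, group, "x_max", "x_min")
# Maximum ordinate of A against minimum ordinate of B: vertical neighbour of 30.
below_30 = nearest(r30, group, "y_max", "y_min")
```

Here `left_of_31` is cell range 30 and `below_30` is cell range 32.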
Second, the vertex coordinates of each row and each column of cell ranges in the table are adjusted separately (that is, the vertex coordinates of horizontally adjacent cell ranges are adjusted, and the vertex coordinates of vertically adjacent cell ranges are adjusted):
For horizontally adjacent cell ranges, the maximum ordinate and the minimum ordinate among their vertex coordinates are determined; the maximum ordinate thus determined replaces the maximum ordinate of each of these cell ranges, and the minimum ordinate thus determined replaces the minimum ordinate of each of these cell ranges. For vertically adjacent cell ranges, the maximum abscissa and the minimum abscissa among their vertex coordinates are determined; the maximum abscissa thus determined replaces the maximum abscissa of each of these cell ranges, and the minimum abscissa thus determined replaces the minimum abscissa of each of these cell ranges.
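The row and column adjustment described above can be sketched in a few lines of Python. The `CellRange` type, the function names, and the coordinates are assumptions made for illustration only.

```python
from dataclasses import dataclass

@dataclass
class CellRange:
    x_min: float
    y_min: float
    x_max: float
    y_max: float

def align_row(row):
    # Horizontally adjacent cell ranges share the extreme ordinates of the row.
    top = max(c.y_max for c in row)
    bottom = min(c.y_min for c in row)
    for c in row:
        c.y_max, c.y_min = top, bottom

def align_column(column):
    # Vertically adjacent cell ranges share the extreme abscissas of the column.
    right = max(c.x_max for c in column)
    left = min(c.x_min for c in column)
    for c in column:
        c.x_max, c.x_min = right, left

# Invented 2 x 2 layout; the ordinate grows upward.
r30 = CellRange(10, 60, 48, 80)
r31 = CellRange(52, 60, 90, 82)
r32 = CellRange(8, 30, 46, 50)
r33 = CellRange(54, 28, 90, 49)

for row in ([r30, r31], [r32, r33]):
    align_row(row)
for column in ([r30, r32], [r31, r33]):
    align_column(column)
```

After both passes every row has a single top and bottom ordinate and every column has a single left and right abscissa, although neighbouring rows and columns may still be separated by a gap.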
Finally, the vertex coordinates of adjacent rows and adjacent columns of cell ranges in the table are adjusted:
Within the same set group, when cell range A is a horizontally adjacent cell range of cell range B, if the minimum abscissa among the vertex coordinates of cell range A is closest to the maximum abscissa among the vertex coordinates of cell range B, the two values are averaged to obtain an average abscissa. The average abscissa then replaces the minimum abscissa in the vertex coordinates of every cell range in the column containing cell range A and the maximum abscissa in the vertex coordinates of every cell range in the column containing cell range B.
Within the same set group, when cell range A is a horizontally adjacent cell range of cell range B, if the maximum abscissa among the vertex coordinates of cell range A is closest to the minimum abscissa among the vertex coordinates of cell range B, the two values are averaged to obtain an average abscissa. The average abscissa then replaces the maximum abscissa in the vertex coordinates of every cell range in the column containing cell range A and the minimum abscissa in the vertex coordinates of every cell range in the column containing cell range B.
Within the same set group, when cell range A is a vertically adjacent cell range of cell range B, if the minimum ordinate among the vertex coordinates of cell range A is closest to the maximum ordinate among the vertex coordinates of cell range B, the two values are averaged to obtain an average ordinate. The average ordinate then replaces the minimum ordinate in the vertex coordinates of every cell range in the row containing cell range A and the maximum ordinate in the vertex coordinates of every cell range in the row containing cell range B.
Within the same set group, when cell range A is a vertically adjacent cell range of cell range B, if the maximum ordinate among the vertex coordinates of cell range A is closest to the minimum ordinate among the vertex coordinates of cell range B, the two values are averaged to obtain an average ordinate. The average ordinate then replaces the maximum ordinate in the vertex coordinates of every cell range in the row containing cell range A and the minimum ordinate in the vertex coordinates of every cell range in the row containing cell range B.
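The averaging step can be sketched as follows. The sketch assumes that each row and each column has already been aligned internally, so any cell range of a column can supply the column's edge; the type, the function names, and the numbers are hypothetical. The four cases in the text reduce to two functions here, because the caller decides which column is on the left and which row is on top.

```python
from dataclasses import dataclass

@dataclass
class CellRange:
    x_min: float
    y_min: float
    x_max: float
    y_max: float

def merge_column_border(left_column, right_column):
    # Average the facing abscissas and write the result to both columns.
    shared = (left_column[0].x_max + right_column[0].x_min) / 2
    for c in left_column:
        c.x_max = shared
    for c in right_column:
        c.x_min = shared
    return shared

def merge_row_border(upper_row, lower_row):
    # Average the facing ordinates and write the result to both rows.
    shared = (upper_row[0].y_min + lower_row[0].y_max) / 2
    for c in upper_row:
        c.y_min = shared
    for c in lower_row:
        c.y_max = shared
    return shared

# Invented 2 x 2 layout whose rows and columns are already aligned internally.
r30 = CellRange(8, 60, 48, 82)
r31 = CellRange(52, 60, 90, 82)
r32 = CellRange(8, 28, 48, 50)
r33 = CellRange(52, 28, 90, 50)

shared_x = merge_column_border([r30, r32], [r31, r33])
shared_y = merge_row_border([r30, r31], [r32, r33])
```

With these values the shared vertical border lies at abscissa 50 and the shared horizontal border at ordinate 55, so the four cell ranges meet at a single point.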
Fig. 2 C are refer to, the adjacent vertex to adjacent cells lattice scope provided it illustrates one embodiment of the invention Coordinate is adjusted, the schematic diagram for making the adjacent vertex after regulation coincide.Wherein, range of cells 30, range of cells 31, Character set corresponding to range of cells 32 and range of cells 33 is identity set group
Step 1, determine adjacent cells lattice scope:
Due to the maximum abscissa (i.e. summit 30a and summit 30b abscissa) in the apex coordinate of range of cells 30 The distance between minimum abscissa (i.e. summit 31a and summit 31b abscissa) in the apex coordinate of range of cells 31, Than the top with the range of cells 31 respectively of the maximum abscissa in the apex coordinate of range of cells 32 and range of cells 33 The distance between minimum abscissa in point coordinates is small, therefore, it is determined that range of cells 30 is the laterally adjacent of range of cells 31 Range of cells.
Due to the minimum ordinate (i.e. summit 30c and summit 30b ordinate) in the apex coordinate of range of cells 30 The distance between maximum ordinate (i.e. summit 32a and summit 32b abscissa) in the apex coordinate of range of cells 32, Than the top with the range of cells 32 respectively of the minimum ordinate in the apex coordinate of range of cells 31 and range of cells 33 The distance between maximum ordinate in point coordinates is small, therefore, it is determined that range of cells 30 is the longitudinally adjacent of range of cells 32 Range of cells.
Can similarly obtain, range of cells 32 be range of cells 33 laterally adjacent range of cells, range of cells 31 For the longitudinally adjacent range of cells of range of cells 33.
Step 2: adjust the vertex coordinates of each row and each column of cell ranges in the table separately:
Among the vertex coordinates of the horizontally adjacent cell ranges 30 and 31, the maximum ordinate (the ordinate of vertex 30a) and the minimum ordinate (the ordinate of vertices 30c, 30b, and 31b) are determined. The maximum ordinate thus determined replaces the maximum ordinate of cell range 31 (the ordinate of vertex 31a). Because the minimum ordinates of cell range 30 and cell range 31 are already the same, no replacement is performed for the minimum ordinate.
Among the vertex coordinates of the vertically adjacent cell ranges 30 and 32, the maximum abscissa (the abscissa of vertices 30a and 32b) and the minimum abscissa (the abscissa of vertex 32a) are determined. The minimum abscissa thus determined replaces the minimum abscissa of cell range 30 (the abscissa of vertex 30c). Because the maximum abscissas of cell range 30 and cell range 32 are already the same, no replacement is performed for the maximum abscissa.
Similarly, the vertex coordinates of the horizontally adjacent cell ranges 32 and 33 are adjusted, and the vertex coordinates of the vertically adjacent cell ranges 31 and 33 are adjusted.
In the adjusted table, the cell ranges in each row share the same minimum ordinate and the same maximum ordinate, and the cell ranges in each column share the same minimum abscissa and the same maximum abscissa.
Step 3: adjust the vertex coordinates of adjacent rows and adjacent columns of cell ranges in the table:
Because the maximum abscissa among the vertex coordinates of cell range 30 is closest to the minimum abscissa among the vertex coordinates of cell range 31, vertex 30a (which carries the maximum abscissa of cell range 30) and vertex 31a (which carries the minimum abscissa of cell range 31) are determined to be adjacent vertices, as are vertex 30b and vertex 31b. The abscissas of vertices 30a and 31a, or the abscissas of vertices 30b and 31b, are averaged to obtain an average abscissa. The average abscissa replaces the maximum abscissa in the vertex coordinates of all cell ranges in the column containing cell range 30 (cell ranges 30 and 32), namely the abscissas of vertices 30a, 30b, 32b, and 32c, and the minimum abscissa in the vertex coordinates of all cell ranges in the column containing cell range 31 (cell ranges 31 and 33), namely the abscissas of vertices 31a, 31b, 33a, and 33b.
Because the minimum ordinate among the vertex coordinates of cell range 30 is closest to the maximum ordinate among the vertex coordinates of cell range 32, vertex 30c (which carries the minimum ordinate of cell range 30) and vertex 32a (which carries the maximum ordinate of cell range 32) are determined to be adjacent vertices, as are vertex 30b and vertex 32b. The ordinates of vertices 30c and 32a, or the ordinates of vertices 30b and 32b, are averaged to obtain an average ordinate. The average ordinate replaces the minimum ordinate in the vertex coordinates of all cell ranges in the row containing cell range 30 (cell ranges 30 and 31), namely the ordinates of vertices 30c, 30b, and 31b, and the maximum ordinate in the vertex coordinates of all cell ranges in the row containing cell range 32 (cell ranges 32 and 33), namely the ordinates of vertices 32a, 32b, and 33a.
Step 206: for each set group that contains multiple character sets, generate a dominant table according to the characters contained in each character set in the set group, the coordinates corresponding to each character, and the cell range corresponding to each character set in the set group.
After the four edges of each cell are determined and generated from the adjusted coordinates of each cell range, each character is inserted into the cell range corresponding to its character set, according to the characters contained in each character set in the set group and the coordinates corresponding to each character, and the dominant table is generated.
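One way to picture this last step is to turn the adjusted cell ranges into a grid of rows and columns. The Python sketch below is only an illustration under stated assumptions: the tuple layout, the `build_table` name, and the coordinates are invented, the borders are assumed to coincide already, and the ordinate is assumed to grow upward. It returns the table as nested lists rather than drawing cell edges.

```python
def build_table(cells):
    """Arrange (text, x_min, y_min, x_max, y_max) cells into rows and columns.

    Assumes the cell borders were already adjusted so that cells in one column
    share x_min and cells in one row share y_max.
    """
    column_edges = sorted({c[1] for c in cells})              # left to right
    row_edges = sorted({c[4] for c in cells}, reverse=True)   # top to bottom
    table = [["" for _ in column_edges] for _ in row_edges]
    for text, x_min, _y_min, _x_max, y_max in cells:
        table[row_edges.index(y_max)][column_edges.index(x_min)] = text
    return table

cells = [
    ("A", 8, 55, 50, 82),
    ("B", 50, 55, 90, 82),
    ("C", 8, 28, 50, 55),
    ("D", 50, 28, 90, 55),
]
table = build_table(cells)
```

For the four invented cells the result has two rows, "A" and "B" on top and "C" and "D" below.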
It should be noted that steps 201 to 203 are similar to steps 101 to 103, so this embodiment does not describe steps 201 to 203 again.
In summary, the recessive table extraction method provided by this embodiment of the present invention determines the character sets and the cell range corresponding to each character set from the coordinates corresponding to each character in the target document, and generates a dominant table. Because the cell ranges can be determined fairly accurately from the coordinates of the characters in each character set, the method addresses the problem that existing PDF extraction techniques lack a way to extract table data from PDF documents. It achieves the effect of determining the cell ranges of a recessive table from the coordinates of the characters of that table in the target document, and of generating a dominant table from the determined cell ranges.
The following are device embodiments of the present invention. For details not described in the device embodiments, refer to the corresponding method embodiments above.
Refer to Fig. 3, which is a block diagram of the recessive table extraction device provided in one embodiment of the present invention. The device includes a parsing module 301, a first division module 302, a first determining module 303, and a generation module 304.
The parsing module 301 is configured to parse a target document to obtain each character in the target document and the coordinates corresponding to each character;
The first division module 302 is configured to determine, according to the coordinates corresponding to each character, characters whose distance meets a preset proximity condition to be characters in the same recessive table, and to assign the characters in the same recessive table to the same character set;
The first determining module 303 is configured to determine the cell range corresponding to each character set according to the coordinates of the characters in that character set;
The generation module 304 is configured to generate a dominant table according to the characters contained in each character set, the coordinates corresponding to each character, and the cell range corresponding to each character set.
In one possible implementation, the first division module 302 is further configured to:
determine characters whose abscissas differ by less than a first preset threshold, or whose ordinates differ by less than a second preset threshold, to be characters in the same recessive table, and assign the characters in the same recessive table to the same character set.
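The threshold test above can be sketched in Python as a pairwise check combined with a union-find merge. This is one possible reading, not the patented procedure: the function name, the parameter names `dx_max` and `dy_max` (standing in for the first and second preset thresholds), and the coordinates are assumptions. Note that because either condition is sufficient, characters on the same line merge regardless of their horizontal distance; the invented example places the two strings on different lines and in different columns.

```python
def group_characters(chars, dx_max, dy_max):
    """Group (char, x, y) triples whose abscissas OR ordinates are close."""
    parent = list(range(len(chars)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (_, xi, yi) in enumerate(chars):
        for j in range(i + 1, len(chars)):
            _, xj, yj = chars[j]
            if abs(xi - xj) < dx_max or abs(yi - yj) < dy_max:
                parent[find(j)] = find(i)

    groups = {}
    for i, (ch, _, _) in enumerate(chars):
        groups.setdefault(find(i), []).append(ch)
    return ["".join(g) for g in groups.values()]

chars = [("A", 10, 100), ("B", 16, 100), ("1", 200, 40), ("2", 206, 40)]
character_sets = group_characters(chars, dx_max=8, dy_max=3)
```

With these values the characters fall into the two character sets "AB" and "12".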
In one possible implementation, the first determining module 303 includes:
a first determining unit, configured to determine, for each character set, the minimum abscissa and the maximum abscissa among the abscissas of the characters in the character set, and the minimum ordinate and the maximum ordinate among the ordinates of the characters in the character set;
a second determining unit, configured to determine the value obtained by subtracting a first preset value from the minimum abscissa, together with the value obtained by subtracting a second preset value from the minimum ordinate, as the first vertex coordinate of the cell range corresponding to the character set;
a third determining unit, configured to determine the value obtained by subtracting the first preset value from the minimum abscissa, together with the value obtained by adding the second preset value to the maximum ordinate, as the second vertex coordinate of the cell range corresponding to the character set;
a fourth determining unit, configured to determine the value obtained by adding the first preset value to the maximum abscissa, together with the value obtained by subtracting the second preset value from the minimum ordinate, as the third vertex coordinate of the cell range corresponding to the character set;
a fifth determining unit, configured to determine the value obtained by adding the first preset value to the maximum abscissa, together with the value obtained by adding the second preset value to the maximum ordinate, as the fourth vertex coordinate of the cell range corresponding to the character set;
a sixth determining unit, configured to determine the cell range corresponding to the character set according to the first vertex coordinate, the second vertex coordinate, the third vertex coordinate, and the fourth vertex coordinate.
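The four vertex coordinates described by these units amount to a padded bounding box around the characters of a character set. A minimal Python sketch, in which the function name, the parameter names `pad_x` and `pad_y` (standing in for the first and second preset values), and the sample coordinates are assumptions:

```python
def cell_range_vertices(char_coords, pad_x, pad_y):
    """Return the four vertices of the cell range for one character set.

    char_coords: list of (x, y) character coordinates.
    pad_x, pad_y: hypothetical names for the first and second preset values.
    """
    xs = [x for x, _ in char_coords]
    ys = [y for _, y in char_coords]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    first = (x_min - pad_x, y_min - pad_y)
    second = (x_min - pad_x, y_max + pad_y)
    third = (x_max + pad_x, y_min - pad_y)
    fourth = (x_max + pad_x, y_max + pad_y)
    return first, second, third, fourth

vertices = cell_range_vertices([(10, 5), (16, 5), (22, 6)], pad_x=2, pad_y=1)
```

For the three invented character positions the vertices are (8, 4), (8, 7), (24, 4), and (24, 7).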
In one possible implementation, the device further includes:
a second division module, configured to, after the cell range corresponding to each character set is determined, determine character sets whose corresponding cell ranges are separated by a distance less than a third preset threshold to be character sets in the same recessive table, and to assign the character sets in the same recessive table to the same set group;
The generation module is further configured to:
for each set group that contains multiple character sets, generate a dominant table according to the characters contained in each character set in the set group, the coordinates corresponding to each character, and the cell range corresponding to each character set in the set group.
In one possible implementation, the device further includes:
a second determining module, configured to determine the cell format corresponding to each character set according to the coordinates of the characters contained in that character set;
The second division module is further configured to:
determine character sets whose corresponding cell ranges are separated by a distance less than the third preset threshold and whose corresponding cell formats are the same to be character sets in the same recessive table, and assign the character sets in the same recessive table to the same set group.
In one possible implementation, the device further includes:
an adjustment module, configured to, before the dominant table is generated according to the characters contained in each character set, the coordinates corresponding to each character, and the cell range corresponding to each character set, adjust, for each set group and among the cell ranges corresponding to the character sets that the set group contains, the coordinates of the adjacent vertices of adjacent cell ranges so that the adjacent vertices coincide after adjustment.
In summary, the recessive table extraction device provided by this embodiment of the present invention determines the character sets and the cell range corresponding to each character set from the coordinates corresponding to each character in the target document, and generates a dominant table. Because the cell ranges can be determined fairly accurately from the coordinates of the characters in each character set, the device addresses the problem that existing PDF extraction techniques lack a way to extract table data from PDF documents. It achieves the effect of determining the cell ranges of a recessive table from the coordinates of the characters of that table in the target document, and of generating a dominant table from the determined cell ranges.
It should be noted that when the recessive table extraction device provided in the above embodiment extracts a table, the division into the functional modules described above is only an example. In practical applications, the functions may be assigned to different functional modules as needed; that is, the internal structure of the electronic device may be divided into different functional modules to complete all or some of the functions described above. In addition, the recessive table extraction device and the recessive table extraction method provided by the above embodiments belong to the same concept; for the specific implementation process, refer to the method embodiments, which are not repeated here.
Fig. 4 is a block diagram of a terminal according to an exemplary embodiment. The terminal 400 may be implemented as the terminal 140 in Fig. 1. For example, terminal 400 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, fitness equipment, a personal digital assistant, or the like.
Referring to Fig. 4, terminal 400 may include one or more of the following components: a processing component 402, a memory 404, a power component 406, a multimedia component 408, an audio component 410, an input/output (I/O) interface 412, a sensor component 414, and a communication component 416.
The processing component 402 generally controls the overall operation of terminal 400, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 402 may include one or more processors 418 that execute instructions to complete all or some of the steps of the method described above. In addition, the processing component 402 may include one or more modules that facilitate interaction between the processing component 402 and other components. For example, the processing component 402 may include a multimedia module to facilitate interaction between the multimedia component 408 and the processing component 402.
The memory 404 is configured to store various types of data to support operation of terminal 400. Examples of such data include instructions for any application or method operated on terminal 400, contact data, phone book data, messages, pictures, videos, and so on. The memory 404 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The power component 406 supplies power to the various components of terminal 400. The power component 406 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for terminal 400.
The multimedia component 408 includes a screen that provides an output interface between terminal 400 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. A touch sensor may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 408 includes a front camera and/or a rear camera. When terminal 400 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or may have focusing and optical zoom capability.
The audio component 410 is configured to output and/or input audio signals. For example, the audio component 410 includes a microphone (MIC), which is configured to receive external audio signals when terminal 400 is in an operating mode, such as a call mode, a recording mode, or a speech recognition mode. The received audio signal may be further stored in the memory 404 or sent via the communication component 416. In some embodiments, the audio component 410 also includes a loudspeaker for outputting audio signals.
The I/O interface 412 provides an interface between the processing component 402 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and so on. These buttons may include, but are not limited to, a home button, a volume button, a start button, and a lock button.
The sensor component 414 includes one or more sensors for providing status assessments of various aspects of terminal 400. For example, the sensor component 414 can detect the open/closed state of terminal 400 and the relative positioning of components, such as the display and keypad of terminal 400. The sensor component 414 can also detect a change in position of terminal 400 or of a component of terminal 400, the presence or absence of user contact with terminal 400, the orientation or acceleration/deceleration of terminal 400, and a change in temperature of terminal 400. The sensor component 414 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 414 may also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 414 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 416 is configured to facilitate wired or wireless communication between terminal 400 and other devices. Terminal 400 can access a wireless network based on a communication standard, such as Wi-Fi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 416 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 416 also includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, terminal 400 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the recessive table extraction method provided by the method embodiments above.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, for example the memory 404 including instructions, where the instructions can be executed by the processor 418 of terminal 400 to complete the recessive table extraction method described above. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
It should be understood that, as used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that "and/or" as used herein refers to and covers any and all possible combinations of one or more of the associated listed items.
Other embodiments of the present application will readily occur to those skilled in the art after they consider the specification and practice the invention disclosed herein. The present application is intended to cover any variations, uses, or adaptations of the present application that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed in the present application. The description and embodiments are to be regarded as exemplary only, and the true scope and spirit of the present application are indicated by the following claims.
It should be understood that the present application is not limited to the precise structure described above and shown in the drawings, and that various modifications and changes can be made without departing from its scope. The scope of the present application is limited only by the appended claims.

Claims (10)

1. A recessive table extraction method, characterized in that the method comprises:
parsing a target document to obtain each character in the target document and the coordinates corresponding to each character;
determining, according to the coordinates corresponding to each character, characters whose distance meets a preset proximity condition to be characters in the same recessive table, and assigning the characters in the same recessive table to the same character set;
determining the cell range corresponding to each character set according to the coordinates of the characters in that character set;
generating a dominant table according to the characters contained in each character set, the coordinates corresponding to each character, and the cell range corresponding to each character set.
2. The method according to claim 1, characterized in that said determining characters whose distance meets a preset proximity condition to be characters in the same recessive table, and assigning the characters in the same recessive table to the same character set, comprises:
determining characters whose abscissas differ by less than a first preset threshold, or whose ordinates differ by less than a second preset threshold, to be characters in the same recessive table, and assigning the characters in the same recessive table to the same character set.
3. The method according to claim 1, characterized in that said determining the cell range corresponding to each character set according to the coordinates of the characters in that character set comprises:
for each character set, determining the minimum abscissa and the maximum abscissa among the abscissas of the characters in the character set, and determining the minimum ordinate and the maximum ordinate among the ordinates of the characters in the character set;
determining the value obtained by subtracting a first preset value from the minimum abscissa, together with the value obtained by subtracting a second preset value from the minimum ordinate, as the first vertex coordinate of the cell range corresponding to the character set;
determining the value obtained by subtracting the first preset value from the minimum abscissa, together with the value obtained by adding the second preset value to the maximum ordinate, as the second vertex coordinate of the cell range corresponding to the character set;
determining the value obtained by adding the first preset value to the maximum abscissa, together with the value obtained by subtracting the second preset value from the minimum ordinate, as the third vertex coordinate of the cell range corresponding to the character set;
determining the value obtained by adding the first preset value to the maximum abscissa, together with the value obtained by adding the second preset value to the maximum ordinate, as the fourth vertex coordinate of the cell range corresponding to the character set;
determining the cell range corresponding to the character set according to the first vertex coordinate, the second vertex coordinate, the third vertex coordinate, and the fourth vertex coordinate.
4. The method according to claim 1, characterized in that, after determining the cell range corresponding to each character set, the method further comprises:
Determining character sets whose corresponding cell ranges are less than a third predetermined threshold apart to be character sets in the same hidden table, and dividing the character sets in the same hidden table into the same set group;
wherein generating the explicit table according to the characters included in each character set, the coordinate corresponding to each character, and the cell range corresponding to each character set comprises:
For each set group that includes multiple character sets, generating an explicit table according to the characters included in each character set in the set group, the coordinate corresponding to each character, and the cell range corresponding to each character set in the set group.
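The grouping step of claim 4 can be sketched as follows. The claim does not define how the distance between two cell ranges is measured, so this sketch assumes the gap between two axis-aligned rectangles, and it assumes that grouping is transitive (implemented with a small union-find). The names `rect_gap`, `group_cell_ranges` and the `(min_x, min_y, max_x, max_y)` representation are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def rect_gap(a: Rect, b: Rect) -> float:
    """Gap between two axis-aligned rectangles (0 if they touch or overlap).

    This distance definition is an assumption; the claim leaves it open.
    """
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)


def group_cell_ranges(rects: List[Rect], threshold: float) -> List[List[int]]:
    """Return set groups as lists of indices into rects."""
    parent = list(range(len(rects)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(rects)):
        for j in range(i + 1, len(rects)):
            # threshold plays the role of the third predetermined threshold
            if rect_gap(rects[i], rects[j]) < threshold:
                parent[find(j)] = find(i)

    groups: Dict[int, List[int]] = {}
    for i in range(len(rects)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

Under claim 4, only the groups that contain more than one character set would then be passed on to table generation, for example `[g for g in groups if len(g) > 1]`.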
5. The method according to claim 4, characterized in that the method further comprises:
Determining the cell format corresponding to each character set according to the coordinates of the characters included in that character set;
wherein determining character sets whose corresponding cell ranges are less than the third predetermined threshold apart to be character sets in the same hidden table, and dividing the character sets in the same hidden table into the same set group, comprises:
Determining character sets whose corresponding cell ranges are less than the third predetermined threshold apart and whose corresponding cell formats are identical to be character sets in the same hidden table, and dividing the character sets in the same hidden table into the same set group.
6. The method according to claim 4, characterized in that, before generating the explicit table according to the characters included in each character set, the coordinate corresponding to each character, and the cell range corresponding to each character set, the method further comprises:
For each set group, among the cell ranges corresponding to the character sets included in the set group, adjusting the coordinates of the adjacent vertices of adjacent cell ranges so that the adjusted adjacent vertices coincide.
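One possible reading of the adjustment in claim 6 is sketched below for a single row of cells. The claim only requires that adjacent vertices coincide after adjustment; placing the shared edge at the midpoint of the gap and stretching all cells of the row to a common height are assumptions made for illustration, as is the function name `align_row`.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def align_row(rects: List[Rect]) -> List[Rect]:
    """Adjust the cell ranges of one row so that neighbouring vertices coincide."""
    if not rects:
        return []
    ordered = sorted(rects, key=lambda r: r[0])  # left to right
    low_y = min(r[1] for r in ordered)   # shared lower ordinate for the row
    high_y = max(r[3] for r in ordered)  # shared upper ordinate for the row

    # Vertical edges: outer edges stay put, inner edges move to the midpoint
    # between the right side of one cell and the left side of the next.
    edges = [ordered[0][0]]
    for left, right in zip(ordered, ordered[1:]):
        edges.append((left[2] + right[0]) / 2)
    edges.append(ordered[-1][2])

    return [(edges[i], low_y, edges[i + 1], high_y) for i in range(len(ordered))]
```

After this step the right-hand vertices of each cell equal the left-hand vertices of its neighbour, so the cells of the row share borders instead of leaving gaps.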
7. A hidden table extraction device, characterized in that the device comprises:
A parsing module, configured to parse a target document to obtain each character in the target document and the coordinate corresponding to each character;
A first division module, configured to determine, according to the coordinate corresponding to each character, characters whose distance satisfies a preset proximity condition to be characters in the same hidden table, and to divide the characters in the same hidden table into the same character set;
A first determining module, configured to determine the cell range corresponding to each character set according to the character coordinates of the characters in that character set;
A generation module, configured to generate an explicit table according to the characters included in each character set, the coordinate corresponding to each character, and the cell range corresponding to each character set.
8. The device according to claim 7, characterized in that the first division module is further configured to:
Determine characters whose abscissas differ by less than a first predetermined threshold, or whose ordinates differ by less than a second predetermined threshold, to be characters in the same hidden table, and divide the characters in the same hidden table into the same character set.
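The proximity condition of claim 8 can be illustrated with the sketch below, which follows the literal wording (abscissa difference below the first threshold, or ordinate difference below the second). Merging characters transitively through a union-find structure is an assumption, as are the names `split_into_character_sets`, `x_threshold` and `y_threshold` and the `(text, x, y)` representation of a parsed character.

```python
from typing import Dict, List, Tuple

Char = Tuple[str, float, float]  # (text, abscissa, ordinate)


def split_into_character_sets(chars: List[Char], x_threshold: float,
                              y_threshold: float) -> List[List[Char]]:
    """Divide parsed characters into character sets using the claim 8 condition."""
    parent = list(range(len(chars)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(chars)):
        for j in range(i + 1, len(chars)):
            _, xi, yi = chars[i]
            _, xj, yj = chars[j]
            # Literal reading of the claim: either coordinate being close is enough.
            if abs(xi - xj) < x_threshold or abs(yi - yj) < y_threshold:
                parent[find(j)] = find(i)

    sets: Dict[int, List[Char]] = {}
    for i, ch in enumerate(chars):
        sets.setdefault(find(i), []).append(ch)
    return list(sets.values())
```

The pairwise comparison is quadratic in the number of characters; for large pages, sorting by coordinate first would reduce the work, but that optimisation is outside what the claim describes.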
9. The device according to claim 7, characterized in that the first determining module comprises:
A first determining unit, configured to determine, for each character set, the minimum abscissa and the maximum abscissa among the abscissas of the characters in the character set, and the minimum ordinate and the maximum ordinate among the ordinates of the characters in the character set;
A second determining unit, configured to take the minimum abscissa minus a first predetermined value, together with the minimum ordinate minus a second predetermined value, as the first vertex coordinate of the cell range corresponding to the character set;
A third determining unit, configured to take the minimum abscissa minus the first predetermined value, together with the maximum ordinate plus the second predetermined value, as the second vertex coordinate of the cell range corresponding to the character set;
A fourth determining unit, configured to take the maximum abscissa plus the first predetermined value, together with the minimum ordinate minus the second predetermined value, as the third vertex coordinate of the cell range corresponding to the character set;
A fifth determining unit, configured to take the maximum abscissa plus the first predetermined value, together with the maximum ordinate plus the second predetermined value, as the fourth vertex coordinate of the cell range corresponding to the character set;
A sixth determining unit, configured to determine the cell range corresponding to the character set according to the first vertex coordinate, the second vertex coordinate, the third vertex coordinate and the fourth vertex coordinate.
10. The device according to claim 7, characterized in that the device further comprises:
A second division module, configured to, after the cell range corresponding to each character set is determined, determine character sets whose corresponding cell ranges are less than a third predetermined threshold apart to be character sets in the same hidden table, and divide the character sets in the same hidden table into the same set group;
The generation module is further configured to:
For each set group that includes multiple character sets, generate an explicit table according to the characters included in each character set in the set group, the coordinate corresponding to each character, and the cell range corresponding to each character set in the set group.
CN201710839286.6A 2017-09-18 2017-09-18 Hidden table extraction method and device Active CN107622041B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710839286.6A CN107622041B (en) 2017-09-18 2017-09-18 Hidden table extraction method and device

Publications (2)

Publication Number Publication Date
CN107622041A true CN107622041A (en) 2018-01-23
CN107622041B CN107622041B (en) 2021-02-12

Family

ID=61090129

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710839286.6A Active CN107622041B (en) 2017-09-18 2017-09-18 Hidden table extraction method and device

Country Status (1)

Country Link
CN (1) CN107622041B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5848186A (en) * 1995-08-11 1998-12-08 Canon Kabushiki Kaisha Feature extraction system for identifying text within a table image
CN106156761A (en) * 2016-08-10 2016-11-23 北京交通大学 The image form detection of facing moving terminal shooting and recognition methods
CN106897690A (en) * 2017-02-22 2017-06-27 南京述酷信息技术有限公司 PDF table extracting methods

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108446152A (en) * 2018-02-02 2018-08-24 阿里巴巴集团控股有限公司 page display method and device
CN110147697A (en) * 2018-02-11 2019-08-20 鼎复数据科技(北京)有限公司 A kind of PDF table extracting method based on man-machine mutual assistance
CN108470021B (en) * 2018-03-26 2022-06-03 阿博茨德(北京)科技有限公司 Method and device for positioning table in PDF document
CN108470021A (en) * 2018-03-26 2018-08-31 阿博茨德(北京)科技有限公司 The localization method and device of table in PDF document
CN110532834B (en) * 2018-05-24 2022-12-23 北京庖丁科技有限公司 Table extraction method, device, equipment and medium based on rich text format document
CN110532834A (en) * 2018-05-24 2019-12-03 北京庖丁科技有限公司 Table extracting method, device, equipment and medium based on rich text format document
CN109670461A (en) * 2018-12-24 2019-04-23 广东亿迅科技有限公司 PDF text extraction method, device, computer equipment and storage medium
CN109815453A (en) * 2018-12-25 2019-05-28 东软集团股份有限公司 Document method of partition, device, storage medium and electronic equipment
CN109871524A (en) * 2019-02-21 2019-06-11 腾讯科技(深圳)有限公司 A kind of chart generation method and device
CN109871524B (en) * 2019-02-21 2023-06-09 腾讯科技(深圳)有限公司 Chart generation method and device
CN110347994A (en) * 2019-07-12 2019-10-18 北京香侬慧语科技有限责任公司 A kind of form processing method and device
CN110347994B (en) * 2019-07-12 2023-06-30 北京香侬慧语科技有限责任公司 Form processing method and device
US11010543B1 (en) 2020-08-11 2021-05-18 Fmr Llc Systems and methods for table extraction in documents
CN113190500A (en) * 2021-04-23 2021-07-30 广东云智安信科技有限公司 Information accumulation filing system and method based on internet report
CN113283398A (en) * 2021-07-13 2021-08-20 国网电子商务有限公司 Table identification method and system based on clustering
CN114218233A (en) * 2022-02-22 2022-03-22 子长科技(北京)有限公司 Annual newspaper processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN107622041B (en) 2021-02-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20190905

Address after: 100089 Unit 6, Floor 3, 25 Shangdi East Road, Haidian District, Beijing

Applicant after: China Science and Technology (Beijing) Co., Ltd.

Address before: Room 601, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing

Applicant before: Beijing Shenzhou Taiyue Software Co., Ltd.

CB02 Change of applicant information

Address after: 230000 zone B, 19th floor, building A1, 3333 Xiyou Road, hi tech Zone, Hefei City, Anhui Province

Applicant after: Dingfu Intelligent Technology Co., Ltd

Address before: 100089 Haidian District East Road, No. three, floor 6, unit 25,

Applicant before: DINFO (BEIJING) SCIENCE DEVELOPMENT Co.,Ltd.

GR01 Patent grant