CN110442719A - Text processing method, apparatus, device and storage medium - Google Patents
- Publication number
- CN110442719A, CN201910734656.9A
- Authority
- CN
- China
- Prior art keywords
- text
- line
- information
- piecemeal
- cut
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Character Input (AREA)
Abstract
Embodiments of the present disclosure provide a text processing method, apparatus, device and storage medium. The method includes: obtaining character position information contained in a text to be blocked; determining at least one text line, and the text line position information of each text line, according to the character position information; determining dividing-line information contained in the text to be blocked; determining a target distance between text lines according to the text line position information and the dividing-line information; clustering the text lines according to the target distance; and determining at least one text block of the text to be blocked according to the clustering result of the text lines. Because the method divides the text to be blocked using both character position information and dividing-line information, it simplifies the text blocking process and improves the accuracy of the text blocking result.
Description
Technical field
Embodiments of the present disclosure relate to the field of information technology, and in particular to a text processing method, apparatus, device and storage medium.
Background art
Portable Document Format (PDF) is a file format that presents documents in a manner independent of application software, hardware and operating system. A PDF file reproduces the appearance of a document faithfully, but because its main purpose is to guarantee the rendered result, the structural information of the content is neglected. The logical or semantic structure of the content of a PDF document therefore cannot be obtained directly, which makes the document hard to structure. If text is extracted from a PDF document directly, without text blocking, the text order can become scrambled. The text regions therefore need to be outlined as blocks so that the character order inside each block is correct, and the blocks are then arranged from top to bottom and from left to right. Text blocking is thus the basis for structuring PDF documents.
Current text blocking methods include: partitioning methods that use the horizontal and vertical coordinates of page elements to convert the two-dimensional plane segmentation problem into a one-dimensional string parsing problem and then apply rules to distinguish the corresponding elements; segmentation algorithms based on morphological operations; the Thiessen polygon (Voronoi) algorithm; the constrained run-length algorithm; and region detection algorithms based on deep learning. However, these methods either require a large number of rules and parameters to be set and give recognition results of limited accuracy, or require a large amount of labeled data for training, which makes the process cumbersome.
Summary of the invention
The present disclosure provides a text processing method, apparatus, device and storage medium, so as to simplify the text blocking process and improve the accuracy of the text blocking result.
In a first aspect, an embodiment of the present disclosure provides a text processing method, comprising:
obtaining character position information contained in a text to be blocked, and determining at least one text line and the text line position information of the text line according to the character position information;
determining dividing-line information contained in the text to be blocked, and determining a target distance between text lines according to the text line position information and the dividing-line information;
clustering the text lines according to the target distance, and determining at least one text block of the text to be blocked according to the clustering result of the text lines.
In a second aspect, an embodiment of the present disclosure further provides a text processing apparatus, comprising:
a text line determining module, configured to obtain character position information contained in a text to be blocked, and to determine at least one text line and the text line position information of the text line according to the character position information;
a target distance determining module, configured to determine, according to the text line position information, the dividing-line information contained in the text to be blocked, and to determine a target distance between text lines according to the text line position information and the dividing-line information;
a text block determining module, configured to cluster the text lines according to the target distance, and to determine at least one text block of the text to be blocked according to the clustering result of the text lines.
In a third aspect, an embodiment of the present disclosure further provides a terminal device, comprising:
one or more processing devices;
a storage device, configured to store one or more programs;
wherein, when the one or more programs are executed by the one or more processing devices, the one or more processing devices implement the text processing method according to any embodiment of the present disclosure.
In a fourth aspect, an embodiment of the present disclosure further provides a computer-readable storage medium whose computer-executable instructions, when executed by a computer processor, are used to perform the text processing method according to any embodiment of the present disclosure.
In the embodiments of the present disclosure, character position information contained in a text to be blocked is obtained; at least one text line and the text line position information of the text line are determined according to the character position information; dividing-line information contained in the text to be blocked is determined; a target distance between text lines is determined according to the text line position information and the dividing-line information; the text lines are clustered according to the target distance; and at least one text block of the text to be blocked is determined according to the clustering result of the text lines. The text to be blocked is thus divided using both character position information and dividing-line information, which simplifies the text blocking process and improves the accuracy of the text blocking result.
Brief description of the drawings
The above and other features, advantages and aspects of the embodiments of the present disclosure will become more apparent with reference to the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale.
Fig. 1 is a flowchart of a text processing method provided by an embodiment of the present disclosure;
Fig. 2 is a flowchart of a text processing method provided by an embodiment of the present disclosure;
Fig. 3a is a flowchart of a text processing method provided by an embodiment of the present disclosure;
Fig. 3b is a schematic diagram of a text block extraction result in a text processing method provided by an embodiment of the present disclosure;
Fig. 3c is a schematic diagram of a segmentation map in a text processing method provided by an embodiment of the present disclosure;
Fig. 3d is a schematic diagram of a text line clustering result in a text processing method provided by an embodiment of the present disclosure;
Fig. 3e is a schematic diagram of a text to be blocked in a text processing method provided by an embodiment of the present disclosure;
Fig. 3f is a schematic diagram of the blocking result of a text to be blocked in a text processing method provided by an embodiment of the present disclosure;
Fig. 4 is a schematic structural diagram of a text processing apparatus provided by an embodiment of the present disclosure;
Fig. 5 is a schematic structural diagram of a terminal device provided by an embodiment of the present disclosure.
Detailed description
Embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and completely. It should be understood that the drawings and embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of protection of the present disclosure.
It should be understood that the steps recorded in the method embodiments of the present disclosure may be performed in a different order and/or in parallel. In addition, the method embodiments may include additional steps and/or omit some of the steps shown. The scope of the present disclosure is not limited in this respect.
The term "comprise" and its variants as used herein are open-ended, that is, "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one other embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms are given in the description below.
It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different devices, modules or units, and are not intended to limit the order of, or the interdependence between, the functions performed by these devices, modules or units.
It should be noted that the modifiers "a/one" and "a plurality of" mentioned in the present disclosure are illustrative rather than restrictive. Those skilled in the art will understand that, unless the context clearly indicates otherwise, they should be read as "one or more".
The names of the messages or information exchanged between multiple devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of these messages or information.
In each of the following embodiments, optional features and examples are provided together. The features recorded in the embodiments can be combined to form multiple optional solutions, and each numbered embodiment should not be regarded as only a single technical solution.
Embodiment one
Fig. 1 is a flowchart of a text processing method provided by an embodiment of the present disclosure. This embodiment is applicable to the case of performing text blocking on a PDF document. The method may be executed by a text processing apparatus, which may be implemented in software and/or hardware; for example, the text processing apparatus may be configured in a terminal device. As shown in Fig. 1, the method comprises:
S110: Obtain character position information contained in the text to be blocked, and determine at least one text line and the text line position information of the text line according to the character position information.
In this embodiment, the text to be blocked may be one or more pages of a PDF document. The character position information may be the coordinates of each character in the text to be blocked. A PDF document may contain elements such as characters and pictures, and its data stream contains the element information of every element; the element information differs by element type. Illustratively, when the element is a character, the element information may include coordinates, font and size; when the element is a picture, the element information may include coordinates, height and width.
The coordinates of each character can be obtained by parsing the data stream of the PDF document. Illustratively, the element information can be used to judge whether the corresponding element is a character; when it is, the coordinates contained in the element information are taken as the character coordinates. Optionally, it can be judged whether the element information contains information corresponding to "font"; if it does, the corresponding element is determined to be a character.
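The "font" check described above can be sketched as follows. This is only an illustration: the dictionary layout and the key names `x`, `y` and `font` are assumptions made for the example, not the format of any particular PDF parser.

```python
from typing import Any, Dict, List, Tuple


def extract_char_coords(elements: List[Dict[str, Any]]) -> List[Tuple[float, float]]:
    """Return coordinates of elements whose element info carries a 'font' entry.

    Each element is a hypothetical dict of parsed element information; the
    presence of a 'font' key is used as the marker for a character element.
    """
    coords = []
    for info in elements:
        if "font" in info:  # font information present -> treat as a character
            coords.append((info["x"], info["y"]))
    return coords


elements = [
    {"x": 1.0, "y": 2.0, "font": "SimSun", "size": 12},   # character element
    {"x": 5.0, "y": 6.0, "width": 10, "height": 20},      # picture element
]
char_coords = extract_char_coords(elements)
```

Picture elements carry no font entry in this sketch, so they are skipped.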
After the character position information of all characters contained in the text to be blocked has been obtained, the text lines and their text line position information are determined from the character position information of each character. Optionally, the characters may be divided into text lines according to their coordinates, and the text line position information of each text line determined. The way of dividing characters into text lines is not limited here. Optionally, the region formed by characters that have the same vertical coordinate and whose horizontal coordinates lie within a set distance threshold of one another may be taken as one text line; alternatively, the region formed by characters that have the same vertical coordinate and continuous horizontal coordinates may be taken as one text line. The distance threshold may be set according to the character distribution of the text to be blocked.
Optionally, the position information of a text line may include a vertex coordinate of the text line (such as the top-left, bottom-left, top-right or bottom-right vertex coordinate), the height of the text line and the width of the text line. After the text lines have been determined from the character coordinates, for each text line the coordinates of the character located at at least one end of that line are determined, and the vertex coordinate of the text line is determined from the coordinates of that end character. Illustratively, if the vertex coordinate of a text line is set to be that of its top-left corner, the top-left vertex coordinate of the character at the left end of the text line is taken as the vertex coordinate of the text line. The element information of the characters is obtained, and the height of the text line is determined from the font size contained in the element information.
In this embodiment, the width of a text line may be determined from the number of characters in the text line together with the element information, or from the coordinates of the characters at the ends of the text line. Illustratively, the number of characters contained in the text line can be determined, and the width of the text line determined from the font size, the number of characters and the character spacing. Alternatively, the coordinates of the characters at the two ends of the text line can be obtained, the distance between them calculated, and that distance taken as the width of the text line.
In one embodiment, the character position information includes character coordinates, and determining at least one text line and the text line position information of the text line according to the character position information comprises: taking characters with continuous horizontal coordinates and the same vertical coordinate as one text line, and determining the text line position information of the text line according to the character position information of the characters in the text line.
Preferably, characters with continuous horizontal coordinates and the same vertical coordinate are taken as one text line, and the text line position information is determined from the character position information in the text line. Treating characters with continuous horizontal coordinates and the same vertical coordinate as one text line ensures that each resulting text line contains continuous text, which makes the division into text lines more accurate. The way of determining the text line position information from the position information of the characters in the text line is as described above and is not repeated here.
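The grouping rule above (same vertical coordinate, continuous horizontal coordinates) can be sketched as follows. The `Char` and `TextLine` types, the `max_gap` threshold, and the choice to include the last character's own width in the line width are assumptions made for this illustration; the text only requires a vertex coordinate, a width and a height per text line.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List


@dataclass
class Char:
    x: float      # left edge of the character
    y: float      # top edge of the character
    width: float  # advance width of the character
    size: float   # font size, used here as the character height


@dataclass
class TextLine:
    x: float       # top-left vertex, horizontal coordinate
    y: float       # top-left vertex, vertical coordinate
    width: float
    height: float


def _to_line(run: List[Char]) -> TextLine:
    """Build the line box from the characters at the two ends of the run."""
    first, last = run[0], run[-1]
    return TextLine(first.x, first.y,
                    last.x + last.width - first.x,
                    max(c.size for c in run))


def build_text_lines(chars: List[Char], max_gap: float = 1.0) -> List[TextLine]:
    """Group characters with equal y and near-contiguous x into text lines."""
    lines: List[TextLine] = []
    ordered = sorted(chars, key=lambda c: (c.y, c.x))
    for _, row in groupby(ordered, key=lambda c: c.y):
        run: List[Char] = []
        for ch in row:
            # A horizontal gap larger than max_gap starts a new text line.
            if run and ch.x - (run[-1].x + run[-1].width) > max_gap:
                lines.append(_to_line(run))
                run = []
            run.append(ch)
        lines.append(_to_line(run))
    return lines


chars = [Char(0, 0, 10, 12), Char(10, 0, 10, 12), Char(50, 0, 10, 12), Char(0, 20, 10, 12)]
lines = build_text_lines(chars, max_gap=2.0)
```

In this example the third character sits far to the right of the second one, so it forms its own text line even though it shares their vertical coordinate.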
S120: Determine the dividing-line information contained in the text to be blocked, and determine the target distance between text lines according to the text line position information and the dividing-line information.
To make the division of the text to be blocked into text blocks more accurate, this embodiment uses the dividing-line information contained in the text to be blocked as a parameter for text block division: the target distance between text lines is determined from the text line position information and the dividing-line information, and the text blocks are divided on the basis of that target distance. Using dividing-line information as a parameter makes it possible to assign to different text blocks two text lines that are close together but do not actually belong to the same text region (for example, two text lines that are close to each other but have a dividing line between them).
In this embodiment, the dividing-line information contained in the text to be blocked may be determined by an image segmentation algorithm. Because the characters in the text to be blocked may affect the image segmentation result and make the extracted dividing-line information inaccurate, the character regions of the text to be blocked may first be filled with color to remove their influence on the segmentation result.
Optionally, edge detection may be used to obtain the pixel matrix of an image that contains only the dividing lines. If the pixel value of a pixel in that image falls within a set pixel value range, the color changes sharply at that pixel, and the pixel may belong to a dividing line.
In one embodiment, determining the dividing-line information contained in the text to be blocked comprises:
converting the regions of the text to be blocked other than the text lines into picture format, and graying the resulting picture to obtain a grayscale picture;
filling the pixel values of the pixels in the regions of the grayscale picture that correspond to the text line position information of the text to be blocked, to obtain a picture to be detected;
performing edge detection on the picture to be detected by an edge detection algorithm, and taking the detected edge information as the dividing-line information.
To fill the character regions of the text to be blocked with color, the regions of the text to be blocked other than the text lines are converted into picture format, the resulting picture is grayed to obtain a grayscale picture, and the pixel values of the text line regions are filled on the basis of the grayscale picture. Optionally, if the background of the text to be blocked is a single color, the pixel value of the background color can be used directly as the pixel value of the text line regions. If the background contains multiple colors, for example a gradient, the pixel values of the text line regions can be filled by interpolation from the pixel values of the pixels surrounding each text line region, to obtain the picture to be detected. Interpolation filling brings the pixel values of the text line regions closer to the background pixel values, which makes the extraction of dividing-line information more accurate. The interpolation method is not limited here. Illustratively, bilinear interpolation may be used to fill the pixel values of the text line regions.
After the picture to be detected is obtained, an edge detection algorithm is used to detect the dividing-line information it contains. This embodiment does not limit the edge detection algorithm. Illustratively, the Canny operator, the Roberts operator, the Sobel operator and the like may be used to perform edge detection on the picture to be detected.
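The fill-then-detect steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the grayscale picture is a list of rows of integers in 0..255, the text line boxes are `(x, y, width, height)` tuples in pixel units, the background is a single color (the simple fill case above, not the interpolation case), and the Roberts cross operator is the edge detector. The function names are hypothetical.

```python
from typing import List, Tuple

Gray = List[List[int]]  # row-major grayscale picture, values 0..255


def fill_regions(img: Gray, boxes: List[Tuple[int, int, int, int]], background: int) -> Gray:
    """Overwrite each (x, y, width, height) text line box with the background value."""
    out = [row[:] for row in img]
    for x, y, w, h in boxes:
        for q in range(y, min(y + h, len(out))):
            for p in range(x, min(x + w, len(out[q]))):
                out[q][p] = background
    return out


def roberts_edges(img: Gray) -> Gray:
    """Roberts cross gradient magnitude (|gx| + |gy|), clipped to 255."""
    rows, cols = len(img), len(img[0])
    edges = [[0] * cols for _ in range(rows)]
    for q in range(rows - 1):
        for p in range(cols - 1):
            gx = img[q][p] - img[q + 1][p + 1]
            gy = img[q][p + 1] - img[q + 1][p]
            edges[q][p] = min(255, abs(gx) + abs(gy))
    return edges


# 6x6 white page with two dark "character" pixels on row 1 and a ruled line on row 3.
page = [[255] * 6 for _ in range(6)]
page[1][1] = page[1][2] = 0
page[3] = [0] * 6

to_detect = fill_regions(page, [(1, 1, 2, 1)], background=255)
edges = roberts_edges(to_detect)
```

Without the fill step the dark character pixels would themselves produce edges; after filling, only the ruled line on row 3 generates a response.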
In one embodiment, the space distance between text lines can be adjusted according to the dividing-line information between them to obtain the target distance between the text lines. Illustratively, the space distance between text lines is calculated from the text line position information; the dividing-line situation between the text lines is judged from the text line position information and the dividing-line information; an adjusting parameter for the space distance is determined from that dividing-line situation; and the target distance between the text lines is calculated from the space distance and the adjusting parameter. Optionally, a correspondence between dividing-line situations and adjusting parameters can be preset; once the dividing-line situation between text lines has been determined, the adjusting parameter for the space distance between those text lines is found by looking up the preset correspondence. Illustratively, the space distance and the adjusting parameter can be summed or multiplied, and the result taken as the target distance.
Here, the dividing-line situation between text lines may be that a dividing line exists between the text lines or that no dividing line exists between them. Further, the case in which a dividing line exists may be subdivided according to the length of the dividing line between the text lines: for example, the dividing-line length is divided into N length ranges, and each length range is treated as one dividing-line situation. Alternatively, that case may be subdivided according to the ratio of the dividing-line length to the text line width: the ratio is divided into M ratio ranges, and each ratio range is treated as one dividing-line situation. M and N are integers greater than 1.
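The preset correspondence between dividing-line situations and adjusting parameters can be sketched as a small lookup. Everything numeric here is hypothetical: the ratio boundaries (giving M = 3 ratio ranges) and the adjusting parameters are made-up values for illustration, and the sketch uses the summation variant rather than the product.

```python
from bisect import bisect_right

# Hypothetical boundaries on (dividing-line length / text line width): M = 3 ranges.
RATIO_BOUNDS = [0.25, 0.75]
# Hypothetical adjusting parameter for each of the three ratio ranges.
ADJUSTMENTS = [0.5, 1.5, 3.0]


def adjusted_distance(space_distance: float, divider_length: float, line_width: float) -> float:
    """Add the adjusting parameter that matches the dividing-line situation."""
    if divider_length <= 0:
        return space_distance  # no dividing line between the two text lines
    ratio = divider_length / line_width
    return space_distance + ADJUSTMENTS[bisect_right(RATIO_BOUNDS, ratio)]
```

A longer dividing line relative to the text line width falls into a higher ratio range and therefore pushes the two text lines further apart.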
S130: Cluster the text lines according to the target distance, and determine at least one text block of the text to be blocked according to the clustering result of the text lines.
In this embodiment, after the target distance between every two text lines has been determined, the text lines are clustered according to the target distances between them, and the set of text lines belonging to the same category is taken as one text block. This embodiment does not limit the clustering method. Illustratively, K-means clustering, a hierarchical clustering algorithm, an SOM neural network clustering algorithm or the like may be used to cluster the text lines based on the target distances between them.
In one embodiment, clustering the text lines according to the target distance and determining at least one text block of the text to be blocked according to the clustering result of the text lines comprises:
determining the adjacency matrix for text line clustering according to the target distances between text lines;
clustering the text lines based on the adjacency matrix, and determining the text block position information of each category according to the position information of the text lines in that category.
Optionally, a spectral clustering algorithm may be used to cluster the text lines based on the target distances between them. Specifically, each text line is treated as a node and assigned an identifier, and the target distance between two text lines is used as the weight of the edge between the two corresponding nodes, which yields the adjacency matrix. Illustratively, the element $d_{ij}$ in row i, column j of the adjacency matrix is the target distance between text line $l_i$ and text line $l_j$. After the adjacency matrix is obtained, an Ncut partition is performed to obtain the clustering result of the text lines, and the text block information is determined from that clustering result.
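The adjacency matrix construction can be sketched as follows. Note that the clustering step in this sketch is not the Ncut partition described above: as a simpler stand-in it uses single-linkage grouping (a basic form of hierarchical clustering) with a hypothetical distance threshold `max_dist`, implemented with a union-find structure. The function names are assumptions for illustration.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")


def adjacency_matrix(lines: Sequence[T], dist: Callable[[T, T], float]) -> List[List[float]]:
    """Element [i][j] is the target distance between text line i and text line j."""
    n = len(lines)
    return [[dist(lines[i], lines[j]) for j in range(n)] for i in range(n)]


def cluster_by_threshold(d: List[List[float]], max_dist: float) -> List[int]:
    """Single-linkage grouping: lines closer than max_dist end up in one cluster."""
    parent = list(range(len(d)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(d)):
        for j in range(i + 1, len(d)):
            if d[i][j] <= max_dist:
                parent[find(i)] = find(j)

    labels: Dict[int, int] = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(d))]


# Toy example: text lines reduced to a single vertical position each.
positions = [0.0, 1.0, 2.0, 10.0, 11.0]
d = adjacency_matrix(positions, lambda a, b: abs(a - b))
labels = cluster_by_threshold(d, max_dist=1.5)
```

Any clustering method that accepts a precomputed distance matrix could replace `cluster_by_threshold` here; the matrix construction is the part that follows the description directly.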
Illustratively, if the clustering result is that text line 1, text line 2 and text line 3 belong to cluster set 1 and that text line 4 and text line 5 belong to cluster set 2, then text line 1, text line 2 and text line 3 are determined to constitute text block 1, and text line 4 and text line 5 to constitute text block 2. The position information of text block 1 is determined from the position information of text lines 1, 2 and 3, and the position information of text block 2 from the position information of text lines 4 and 5. Illustratively, the position information of a text block may include the vertex coordinate of the text block, the width of the text block, the height of the text block and so on. The vertex coordinate of a text block can be determined from the vertex coordinates of the text lines in the block, and the width and height of the text block can each be determined from the coordinate information of the text lines in the block.
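One way to derive the text block position information from its text lines is to take the bounding box of the line boxes. The sketch below assumes each text line is given as `(x, y, width, height)` with the vertex at the top-left corner and the vertical coordinate increasing downward; both conventions are assumptions for the example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height), top-left vertex


def block_box(lines: List[Box]) -> Box:
    """Bounding box of all text lines assigned to one text block."""
    left = min(x for x, _, _, _ in lines)
    top = min(y for _, y, _, _ in lines)
    right = max(x + w for x, _, w, _ in lines)
    bottom = max(y + h for _, y, _, h in lines)
    return (left, top, right - left, bottom - top)


block_1 = block_box([(10, 10, 100, 12), (10, 24, 80, 12), (12, 38, 120, 12)])
```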
In this embodiment, character position information contained in a text to be blocked is obtained; at least one text line and the text line position information of the text line are determined according to the character position information; dividing-line information contained in the text to be blocked is determined; a target distance between text lines is determined according to the text line position information and the dividing-line information; the text lines are clustered according to the target distance; and at least one text block of the text to be blocked is determined according to the clustering result of the text lines. The text to be blocked is thus divided using both character position information and dividing-line information, which simplifies the text blocking process and improves the accuracy of the text blocking result.
Embodiment two
Fig. 2 is a flowchart of a text processing method provided by an embodiment of the present disclosure. This embodiment can be combined with the optional solutions of one or more of the embodiments above. As shown in Fig. 2, the method comprises:
S210: Obtain character position information contained in the text to be blocked, and determine at least one text line and the text line position information of the text line according to the character position information.
S220: Determine the dividing-line information contained in the text to be blocked.
S230: Determine the space distance between text lines according to the text line position information.
In this embodiment, a space distance calculation rule can be preset, and the space distance between text lines calculated from that rule and the text line position information. Optionally, the position information of a text line can be represented by four parameters. Illustratively, the space distance between text line $l_i$ and text line $l_j$ is defined as:
$$d_0(l_i, l_j) = \alpha \frac{|x_i - x_j|}{\max(w_i, w_j)} + \frac{|y_i - y_j|}{\max(h_i, h_j)}$$
where $d_0(l_i, l_j)$ is the space distance between text line $l_i$ and text line $l_j$; $(x_i, y_i)$ is the position coordinate of the top-left vertex of text line $l_i$, $h_i$ is the height of $l_i$ and $w_i$ is the width of $l_i$; $(x_j, y_j)$ is the position coordinate of the top-left vertex of text line $l_j$, $h_j$ is the height of $l_j$ and $w_j$ is the width of $l_j$; and $\alpha$ is a parameter that controls the relative importance of line spacing and column spacing. Optionally, the value of $\alpha$ can be 1.5.
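The space distance can be computed directly from the four parameters of each text line. The sketch below follows the definition above with the optional value 1.5 for α; the `TextLine` type and its field names are assumptions for the example.

```python
from typing import NamedTuple


class TextLine(NamedTuple):
    x: float  # top-left vertex, horizontal coordinate
    y: float  # top-left vertex, vertical coordinate
    w: float  # width
    h: float  # height


def space_distance(a: TextLine, b: TextLine, alpha: float = 1.5) -> float:
    """d0: horizontal offset scaled by width plus vertical offset scaled by height."""
    return alpha * abs(a.x - b.x) / max(a.w, b.w) + abs(a.y - b.y) / max(a.h, b.h)


line_a = TextLine(0, 0, 100, 10)
line_below = TextLine(0, 20, 80, 10)    # same column, two line heights lower
line_right = TextLine(50, 0, 100, 10)   # same row, shifted by half a line width
```

Because each offset is normalized by the larger width or height, the distance is insensitive to the absolute font size of the page.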
S240: Determine the segmentation distance between text lines according to the text line position information and the dividing-line information, the segmentation distance being the number of segmentation points present between the text lines.
In this embodiment, the segmentation distance between text lines is made concrete as the number of segmentation points present between the text lines. In one embodiment, determining the segmentation distance between text lines according to the text line position information and the dividing-line information comprises:
determining a segmentation point identification range according to the text line position information;
obtaining the pixel values of the pixels within the segmentation point identification range, and taking the number of pixels in that range whose pixel value is greater than a set threshold as the segmentation distance.
Optionally, the segmentation point identification range can be determined from the position information of text line $l_i$ and the position information of text line $l_j$. Illustratively, if the top-left vertex coordinate of text line $l_i$ is $(x_i, y_i)$ and the top-left vertex coordinate of text line $l_j$ is $(x_j, y_j)$, the position range corresponding to the set of points $(p, q)$ that satisfy $x_j \le p \le x_i + w_i$ and $y_i + h_i \le q \le y_j$ can be taken as the segmentation point identification range. After the segmentation point identification range has been determined, the number of segmentation points in the range is determined from the pixel value of each pixel in the range. Illustratively, pixels whose pixel value is greater than the set threshold can be taken as segmentation points.
Optionally, the number of segmentation points between text lines can be determined from the pixel matrix of the picture obtained by edge detection. Illustratively, consistent with the identification range and threshold rule above, the segmentation distance between text line $l_i$ and text line $l_j$ is defined as:
$$d_1(l_i, l_j) = \sum_{p = x_j}^{x_i + w_i} \; \sum_{q = y_i + h_i}^{y_j} \mathbb{1}\left[ I(p, q) > \theta \right]$$
where $d_1(l_i, l_j)$ is the segmentation distance between text line $l_i$ and text line $l_j$; $(x_i, y_i)$ is the position coordinate of the top-left vertex of text line $l_i$, $h_i$ is the height of $l_i$ and $w_i$ is the width of $l_i$; $(x_j, y_j)$ is the position coordinate of the top-left vertex of text line $l_j$; $I(p, q)$ is the pixel value of the pixel at coordinate $(p, q)$ in the pixel matrix; and $\theta$ is a preset pixel value threshold. Optionally, the value of $\theta$ can be 50.
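Counting segmentation points in the identification range can be sketched as follows. The sketch assumes the edge picture is a list of rows of integers indexed as `edges[q][p]` (row q, column p), that text line coordinates are integer pixel positions, and that points falling outside the picture are simply skipped; the `TextLine` type and field names are assumptions for the example.

```python
from typing import List, NamedTuple


class TextLine(NamedTuple):
    x: int  # top-left vertex, column index
    y: int  # top-left vertex, row index
    w: int  # width in pixels
    h: int  # height in pixels


def segmentation_distance(edges: List[List[int]], li: TextLine, lj: TextLine, theta: int = 50) -> int:
    """Count pixels with value > theta for x_j <= p <= x_i + w_i, y_i + h_i <= q <= y_j."""
    count = 0
    for q in range(li.y + li.h, lj.y + 1):
        for p in range(lj.x, li.x + li.w + 1):
            inside = 0 <= q < len(edges) and 0 <= p < len(edges[q])
            if inside and edges[q][p] > theta:
                count += 1
    return count


# 10x10 edge picture with one detected horizontal line on row 5.
edges = [[0] * 10 for _ in range(10)]
edges[5] = [255] * 10

upper = TextLine(0, 0, 8, 3)
lower = TextLine(0, 7, 8, 3)
```

With these values the range covers columns 0 to 8 and rows 3 to 7, so the nine edge pixels of row 5 inside the range are counted as segmentation points.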
S250: Determine the target distance between text lines according to the space distance and the segmentation distance.
After the space distance and the segmentation distance between text lines have been determined, the target distance between the text lines is calculated from them. Optionally, the target distance can be obtained as a weighted sum of the space distance and the segmentation distance. In one embodiment, determining the target distance between text lines according to the space distance and the segmentation distance comprises: performing a weighted summation of the space distance and the segmentation distance to obtain the target distance.
Optionally, target range computation rule is defined are as follows: d (li, lj)=d0(li, lj)+λd1(li, lj).Wherein, d (li,
lj) it is line of text liWith line of text ljBetween target range, d0(li, lj) it is line of text liWith line of text ljBetween space away from
From d1(li, lj) it is line of text liWith line of text ljBetween segmentation distance, λ be divide distance weight.Wherein, the value of λ can be with
It is adjusted according to the location parameter of line of text.
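As a small worked illustration of the weighted sum, the sketch below evaluates $d = d_0 + \lambda d_1$ for two candidate neighbours of the same text line. The function name `target_distance` and all numeric values, including the value of $\lambda$, are hypothetical and chosen only to show the effect of the weight.

```python
def target_distance(spatial, segmentation, lam=1.0):
    """Weighted sum d = d0 + lam * d1 of spatial and segmentation distance."""
    return spatial + lam * segmentation


# Candidate A is spatially close but separated by a dividing line;
# candidate B is farther away with nothing in between.
d_a = target_distance(spatial=12.0, segmentation=40, lam=0.5)
d_b = target_distance(spatial=25.0, segmentation=0, lam=0.5)
closer = "A" if d_a < d_b else "B"
```

With these numbers candidate B ends up closer in target distance even though candidate A is nearer on the page, which is the behaviour the segmentation term is meant to produce.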
S260, cluster the text lines according to the target distance, and determine at least one text block of the text to be blocked according to the clustering result of the text lines.
The technical solution of this embodiment of the present disclosure spells out how the target distance between text lines is determined from the text line position information and the dividing line information: the spatial distance between text lines is determined from the text line position information, the segmentation distance between text lines is determined from the text line position information and the dividing line information, and the target distance is determined from the spatial distance and the segmentation distance. The target distance is thus computed accurately, which makes the text line clustering result based on it more accurate.
Embodiment three
Fig. 3a is a flowchart of a text processing method provided by an embodiment of the present disclosure. On the basis of the above embodiments, this embodiment provides a preferred embodiment. As shown in Fig. 3a, the method comprises:
S310, start.
S320, obtain a PDF document.
Obtain the PDF document input by the user.
S330, extract the text information of the PDF document.
Extract the text information, including character coordinates, font size, width and height, from the PDF file information stream.
S340, generate text lines from the text information.
Text lines are determined according to the character position information. Fig. 3b is a schematic diagram of the text block extraction result in a text processing method provided by an embodiment of the present disclosure. As shown in Fig. 3b, the black regions in the figure are the extracted text lines.
S350, convert the PDF document into a grayscale picture and remove the text portions using bilinear interpolation.
To bring out the other segmentation information in the document, the parts of the document other than the text are converted into a picture, the picture is converted to grayscale to obtain a grayscale picture, and the text portions of the picture are filled in using bilinear interpolation.
S360, perform edge detection on the picture to obtain a segmentation map.
Edge detection is performed on the picture using the Canny operator to obtain a segmentation map containing the dividing lines. Fig. 3c is a schematic diagram of the segmentation map in a text processing method provided by an embodiment of the present disclosure. As shown in Fig. 3c, the white lines in the figure are the extracted dividing lines.
S370, calculate the distances between text lines to obtain an adjacency matrix.
The segmentation distance between text lines is calculated from the segmentation map and combined with the spatial distance between text lines to obtain the adjacency matrix.
S380, obtain the text line clustering result, i.e. the text blocks, using spectral clustering.
The text lines are clustered by spectral clustering on the basis of the adjacency matrix to obtain a clustering result, and the text blocking result is determined from the clustering result. Fig. 3d is a schematic diagram of the text line clustering result in a text processing method provided by an embodiment of the present disclosure. As shown in Fig. 3d, the text lines are clustered into three classes: cluster set 1 contains text line H, cluster set 2 contains text line A, text line B and text line C, and cluster set 3 contains text line D, text line G, text line F and text line E. Text line H therefore constitutes text block 1; text line A, text line B and text line C constitute text block 2; and text line D, text line G, text line F and text line E constitute text block 3.
S390, end.
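Step S370 can be sketched as follows. This is only an illustration under assumptions: the name `adjacency_matrix`, the two callback parameters and the weight `lam` are hypothetical, the text lines are assumed to be ordered from top to bottom, and the matrix is mirrored so that it is symmetric. The matrix here holds target distances as the description states; whether these are converted into similarities before spectral clustering is not specified in the text.

```python
def adjacency_matrix(lines, spatial_fn, segmentation_fn, lam=1.0):
    """Build a symmetric matrix of target distances between text lines.

    lines: list of (x, y, w, h) text line boxes, assumed sorted top to bottom.
    spatial_fn, segmentation_fn: callables returning d0 and d1 for a pair.
    """
    n = len(lines)
    matrix = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            d0 = spatial_fn(lines[a], lines[b])
            d1 = segmentation_fn(lines[a], lines[b])
            # Target distance d = d0 + lam * d1, stored in both triangles.
            matrix[a][b] = matrix[b][a] = d0 + lam * d1
    return matrix


# Toy distance functions: vertical offset, plus a 4-pixel dividing line at y = 10.
def vertical_offset(a, b):
    return abs(b[1] - a[1])


def crosses_divider(a, b):
    return 4 if a[1] < 10 <= b[1] else 0


lines = [(0, 0, 10, 2), (0, 3, 10, 2), (0, 20, 10, 2)]
matrix = adjacency_matrix(lines, vertical_offset, crosses_divider, lam=2.0)
```

In this toy layout the first two text lines stay close (distance 3.0), while the third, which lies beyond the dividing line, is pushed away from both of them.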
Fig. 3e is a schematic diagram of a text to be blocked in a text processing method provided by an embodiment of the present disclosure. Fig. 3f is a schematic diagram of the blocking result for the text to be blocked in a text processing method provided by an embodiment of the present disclosure. Fig. 3e and Fig. 3f illustrate the blocking effect achieved by the text processing method provided by the embodiment of the present disclosure. As shown in Fig. 3f, the first text block 301f, the second text block 302f, the third text block 303f, the fourth text block 304f, the fifth text block 305f, the sixth text block 306f, the seventh text block 307f, the eighth text block 308f, the ninth text block 309f, the tenth text block 310f and the eleventh text block 311f are the text blocks obtained by applying the text processing method provided by the embodiment of the present disclosure to the text to be blocked shown in Fig. 3e. It can be seen that the text blocking result obtained with this method is highly accurate.
The embodiment of the present disclosure uses the PDF document information to process text regions and non-text regions separately, avoiding mutual interference; extracts information such as the dividing lines from the picture using an edge detection algorithm in order to segment the text; takes both the spatial distance and the segmentation distance between text lines into account; and obtains the text blocks automatically using a clustering algorithm. No large set of rules or training data is required: with only a small number of parameters to determine, a document can be accurately divided into text blocks.
Embodiment four
Fig. 4 is a schematic structural diagram of a text processing apparatus provided by an embodiment of the present disclosure. This embodiment is applicable to blocking PDF text. The text processing apparatus can be implemented in software and/or hardware; for example, it can be configured in a terminal device. As shown in Fig. 4, the text processing apparatus comprises a text line determining module 410, a target distance determining module 420 and a text block determining module 430, wherein:
the text line determining module 410 is configured to obtain the character position information contained in the text to be blocked, and determine at least one text line and the text line position information of the text line according to the character position information;
the target distance determining module 420 is configured to determine the dividing line information contained in the text to be blocked, and determine the target distance between text lines according to the text line position information and the dividing line information;
the text block determining module 430 is configured to cluster the text lines according to the target distance, and determine at least one text block of the text to be blocked according to the clustering result of the text lines.
In this embodiment of the present disclosure, the text line determining module obtains the character position information contained in the text to be blocked and determines at least one text line and the text line position information of the text line according to the character position information; the target distance determining module determines the dividing line information contained in the text to be blocked and determines the target distance between text lines according to the text line position information and the dividing line information; and the text block determining module clusters the text lines according to the target distance and determines at least one text block of the text to be blocked according to the clustering result. The text to be blocked is thus blocked according to the character position information and the dividing line information, which simplifies the text blocking process and improves the accuracy of the text blocking result.
Optionally, on the basis of the above technical solution, the target distance determining module 420 comprises:
a spatial distance determining unit, configured to determine the spatial distance between text lines according to the text line position information;
a segmentation distance determining unit, configured to determine the segmentation distance between text lines according to the text line position information and the dividing line information, the segmentation distance being the number of segmentation points present between the text lines;
a target distance determining unit, configured to determine the target distance between text lines according to the spatial distance and the segmentation distance.
Optionally, on the basis of the above technical solution, the segmentation distance determining unit is specifically configured to:
determine the segmentation point identification range according to the text line position information;
obtain the pixel values of the pixels within the segmentation point identification range, and take the number of pixels in that range whose pixel value is greater than a set threshold as the segmentation distance.
Optionally, on the basis of the above technical solution, the target distance determining unit is specifically configured to:
compute a weighted sum of the spatial distance and the segmentation distance to obtain the target distance.
Optionally, on the basis of the above technical solution, the target distance determining module 420 comprises a segmentation information detection unit, configured to:
convert the regions of the text to be blocked other than the text lines into picture format, and convert the resulting picture to grayscale to obtain a grayscale picture;
fill the pixel values of the pixels in the regions of the grayscale picture that correspond to the text position information of the text to be blocked, to obtain a picture to be detected;
perform edge detection on the picture to be detected using an edge detection algorithm, and take the detected edge information as the dividing line information.
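The filling step can be illustrated with a small pure-Python sketch. The description names bilinear interpolation for filling the text portions but does not give the details, so everything here is an assumption: the function name `fill_text_region`, the inclusive box format `(x0, y0, x1, y1)`, the requirement that the box lies strictly inside the picture, and the choice of interpolating from the four pixels diagonally adjacent to the box corners.

```python
def fill_text_region(gray, box):
    """Overwrite a text box with values bilinearly interpolated from the
    four pixels diagonally adjacent to its corners (assumed scheme).

    gray: list of rows of grayscale values, modified in place.
    box:  (x0, y0, x1, y1) inclusive bounds, strictly inside the picture.
    """
    x0, y0, x1, y1 = box
    left, right, top, bottom = x0 - 1, x1 + 1, y0 - 1, y1 + 1
    tl, tr = gray[top][left], gray[top][right]
    bl, br = gray[bottom][left], gray[bottom][right]
    for y in range(y0, y1 + 1):
        ty = (y - top) / (bottom - top)
        for x in range(x0, x1 + 1):
            tx = (x - left) / (right - left)
            upper = (1 - tx) * tl + tx * tr   # interpolate along the top edge
            lower = (1 - tx) * bl + tx * br   # interpolate along the bottom edge
            gray[y][x] = round((1 - ty) * upper + ty * lower)
    return gray


# Dark glyph pixels (0) on a uniform light background (200).
page = [[200] * 8 for _ in range(6)]
for y in range(2, 4):
    for x in range(2, 6):
        page[y][x] = 0
fill_text_region(page, (2, 2, 5, 3))
```

After the call the text box blends into the background, so a subsequent edge detector would respond to rulings and borders rather than to glyph outlines. In practice an image library would normally be used for both the filling and the edge detection.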
Optionally, on the basis of the above technical solution, the text block determining module 430 is specifically configured to:
determine the adjacency matrix for text line clustering according to the target distances between text lines;
cluster the text lines on the basis of the adjacency matrix, and determine the text block position information of each class according to the text line position information of the text lines in that class.
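Deriving the text block position from the text lines of one class can be sketched as a bounding-box union. The function name `text_blocks`, the `(x, y, w, h)` box format and the use of integer class labels are assumptions made for this illustration; the description only states that the block position is determined from the text line positions of the same class.

```python
def text_blocks(lines, labels):
    """Return {label: (x, y, w, h)}, the bounding box of each cluster.

    lines:  list of (x, y, w, h) text line boxes.
    labels: cluster label of each text line, in the same order.
    """
    extents = {}
    for (x, y, w, h), label in zip(lines, labels):
        if label not in extents:
            extents[label] = [x, y, x + w, y + h]
        else:
            box = extents[label]
            box[0] = min(box[0], x)
            box[1] = min(box[1], y)
            box[2] = max(box[2], x + w)
            box[3] = max(box[3], y + h)
    # Convert (left, top, right, bottom) extents back to (x, y, w, h).
    return {label: (l, t, r - l, b - t) for label, (l, t, r, b) in extents.items()}


lines = [(10, 10, 100, 12), (10, 25, 80, 12), (10, 60, 120, 12)]
labels = [0, 0, 1]
blocks = text_blocks(lines, labels)
```

Here the first two text lines share a class and merge into one block, while the third forms a block of its own.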
Optionally, on the basis of the above technical solution, the character position information includes character coordinates, and the text line determining module 410 is specifically configured to:
take characters whose abscissas are contiguous and whose ordinates are identical as one text line, and determine the text line position information of the text line according to the character position information of the characters in the text line.
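The grouping rule above can be sketched as follows. The function name `group_text_lines`, the `(x, y, w, h)` character boxes and the `max_gap` tolerance used to decide whether two abscissas are contiguous are hypothetical; the description only requires contiguous abscissas and identical ordinates. Real PDF coordinates are often fractional, so an implementation might also need a tolerance on the ordinate, which is not shown here.

```python
def group_text_lines(chars, max_gap=1.0):
    """Merge character boxes into text line boxes.

    chars:   list of (x, y, w, h) character boxes.
    max_gap: largest horizontal gap still treated as contiguous (assumed).
    """
    lines = []
    # Visit characters row by row, left to right.
    for x, y, w, h in sorted(chars, key=lambda c: (c[1], c[0])):
        if lines:
            lx, ly, lw, lh = lines[-1]
            if y == ly and x - (lx + lw) <= max_gap:
                # Same ordinate and contiguous abscissa: extend the current line.
                right = max(lx + lw, x + w)
                lines[-1] = (lx, ly, right - lx, max(lh, h))
                continue
        lines.append((x, y, w, h))
    return lines


chars = [(0, 0, 5, 10), (5, 0, 5, 10), (10, 0, 5, 10), (40, 0, 5, 10), (0, 20, 5, 10)]
text_lines = group_text_lines(chars)
```

The three adjacent characters on the first row merge into one text line, the character far to the right on the same row starts a new text line, and the character on the second row forms a third.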
The text processing apparatus provided by the embodiment of the present disclosure can execute the text processing method provided by the embodiments of the present disclosure, and has the functional modules and beneficial effects corresponding to the executed method.
It is worth noting that the units and modules included in the above apparatus are divided only according to functional logic; the division is not limited to the one described above, as long as the corresponding functions can be realized. In addition, the specific names of the functional units serve only to distinguish them from one another and do not limit the protection scope of the embodiments of the present disclosure.
Embodiment five
Referring now to Fig. 5, a schematic structural diagram of a terminal device 500 suitable for implementing embodiments of the present disclosure is shown. Terminal devices in the embodiments of the present disclosure can include, but are not limited to, mobile phones, laptop computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable media players), vehicle-mounted terminals (e.g. vehicle navigation terminals) and the like. The terminal device shown in Fig. 5 is only an example and should not impose any limitation on the function or scope of use of the embodiments of the present disclosure.
As shown in Fig. 5, the terminal device 500 can include a processing device (e.g. a central processing unit, a graphics processor, etc.) 501, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage device 506 into a random access memory (RAM) 503. The RAM 503 also stores the various programs and data needed for the operation of the terminal device 500. The processing device 501, the ROM 502 and the RAM 503 are connected to one another through a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
In general, the following devices can be connected to the I/O interface 505: an input device 506 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer or gyroscope; an output device 507 including, for example, a liquid crystal display (LCD), loudspeaker or vibrator; a storage device 506 including, for example, a magnetic tape or hard disk; and a communication device 509. The communication device 509 can allow the terminal device 500 to communicate with other devices, wirelessly or by wire, in order to exchange data. Although Fig. 5 shows a terminal device 500 having various devices, it should be understood that it is not required to implement or have all of the devices shown. More or fewer devices can alternatively be implemented or provided.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such embodiments, the computer program can be downloaded and installed from a network via the communication device 509, installed from the storage device 506, or installed from the ROM 502. When the computer program is executed by the processing device 501, the above functions defined in the method of the embodiments of the present disclosure are performed.
It should be noted that the computer-readable medium of the present disclosure can be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium can be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of computer-readable storage media can include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium can be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus or device. In the present disclosure, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal can take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device. Program code contained on a computer-readable medium can be transmitted by any suitable medium, including but not limited to electric wire, optical cable, RF (radio frequency) and the like, or any suitable combination of the above.
In some embodiments, clients and servers can communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (e.g. a communication network). Examples of communication networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (e.g. the Internet) and peer-to-peer networks (e.g. ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
The above computer-readable medium can be included in the above terminal device, or it can exist separately without being assembled into the terminal device.
The above computer-readable medium carries one or more programs which, when executed by the terminal device, cause the terminal device to:
obtain the character position information contained in the text to be blocked, and determine at least one text line and the text line position information of the text line according to the character position information;
determine the dividing line information contained in the text to be blocked, and determine the target distance between text lines according to the text line position information and the dividing line information;
cluster the text lines according to the target distance, and determine at least one text block of the text to be blocked according to the clustering result of the text lines.
Computer program code for carrying out the operations of the present disclosure can be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code can execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, it can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (e.g. through the Internet using an Internet service provider).
The flowcharts and block diagrams in the drawings illustrate the architecture, functions and operations of possible implementations of the methods, apparatuses, terminal devices and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram can represent a module, program segment or portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks can occur in an order different from that noted in the drawings. For example, two blocks shown in succession can in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules and units described in the embodiments of the present disclosure can be implemented in software or in hardware. The name of a module or unit does not, under certain circumstances, constitute a limitation on the module or unit itself; for example, the text line determining module can also be described as "a module that obtains the character position information contained in the text to be blocked and determines at least one text line and the text line position information of the text line according to the character position information".
The functions described herein can be performed at least in part by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that can be used include field programmable gate arrays (FPGA), application specific integrated circuits (ASIC), application specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD) and so on.
In the context of the present disclosure, a machine-readable medium can be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any suitable combination of the above. More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
According to one or more embodiments of the present disclosure, example one provides a text processing method, comprising:
obtaining the character position information contained in the text to be blocked, and determining at least one text line and the text line position information of the text line according to the character position information;
determining the dividing line information contained in the text to be blocked, and determining the target distance between text lines according to the text line position information and the dividing line information;
clustering the text lines according to the target distance, and determining at least one text block of the text to be blocked according to the clustering result of the text lines.
According to one or more embodiments of the present disclosure, example two provides a text processing method in which, on the basis of the text processing method of example one, determining the target distance between text lines according to the text line position information and the dividing line information comprises:
determining the spatial distance between text lines according to the text line position information;
determining the segmentation distance between text lines according to the text line position information and the dividing line information, the segmentation distance being the number of segmentation points present between the text lines;
determining the target distance between text lines according to the spatial distance and the segmentation distance.
According to one or more embodiments of the present disclosure, example three provides a text processing method in which, on the basis of the text processing method of example two, determining the segmentation distance between text lines according to the text line position information and the dividing line information comprises:
determining the segmentation point identification range according to the text line position information;
obtaining the pixel values of the pixels within the segmentation point identification range, and taking the number of pixels in that range whose pixel value is greater than a set threshold as the segmentation distance.
According to one or more embodiments of the present disclosure, example four provides a text processing method in which, on the basis of the text processing method of example two, determining the target distance between text lines according to the spatial distance and the segmentation distance comprises:
computing a weighted sum of the spatial distance and the segmentation distance to obtain the target distance.
According to one or more embodiments of the present disclosure, example five provides a text processing method in which, on the basis of the text processing method of example one, determining the dividing line information contained in the text to be blocked comprises:
converting the regions of the text to be blocked other than the text lines into picture format, and converting the resulting picture to grayscale to obtain a grayscale picture;
filling the pixel values of the pixels in the regions of the grayscale picture that correspond to the text position information of the text to be blocked, to obtain a picture to be detected;
performing edge detection on the picture to be detected using an edge detection algorithm, and taking the detected edge information as the dividing line information.
According to one or more embodiments of the present disclosure, example six provides a text processing method in which, on the basis of the text processing method of example one, clustering the text lines according to the target distance and determining at least one text block of the text to be blocked according to the clustering result of the text lines comprises:
determining the adjacency matrix for text line clustering according to the target distances between text lines;
clustering the text lines on the basis of the adjacency matrix, and determining the text block position information of each class according to the text line position information of the text lines in that class.
According to one or more embodiments of the present disclosure, example seven provides a text processing method in which, on the basis of the text processing method of example one, the character position information includes character coordinates, and determining at least one text line and the text line position information of the text line according to the character position information comprises:
taking characters whose abscissas are contiguous and whose ordinates are identical as one text line, and determining the text line position information of the text line according to the character position information of the characters in the text line.
According to one or more embodiments of the present disclosure, example eight provides a text processing apparatus, comprising:
a text line determining module, configured to obtain the character position information contained in the text to be blocked, and determine at least one text line and the text line position information of the text line according to the character position information;
a target distance determining module, configured to determine the dividing line information contained in the text to be blocked, and determine the target distance between text lines according to the text line position information and the dividing line information;
a text block determining module, configured to cluster the text lines according to the target distance, and determine at least one text block of the text to be blocked according to the clustering result of the text lines.
According to one or more embodiments of the present disclosure, example nine provides a terminal device, comprising:
one or more processing devices;
a storage device, configured to store one or more programs;
wherein the one or more programs, when executed by the one or more processing devices, cause the one or more processing devices to implement the text processing method of any one of examples one to seven.
According to one or more embodiments of the present disclosure, example ten provides a computer-readable storage medium on which a computer program is stored, the program implementing the text processing method of any one of examples one to seven when executed by a processor.
The above description is only of the preferred embodiments of the present disclosure and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above disclosed concept, for example technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.
In addition, although the operations are depicted in a particular order, this should not be understood as requiring that they be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Similarly, although several specific implementation details are contained in the above discussion, they should not be construed as limiting the scope of the present disclosure. Certain features described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment can also be implemented in multiple embodiments, separately or in any suitable sub-combination.
Although the subject matter has been described in language specific to structural features and/or methodological logical actions, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. On the contrary, the specific features and actions described above are merely example forms of implementing the claims.
Claims (10)
1. a kind of text handling method characterized by comprising
The text location to include in piecemeal text is obtained, at least one line of text is determined according to the text location
And the line of text location information of the line of text;
The cut-off rule information to include in piecemeal text is determined, according to the line of text location information and the cut-off rule
Information determines the target range between line of text;
The line of text is clustered according to the target range, is determined according to the cluster result of the line of text described wait divide
At least one text block of block text.
2. The method according to claim 1, characterized in that determining the target distance between text lines
according to the text line position information and the dividing line information comprises:
determining a spatial distance between text lines according to the text line position information;
determining a segmentation distance between text lines according to the text line position information and
the dividing line information, the segmentation distance being the number of segmentation points existing
between the text lines;
determining the target distance between text lines according to the spatial distance and the segmentation
distance.
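The claim does not say how the spatial distance is measured. As a minimal sketch, assuming each text line's position information is an axis-aligned bounding box `(x0, y0, x1, y1)` and that the spatial distance is the Euclidean gap between two such boxes (both are assumptions for illustration, not taken from the claims), it could be computed as follows. The function name `spatial_distance` is hypothetical.

```python
from math import hypot


def spatial_distance(box_a, box_b):
    """Gap between two axis-aligned boxes given as (x0, y0, x1, y1).

    Returns 0.0 when the boxes overlap or touch.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Horizontal and vertical gaps; zero when the projections overlap.
    dx = max(bx0 - ax1, ax0 - bx1, 0)
    dy = max(by0 - ay1, ay0 - by1, 0)
    return hypot(dx, dy)
```

For two lines stacked vertically with a 3-pixel gap, for example `(0, 0, 10, 2)` and `(0, 5, 10, 7)`, this measure reduces to the vertical gap alone.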
3. The method according to claim 2, characterized in that determining the segmentation distance between text
lines according to the text line position information and the dividing line information comprises:
determining a segmentation point identification range according to the text line position information;
acquiring the pixel values of the pixels in the segmentation point identification range, and taking the
number of pixels in the segmentation point identification range whose pixel value is greater than a set
threshold as the segmentation distance.
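The counting step of this claim can be sketched as below. The choice of identification range is an assumption made only for illustration: here it is the vertical gap between two lines, limited to the horizontal span they share. The picture is assumed to be a 2D list of integer pixel values, and the name `segmentation_distance` and the default threshold are hypothetical.

```python
def segmentation_distance(edge_map, box_a, box_b, threshold=128):
    """Count pixels above `threshold` in the region between two text lines.

    edge_map: 2D list of pixel values (rows of ints), e.g. an edge-detected picture.
    box_a, box_b: (x0, y0, x1, y1) integer boxes inside the picture,
                  with box_a located above box_b.
    """
    # Assumed identification range: the vertical gap between the lines,
    # limited to the horizontal span covered by both lines.
    x0 = max(box_a[0], box_b[0])
    x1 = min(box_a[2], box_b[2])
    y0, y1 = box_a[3], box_b[1]
    count = 0
    for y in range(y0, y1):
        for x in range(x0, x1):
            if edge_map[y][x] > threshold:
                count += 1
    return count
```

A horizontal rule drawn between two paragraphs contributes one count per pixel of its length, so lines separated by a rule end up with a much larger segmentation distance than lines separated only by white space.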
4. The method according to claim 2, characterized in that determining the target distance between text lines
according to the spatial distance and the segmentation distance comprises:
performing a weighted summation of the spatial distance and the segmentation distance to obtain the target
distance.
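The weighted summation can be written in a few lines. The weights are not given in the claims, so `w_space` and `w_seg` below are hypothetical parameters with placeholder defaults.

```python
def target_distance(space_dist, seg_dist, w_space=1.0, w_seg=1.0):
    """Weighted sum of the spatial distance and the segmentation distance."""
    return w_space * space_dist + w_seg * seg_dist
```

With a spatial distance of 3.0, a segmentation distance of 6, and illustrative weights of 1.0 and 0.5, the result is 1.0 × 3.0 + 0.5 × 6 = 6.0. Raising `w_seg` makes a dividing line between two text lines count more heavily against putting them in the same block.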
5. The method according to claim 1, characterized in that determining the dividing line information contained
in the text to be partitioned comprises:
converting the regions of the text to be partitioned other than the text lines into picture format, and
graying the converted picture to obtain a grayscale picture;
filling the pixel values of the pixels in the region of the grayscale picture that corresponds to the
position information of the text to be partitioned, to obtain a picture to be detected;
performing edge detection on the picture to be detected by an edge detection algorithm, and taking the
detected edge information as the dividing line information.
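The fill-then-detect step can be illustrated as follows. The claim only says "an edge detection algorithm", so the neighbour-difference detector below is a simple stand-in chosen for illustration, not the detector of the disclosure. The grayscale picture is assumed to be a 2D list of integers in 0..255, and the names `detect_dividing_lines`, `background` and `min_step` are hypothetical.

```python
def detect_dividing_lines(gray, text_boxes, background=255, min_step=50):
    """Return a 0/255 edge map for a grayscale picture (2D list of ints).

    text_boxes: (x0, y0, x1, y1) integer boxes to blank out before detection.
    """
    height, width = len(gray), len(gray[0])
    # Work on a copy so the caller's picture is left untouched.
    filled = [row[:] for row in gray]
    # Fill text regions with the background value so glyph strokes are not
    # mistaken for dividing lines.
    for x0, y0, x1, y1 in text_boxes:
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                filled[y][x] = background
    # Stand-in edge detector: mark a pixel when the intensity jump to its
    # right or lower neighbour is at least `min_step`.
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            right = abs(filled[y][x] - filled[y][min(x + 1, width - 1)])
            below = abs(filled[y][x] - filled[min(y + 1, height - 1)][x])
            if max(right, below) >= min_step:
                edges[y][x] = 255
    return edges
```

Filling the text regions first is what keeps the detector from firing on the characters themselves, so only rules, borders and similar separators remain in the edge map.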
6. The method according to claim 1, characterized in that clustering the text lines according to the target
distance, and determining at least one text block of the text to be partitioned according to the clustering
result of the text lines, comprises:
determining an adjacency matrix for text line clustering according to the target distance between text lines;
clustering the text lines based on the adjacency matrix, and determining the text block position information
corresponding to each category according to the position information of the text lines in that category.
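One way to realise this claim is sketched below. The clustering algorithm is not named in the claim, so two assumptions are made for illustration: the adjacency matrix is built by thresholding the target distance, and the clusters are the connected components of that matrix. The name `cluster_lines`, the `distance` callback and the `max_distance` cut-off are hypothetical.

```python
def cluster_lines(line_boxes, distance, max_distance):
    """Cluster text lines and return (labels, one bounding box per cluster).

    line_boxes: list of (x0, y0, x1, y1) text line boxes.
    distance: function (box_a, box_b) -> target distance between two lines.
    max_distance: assumed cut-off; closer pairs count as adjacent.
    """
    n = len(line_boxes)
    # Adjacency matrix: 1 when two lines are close enough to share a block.
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if distance(line_boxes[i], line_boxes[j]) <= max_distance:
                adjacency[i][j] = adjacency[j][i] = 1
    # Connected components via depth-first search over the matrix.
    labels = [-1] * n
    n_clusters = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = n_clusters
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if adjacency[i][j] and labels[j] == -1:
                    labels[j] = n_clusters
                    stack.append(j)
        n_clusters += 1
    # Block position: union of the boxes of the lines in each cluster.
    blocks = []
    for c in range(n_clusters):
        members = [line_boxes[i] for i in range(n) if labels[i] == c]
        blocks.append((min(b[0] for b in members), min(b[1] for b in members),
                       max(b[2] for b in members), max(b[3] for b in members)))
    return labels, blocks
```

Any distance function can be passed in, for instance a weighted sum of a spatial distance and a segmentation distance. Other clustering methods that accept an adjacency or affinity matrix would fit the claim wording equally well.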
7. The method according to claim 1, characterized in that the character position information includes
character coordinates, and determining at least one text line and the text line position information of the
text line according to the character position information comprises:
taking characters whose abscissas are continuous and whose ordinates are the same as one text line, and
determining the text line position information of the text line according to the character position
information of the characters in the text line.
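The grouping rule of this claim can be sketched as follows, assuming each character's coordinates are a box `(x0, y0, x1, y1)` and that "continuous abscissa" means the horizontal gap to the previous character does not exceed a small tolerance. The tolerance `max_gap` and the name `group_characters_into_lines` are hypothetical; the claim itself gives no numeric value.

```python
def group_characters_into_lines(chars, max_gap=1):
    """Group characters into text lines.

    chars: list of (x0, y0, x1, y1) character boxes.
    max_gap: assumed tolerance; a horizontal gap larger than this
             starts a new line.
    Returns a list of (line_box, member_boxes) tuples.
    """
    lines = []
    # Sort by ordinate first, then abscissa, so each row is read left to right.
    for box in sorted(chars, key=lambda b: (b[1], b[0])):
        if lines:
            (lx0, ly0, lx1, ly1), members = lines[-1]
            same_row = box[1] == ly0
            contiguous = box[0] - lx1 <= max_gap
            if same_row and contiguous:
                members.append(box)
                # Grow the line box to cover the newly added character.
                lines[-1] = ((lx0, ly0, max(lx1, box[2]), max(ly1, box[3])), members)
                continue
        lines.append((box, [box]))
    return lines
```

The sketch compares ordinates for exact equality, as the claim wording does. Coordinates obtained from scanned pages would probably need a tolerance on the ordinate as well.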
8. A text processing apparatus, characterized by comprising:
a text line determining module, configured to acquire character position information contained in a text to
be partitioned, and determine at least one text line and text line position information of the text line
according to the character position information;
a target distance determining module, configured to determine, according to the text line position
information, dividing line information contained in the text to be partitioned, and determine a target
distance between text lines according to the text line position information and the dividing line information;
a text block determining module, configured to cluster the text lines according to the target distance, and
determine at least one text block of the text to be partitioned according to the clustering result of the
text lines.
9. A terminal device, characterized in that the terminal device comprises:
one or more processing devices;
a storage device for storing one or more programs;
when the one or more programs are executed by the one or more processing devices, the one or more processing
devices implement the text processing method according to any one of claims 1-7.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that the
computer program, when executed by a processor, implements the text processing method according to any one
of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910734656.9A CN110442719B (en) | 2019-08-09 | 2019-08-09 | Text processing method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110442719A (en) | 2019-11-12 |
CN110442719B (en) | 2022-03-04 |
Family
ID=68434244
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910734656.9A Active CN110442719B (en) | 2019-08-09 | 2019-08-09 | Text processing method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110442719B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120102388A1 (en) * | 2010-10-26 | 2012-04-26 | Jian Fan | Text segmentation of a document |
CN107832756A (en) * | 2017-10-24 | 2018-03-23 | 讯飞智元信息科技有限公司 | Express delivery list information extracting method and device, storage medium, electronic equipment |
US20190205362A1 (en) * | 2017-12-29 | 2019-07-04 | Konica Minolta Laboratory U.S.A., Inc. | Method for inferring blocks of text in electronic documents |
Non-Patent Citations (2)
Title |
---|
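Zhang Chong et al.: "Chinese layout segmentation method based on minimum spanning tree clustering", Computer Engineering (《计算机工程》) * |
Lu Songfeng et al.: "Web page partitioning algorithm for mobile devices", Journal of Chinese Computer Systems (《小型微型计算机系统》) * |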
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111680491A (en) * | 2020-05-27 | 2020-09-18 | 北京字节跳动科技有限公司 | Document information extraction method and device and electronic equipment |
CN111680491B (en) * | 2020-05-27 | 2024-02-02 | 北京字跳网络技术有限公司 | Method and device for extracting document information and electronic equipment |
CN113177959A (en) * | 2021-05-21 | 2021-07-27 | 广州普华灵动机器人技术有限公司 | QR code real-time extraction algorithm in rapid movement process |
CN113177959B (en) * | 2021-05-21 | 2022-05-03 | 广州普华灵动机器人技术有限公司 | QR code real-time extraction method in rapid movement process |
Also Published As
Publication number | Publication date |
---|---|
CN110442719B (en) | 2022-03-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||