CN110442719A - Text processing method, apparatus, device and storage medium

Text processing method, apparatus, device and storage medium

Info

Publication number
CN110442719A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910734656.9A
Other languages
Chinese (zh)
Other versions
CN110442719B (en)
Inventor
张航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Character Input (AREA)

Abstract

An embodiment of the present disclosure discloses a text processing method, apparatus, device and storage medium. The method includes: obtaining character position information contained in a text to be partitioned, and determining at least one text line and text line position information of the text line according to the character position information; determining dividing line information contained in the text to be partitioned, and determining a target distance between text lines according to the text line position information and the dividing line information; and clustering the text lines according to the target distance, and determining at least one text block of the text to be partitioned according to the clustering result of the text lines. The method partitions the text according to the character position information and the dividing line information, which simplifies the text partitioning process and improves the accuracy of the text partitioning result.

Description

Text processing method, apparatus, device and storage medium
Technical field
Embodiments of the present disclosure relate to the field of information technology, and in particular to a text processing method, apparatus, device and storage medium.
Background
Portable Document Format (PDF) is a file format that presents documents in a manner independent of application software, hardware and operating system. A PDF file reproduces document styling well, but because its main purpose is to guarantee the rendering result, the structural information of the content is ignored. The logical or semantic structure of PDF content therefore cannot be obtained directly, which makes the content hard to structure. If text is extracted from a PDF document directly, without text partitioning, the extracted text may come out in the wrong order. Text regions therefore need to be boxed into blocks so that the character order inside each block is correct, and the blocks are then arranged from top to bottom and from left to right. Text partitioning is thus the basis of PDF document structuring.
Current text partitioning methods include: methods that use the horizontal and vertical coordinates of page elements to convert the two-dimensional plane segmentation problem into a one-dimensional string parsing problem and then use rules to separate the corresponding elements; segmentation algorithms based on morphological operations; the Thiessen polygon (Voronoi) algorithm; the constrained run-length algorithm; and region detection algorithms based on deep learning. However, these methods either require a large number of rules and parameters to be set and yield results of limited accuracy, or require a large amount of labeled data for training, which makes the process cumbersome.
Summary of the invention
The present disclosure provides a text processing method, apparatus, device and storage medium, so as to simplify the text partitioning process and improve the accuracy of the text partitioning result.
In a first aspect, an embodiment of the present disclosure provides a text processing method, including:
obtaining character position information contained in a text to be partitioned, and determining at least one text line and text line position information of the text line according to the character position information;
determining dividing line information contained in the text to be partitioned, and determining a target distance between text lines according to the text line position information and the dividing line information;
clustering the text lines according to the target distance, and determining at least one text block of the text to be partitioned according to the clustering result of the text lines.
In a second aspect, an embodiment of the present disclosure further provides a text processing apparatus, including:
a text line determining module, configured to obtain character position information contained in a text to be partitioned, and to determine at least one text line and text line position information of the text line according to the character position information;
a target distance determining module, configured to determine dividing line information contained in the text to be partitioned, and to determine a target distance between text lines according to the text line position information and the dividing line information;
a text block determining module, configured to cluster the text lines according to the target distance, and to determine at least one text block of the text to be partitioned according to the clustering result of the text lines.
In a third aspect, an embodiment of the present disclosure further provides a terminal device, including:
one or more processing devices;
a storage device, configured to store one or more programs;
when the one or more programs are executed by the one or more processing devices, the one or more processing devices implement the text processing method according to any embodiment of the present disclosure.
In a fourth aspect, an embodiment of the present disclosure further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a computer processor, perform the text processing method according to any embodiment of the present disclosure.
In the embodiments of the present disclosure, character position information contained in a text to be partitioned is obtained; at least one text line and the text line position information of the text line are determined according to the character position information; dividing line information contained in the text to be partitioned is determined; a target distance between text lines is determined according to the text line position information and the dividing line information; the text lines are clustered according to the target distance; and at least one text block of the text to be partitioned is determined according to the clustering result of the text lines. Partitioning the text according to the character position information and the dividing line information simplifies the text partitioning process and improves the accuracy of the text partitioning result.
Brief description of the drawings
The above and other features, advantages and aspects of the embodiments of the present disclosure will become more apparent with reference to the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale.
Fig. 1 is a flowchart of a text processing method provided by an embodiment of the present disclosure;
Fig. 2 is a flowchart of a text processing method provided by an embodiment of the present disclosure;
Fig. 3a is a flowchart of a text processing method provided by an embodiment of the present disclosure;
Fig. 3b is a schematic diagram of a text block extraction result in a text processing method provided by an embodiment of the present disclosure;
Fig. 3c is a schematic diagram of a segmentation map in a text processing method provided by an embodiment of the present disclosure;
Fig. 3d is a schematic diagram of a text line clustering result in a text processing method provided by an embodiment of the present disclosure;
Fig. 3e is a schematic diagram of a text to be partitioned in a text processing method provided by an embodiment of the present disclosure;
Fig. 3f is a schematic diagram of a partitioning result of the text to be partitioned in a text processing method provided by an embodiment of the present disclosure;
Fig. 4 is a schematic structural diagram of a text processing apparatus provided by an embodiment of the present disclosure;
Fig. 5 is a schematic structural diagram of a terminal device provided by an embodiment of the present disclosure.
Detailed description
Embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of protection of the present disclosure.
It should be understood that the steps described in the method embodiments of the present disclosure may be performed in a different order and/or in parallel. In addition, method embodiments may include additional steps and/or omit some of the steps shown. The scope of the present disclosure is not limited in this respect.
The term "including" and its variants as used herein are open-ended, i.e. "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one other embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms are given in the description below.
Note that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different devices, modules or units, and are not intended to limit the order of, or the interdependence between, the functions performed by these devices, modules or units.
Note that the modifiers "one" and "multiple" mentioned in the present disclosure are illustrative rather than restrictive; those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be read as "one or more".
The names of messages or information exchanged between multiple devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of these messages or information.
In each of the following embodiments, optional features and examples are provided together; the features described in the embodiments can be combined to form multiple alternative solutions, and each numbered embodiment should not be regarded as only a single technical solution.
Embodiment 1
Fig. 1 is a flowchart of a text processing method provided by an embodiment of the present disclosure. This embodiment is applicable to partitioning PDF text into text blocks. The method may be performed by a text processing apparatus, which may be implemented in software and/or hardware; for example, the apparatus may be configured in a terminal device. As shown in Fig. 1, the method includes:
S110: obtain character position information contained in the text to be partitioned, and determine at least one text line and the text line position information of the text line according to the character position information.
In the embodiments of the present disclosure, the text to be partitioned may be one or more pages of a PDF text. The character position information may be the coordinates of each character in the text to be partitioned. A PDF text may contain elements such as characters and pictures, and the data stream of the PDF text contains element information for all elements; the element information differs by element type. For example, when the element is a character, the element information may include coordinates, font and size; when the element is a picture, it may include coordinates, height and width.
The coordinates of each character can be obtained by parsing the data stream of the PDF text. For example, whether the element corresponding to a piece of element information is a character can be judged from the element information itself; when it is a character, the coordinates contained in the element information are taken as the character coordinates. Optionally, it can be checked whether the element information contains "font" information; if it does, the element is determined to be a character.
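The "font" check described above can be sketched as follows. This is a minimal illustration only: the dictionary layout of an element (keys `x`, `y`, `font`, `size`, `width`, `height`) and the function name `extract_char_coords` are assumptions made for the example, not a format defined by the disclosure or by any PDF library.

```python
from typing import Any, Dict, List, Tuple


def extract_char_coords(elements: List[Dict[str, Any]]) -> List[Tuple[float, float]]:
    """Return (x, y) for every element whose info carries a "font" entry.

    The element layout is hypothetical: characters are assumed to carry a
    "font" key, pictures are assumed not to.
    """
    return [(e["x"], e["y"]) for e in elements if "font" in e]


# Hypothetical parsed elements: two characters and one picture.
elements = [
    {"x": 10.0, "y": 20.0, "font": "SimSun", "size": 12},
    {"x": 0.0, "y": 100.0, "width": 200, "height": 80},
    {"x": 22.0, "y": 20.0, "font": "SimSun", "size": 12},
]
coords = extract_char_coords(elements)
```

Only the two elements with font information contribute coordinates; the picture is skipped.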
After the character position information of all characters in the text to be partitioned is obtained, the text lines and their text line position information are determined from it. Optionally, the characters are divided into text lines according to their coordinates, and the position information of each text line is then determined. How characters are divided into text lines is not limited here. Optionally, the region formed by characters that have the same ordinate and whose abscissas are within a set distance threshold of each other may be taken as one text line; alternatively, the region formed by characters that have the same ordinate and consecutive abscissas may be taken as one text line. The distance threshold may be set according to the character distribution of the text to be partitioned.
Optionally, the position information of a text line may include a vertex coordinate of the text line (such as the top-left, bottom-left, top-right or bottom-right vertex), the height of the text line and the width of the text line. After the text lines are determined from the character coordinates, for each text line the coordinates of the character at at least one end of the line are determined, and the vertex coordinate of the text line is determined from them. For example, if the vertex of a text line is defined as its top-left corner, the top-left vertex coordinate of the character at the left end of the line is taken as the vertex coordinate of the text line. The element information of the characters is obtained, and the height of the text line is determined from the font size contained in the element information.
In the embodiments of the present disclosure, the width of a text line may be determined from the number of characters in the line and the element information, or from the coordinates of the characters at the ends of the line. For example, the number of characters in the line can be determined, and the width computed from the font size, the number of characters and the character spacing. Alternatively, the coordinates of the characters at the two ends of the line can be obtained, and the distance between them taken as the width of the text line.
In one embodiment, the character position information includes character coordinates, and determining at least one text line and the text line position information of the text line according to the character position information includes: taking characters with consecutive abscissas and the same ordinate as one text line, and determining the text line position information of the text line according to the character position information of the characters in the text line.
Preferably, characters with consecutive abscissas and the same ordinate are taken as one text line, and the text line position information is determined from the position information of the characters in the line. Doing so ensures that each resulting text line contains continuous text, which makes the division into text lines more accurate. The way of determining the text line position information from the character position information is described above and is not repeated here.
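A minimal sketch of this grouping step is given below. It assumes a hypothetical `Char` record with a top-left coordinate, a width and a font size, and it interprets "consecutive abscissas" as "the gap to the previous character is at most `max_gap`"; both the record and that interpretation are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Char:
    x: float      # left edge of the character (assumed field)
    y: float      # top edge of the character (assumed field)
    width: float  # rendered width (assumed field)
    size: float   # font size, used here as the character height


def _bbox(line: List[Char]) -> Tuple[float, float, float, float]:
    """Text line position as (x, y, width, height) from its end characters."""
    first, last = line[0], line[-1]
    width = last.x + last.width - first.x
    height = max(c.size for c in line)
    return (first.x, first.y, width, height)


def group_into_lines(chars: List[Char], max_gap: float = 1.0):
    """Group characters with the same ordinate and consecutive abscissas."""
    lines, current = [], []
    for ch in sorted(chars, key=lambda c: (c.y, c.x)):
        if current:
            prev = current[-1]
            same_row = ch.y == prev.y
            adjacent = ch.x - (prev.x + prev.width) <= max_gap
            if not (same_row and adjacent):
                lines.append(_bbox(current))
                current = []
        current.append(ch)
    if current:
        lines.append(_bbox(current))
    return lines


chars = [Char(0, 0, 10, 12), Char(10, 0, 10, 12), Char(50, 0, 10, 12), Char(0, 20, 10, 12)]
lines = group_into_lines(chars)
```

In the sample, the third character shares the ordinate of the first two but sits far to the right, so it starts a separate text line, as a second column would.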
S120: determine the dividing line information contained in the text to be partitioned, and determine the target distance between text lines according to the text line position information and the dividing line information.
To partition the text into text blocks more accurately, the embodiments of the present disclosure use the dividing line information contained in the text to be partitioned as a parameter for text block division: the target distance between text lines is determined from the text line position information and the dividing line information, and text blocks are divided based on this target distance. Using the dividing line information as a parameter makes it possible to place two text lines that are close together but do not actually belong to the same text region (for example, two nearby text lines with a dividing line between them) into different text blocks.
In the embodiments of the present disclosure, the dividing line information contained in the text to be partitioned may be determined by an image segmentation algorithm. Because the characters in the text to be partitioned may affect the image segmentation result and make the extracted dividing line information inaccurate, the character regions in the text to be partitioned may first be filled with color to remove their influence on the segmentation result.
Optionally, edge detection can be applied to the picture to obtain the pixel matrix of an image that contains only dividing lines. If the pixel value of a pixel in that picture falls within a set pixel value range, the color changes sharply at that pixel, and the pixel may belong to a dividing line.
In one embodiment, determining the dividing line information contained in the text to be partitioned includes:
converting the regions of the text to be partitioned other than the text lines into picture format, and converting the resulting picture to grayscale to obtain a grayscale picture;
filling the pixel values of the pixels in the regions of the grayscale picture that correspond to the text line position information, to obtain a picture to be detected;
performing edge detection on the picture to be detected with an edge detection algorithm, and taking the detected edge information as the dividing line information.
To fill the character regions of the text to be partitioned with color, the regions other than the text lines are converted into picture format, the resulting picture is converted to grayscale, and the text line regions are then filled with pixel values on the grayscale picture. Optionally, if the background of the text to be partitioned is a single color, the pixel value of the background color can be used directly as the pixel value of the text line regions. If the background contains multiple colors, such as a gradient, the pixel values of a text line region can be filled by interpolation from the pixels surrounding the region, to obtain the picture to be detected. Interpolation brings the pixel values of the text line region closer to the background, which makes the extraction of dividing line information more accurate. The interpolation method is not limited here; for example, bilinear interpolation can be used.
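The two filling strategies can be sketched on a plain nested-list grayscale image. This is a simplified illustration under stated assumptions: the image is indexed as `img[row][col]`, the function names are hypothetical, and the interpolated variant blends only horizontally between the pixels just left and right of the region (a one-dimensional stand-in for the bilinear interpolation mentioned in the text). It also assumes the region does not touch the left or right image border.

```python
from typing import List

Image = List[List[float]]  # grayscale picture, indexed as img[row][col]


def fill_solid(img: Image, x: int, y: int, w: int, h: int, background: float) -> None:
    """Overwrite a text line region with a single background value."""
    for r in range(y, y + h):
        for c in range(x, x + w):
            img[r][c] = background


def fill_interpolated(img: Image, x: int, y: int, w: int, h: int) -> None:
    """Per row, blend linearly between the pixels just outside the region."""
    for r in range(y, y + h):
        left, right = img[r][x - 1], img[r][x + w]
        for k, c in enumerate(range(x, x + w), start=1):
            t = k / (w + 1)  # 0 < t < 1 across the region
            img[r][c] = (1 - t) * left + t * right


# One-row gradient background (0 .. 40) with a 3-pixel "text" region at 99.
row_img = [[0, 99, 99, 99, 40]]
fill_interpolated(row_img, x=1, y=0, w=3, h=1)

# Solid white page with a 2x2 dark text region.
page = [[255, 255, 255, 255], [255, 0, 0, 255], [255, 0, 0, 255], [255, 255, 255, 255]]
fill_solid(page, x=1, y=1, w=2, h=2, background=255)
```

After filling, the text region follows the surrounding gradient instead of producing sharp edges of its own, which is the property the edge detection step relies on.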
After the picture to be detected is obtained, the dividing line information it contains is detected with an edge detection algorithm. The edge detection algorithm is not limited in the embodiments of the present disclosure; for example, the Canny operator, the Roberts operator or the Sobel operator can be used to perform edge detection on the picture to be detected.
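As one possible instance of this step, the sketch below applies the standard 3x3 Sobel kernels to a nested-list grayscale image and returns the gradient magnitude per pixel. The function name and the choice to leave border pixels at zero are assumptions for the example; a production pipeline would more likely call an image library.

```python
import math
from typing import List


def sobel_magnitude(img: List[List[float]]) -> List[List[float]]:
    """Gradient magnitude from the 3x3 Sobel kernels; border pixels stay 0."""
    rows, cols = len(img), len(img[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Horizontal gradient: right column minus left column.
            gx = (img[r - 1][c + 1] + 2 * img[r][c + 1] + img[r + 1][c + 1]) - (
                img[r - 1][c - 1] + 2 * img[r][c - 1] + img[r + 1][c - 1]
            )
            # Vertical gradient: bottom row minus top row.
            gy = (img[r + 1][c - 1] + 2 * img[r + 1][c] + img[r + 1][c + 1]) - (
                img[r - 1][c - 1] + 2 * img[r - 1][c] + img[r - 1][c + 1]
            )
            out[r][c] = math.hypot(gx, gy)
    return out


# White 5x5 page with a dark horizontal rule on row 2.
page = [[255] * 5 for _ in range(5)]
page[2] = [0] * 5
edges = sobel_magnitude(page)
```

The response is strong on the rows directly above and below the rule and zero on the rule itself, so a horizontal dividing line shows up as a pair of edge rows.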
In one embodiment, the spatial distance between text lines can be adjusted according to the dividing line information between them to obtain the target distance. For example, the spatial distance between text lines is calculated from the text line position information; the presence of dividing lines between the text lines is judged from the text line position information and the dividing line information; an adjustment parameter for the spatial distance is determined from that presence case; and the target distance is calculated from the spatial distance and the adjustment parameter. Optionally, a correspondence between dividing-line presence cases and adjustment parameters can be set in advance; once the presence case for a pair of text lines is determined, the adjustment parameter is found by looking up this correspondence. For example, the spatial distance and the adjustment parameter can be added or multiplied, and the result taken as the target distance.
The presence case may simply be that a dividing line exists between the text lines or that none exists. The case in which a dividing line exists can be subdivided further by the length of the dividing line: the length is divided into N length ranges, and each range is treated as one presence case. Alternatively, it can be subdivided by the ratio of the dividing line length to the text line width: the ratio is divided into M ratio ranges, and each range is treated as one presence case. M and N are integers greater than 1.
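The lookup of an adjustment parameter by ratio range might look like the sketch below. All numbers are assumptions chosen for illustration: the ratio is split into M = 4 ranges at 0.25, 0.5 and 0.75, each range has an invented adjustment value, and the adjustment is combined with the spatial distance by addition (the text also allows multiplication).

```python
import bisect

# Assumed upper bounds of the first three ratio ranges (M = 4 ranges in total).
RATIO_BOUNDS = [0.25, 0.5, 0.75]
# Assumed adjustment parameter for each of the four ranges.
ADJUSTMENTS = [0.5, 1.0, 2.0, 4.0]


def adjustment(divider_length: float, line_width: float) -> float:
    """Look up the adjustment for the ratio of divider length to line width."""
    if divider_length <= 0:
        return 0.0  # no dividing line between the two text lines
    ratio = divider_length / line_width
    return ADJUSTMENTS[bisect.bisect_left(RATIO_BOUNDS, ratio)]


def adjusted_distance(spatial: float, divider_length: float, line_width: float) -> float:
    """Target distance as spatial distance plus the looked-up adjustment."""
    return spatial + adjustment(divider_length, line_width)
```

With these assumed values, a dividing line spanning the full line width pushes two otherwise close text lines much further apart than a short stub does.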
S130: cluster the text lines according to the target distance, and determine at least one text block of the text to be partitioned according to the clustering result of the text lines.
In the embodiments of the present disclosure, after the target distance between every two text lines is determined, the text lines are clustered according to the target distance, and the set of text lines in the same class is taken as one text block. The clustering method is not limited; for example, K-means clustering, hierarchical clustering or SOM neural network clustering can be used to cluster the text lines based on the target distance between them.
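As a small illustration of clustering on pairwise target distances, the sketch below performs single-linkage hierarchical clustering cut at a distance threshold, implemented with a union-find structure. The threshold value and the function name are assumptions; the disclosure does not prescribe this particular algorithm or cut-off.

```python
from typing import List


def cluster_by_threshold(dist: List[List[float]], threshold: float) -> List[List[int]]:
    """Single-linkage clustering: merge lines whose distance is <= threshold."""
    n = len(dist)
    parent = list(range(n))

    def find(i: int) -> int:
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Pairwise target distances for five text lines (invented values).
dist = [
    [0, 1, 2, 9, 9],
    [1, 0, 1, 9, 9],
    [2, 1, 0, 9, 9],
    [9, 9, 9, 0, 1],
    [9, 9, 9, 1, 0],
]
blocks = cluster_by_threshold(dist, threshold=1.5)
```

Lines 0 and 2 are further apart than the threshold but end up in the same block because both are close to line 1, which is the chaining behaviour expected for consecutive lines of a paragraph.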
In one embodiment, clustering the text lines according to the target distance, and determining at least one text block of the text to be partitioned according to the clustering result of the text lines, includes:
determining the adjacency matrix for text line clustering according to the target distance between text lines;
clustering the text lines based on the adjacency matrix, and determining the text block position information corresponding to each class according to the position information of the text lines in that class.
Optionally, spectral clustering can be used to cluster the text lines based on the target distance between them. Specifically, each text line is treated as a node and assigned an identifier, and the target distance between two text lines is used as the weight of the edge between the corresponding two nodes, which gives the adjacency matrix. For example, the element d_ij in row i and column j of the adjacency matrix is the target distance between text line l_i and text line l_j. After the adjacency matrix is obtained, an Ncut partition is performed to obtain the clustering result of the text lines, and the text block information is determined from the clustering result.
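Building the adjacency matrix can be sketched as below. The distance function passed in is a placeholder (vertical offset only), and the `to_affinity` conversion is an assumption added for the example: spectral methods are commonly run on similarities rather than distances, and a Gaussian-style kernel such as exp(-d / sigma) is one usual way to convert. The Ncut step itself is not implemented here.

```python
import math
from typing import Callable, List, Sequence, Tuple

Line = Tuple[float, float, float, float]  # (x, y, width, height), top-left origin


def adjacency_matrix(lines: Sequence[Line], distance: Callable[[Line, Line], float]) -> List[List[float]]:
    """Entry [i][j] is the target distance between text line i and text line j."""
    n = len(lines)
    return [
        [0.0 if i == j else distance(lines[i], lines[j]) for j in range(n)]
        for i in range(n)
    ]


def to_affinity(dist: List[List[float]], sigma: float = 1.0) -> List[List[float]]:
    """Assumed conversion from distance to similarity for spectral methods."""
    return [[math.exp(-d / sigma) for d in row] for row in dist]


# Placeholder distance: vertical offset between the top edges of two lines.
lines = [(0, 0, 100, 10), (0, 12, 100, 10), (0, 60, 100, 10)]
D = adjacency_matrix(lines, lambda a, b: abs(a[1] - b[1]))
A = to_affinity(D, sigma=10.0)
```

The resulting matrix is symmetric, and after conversion the two neighbouring lines have a much higher affinity than either has with the distant third line.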
For example, if the clustering result is that text lines 1, 2 and 3 belong to cluster set 1 and text lines 4 and 5 belong to cluster set 2, then text lines 1, 2 and 3 form text block 1 and text lines 4 and 5 form text block 2. The position information of text block 1 is determined from the position information of text lines 1, 2 and 3, and the position information of text block 2 from the position information of text lines 4 and 5. For example, the position information of a text block may include the vertex coordinate, the width and the height of the text block. The vertex coordinate of a text block can be determined from the vertex coordinates of the text lines in the block, and its width and height from the coordinate information of those text lines.
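Deriving a block's position from its text lines amounts to taking the bounding box of the member lines. The sketch below assumes each text line is given as `(x, y, width, height)` with a top-left origin; the tuple layout and the function name are illustrative choices.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height), top-left origin


def block_bbox(lines: Sequence[Box]) -> Box:
    """Smallest box that encloses every text line in one cluster."""
    left = min(x for x, _, _, _ in lines)
    top = min(y for _, y, _, _ in lines)
    right = max(x + w for x, _, w, _ in lines)
    bottom = max(y + h for _, y, _, h in lines)
    return (left, top, right - left, bottom - top)


# Three text lines assigned to the same cluster (invented coordinates).
block_1 = block_bbox([(10, 10, 100, 12), (10, 24, 80, 12), (12, 38, 110, 12)])
```

The block's vertex comes from the leftmost and topmost line edges, and its width and height span to the rightmost and lowest edges.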
In the embodiments of the present disclosure, character position information contained in a text to be partitioned is obtained; at least one text line and the text line position information of the text line are determined according to the character position information; dividing line information contained in the text to be partitioned is determined; a target distance between text lines is determined according to the text line position information and the dividing line information; the text lines are clustered according to the target distance; and at least one text block of the text to be partitioned is determined according to the clustering result of the text lines. Partitioning the text according to the character position information and the dividing line information simplifies the text partitioning process and improves the accuracy of the text partitioning result.
Embodiment 2
Fig. 2 is a flowchart of a text processing method provided by an embodiment of the present disclosure. This embodiment may be combined with the alternatives in one or more of the above embodiments. As shown in Fig. 2, the method includes:
S210: obtain character position information contained in the text to be partitioned, and determine at least one text line and the text line position information of the text line according to the character position information.
S220: determine the dividing line information contained in the text to be partitioned.
S230: determine the spatial distance between text lines according to the text line position information.
In the embodiments of the present disclosure, a spatial distance calculation rule can be set in advance, and the spatial distance between text lines is calculated from this rule and the text line position information. Optionally, the position information of a text line can be represented by four parameters. For example, the spatial distance between text line $l_i$ and text line $l_j$ is defined as $d_0(l_i, l_j) = \alpha \frac{|x_i - x_j|}{\max(w_i, w_j)} + \frac{|y_i - y_j|}{\max(h_i, h_j)}$, where $d_0(l_i, l_j)$ is the spatial distance between text line $l_i$ and text line $l_j$; $(x_i, y_i)$ is the position of the top-left vertex of text line $l_i$, $h_i$ is its height and $w_i$ is its width; $(x_j, y_j)$ is the position of the top-left vertex of text line $l_j$, $h_j$ is its height and $w_j$ is its width; and $\alpha$ is a parameter that controls the relative importance of line spacing and column spacing. Optionally, $\alpha$ can be 1.5.
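The spatial distance formula translates directly into code. The sketch assumes each text line is a tuple `(x, y, w, h)` holding the top-left vertex, width and height; the tuple layout and function name are illustrative, while the formula and the default α = 1.5 follow the text.

```python
from typing import Tuple

Line = Tuple[float, float, float, float]  # (x, y, w, h), top-left vertex


def spatial_distance(a: Line, b: Line, alpha: float = 1.5) -> float:
    """d0 = alpha * |xi - xj| / max(wi, wj) + |yi - yj| / max(hi, hj)."""
    xi, yi, wi, hi = a
    xj, yj, wj, hj = b
    return alpha * abs(xi - xj) / max(wi, wj) + abs(yi - yj) / max(hi, hj)


line_a = (0, 0, 100, 10)
below = (0, 15, 80, 10)      # next line in the same column
beside = (200, 0, 100, 10)   # same height, in another column

d_below = spatial_distance(line_a, below)
d_beside = spatial_distance(line_a, beside)
```

Normalising by the larger width and height makes the measure scale-free, and α > 1 penalises a horizontal offset more heavily than a vertical one of the same relative size.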
S240: determine the division distance between text lines according to the text line position information and the dividing line information, the division distance being the number of division points present between the text lines.
In the embodiments of the present disclosure, the division distance between text lines is expressed as the number of division points present between them. In one embodiment, determining the division distance between text lines according to the text line position information and the dividing line information includes:
determining a division point identification range according to the text line position information;
obtaining the pixel values of the pixels in the division point identification range, and taking the number of pixels in the range whose pixel value is greater than a set threshold as the division distance.
Optionally, the division point identification range can be determined from the position information of text line $l_i$ and of text line $l_j$. For example, if the top-left vertex of text line $l_i$ is $(x_i, y_i)$ and the top-left vertex of text line $l_j$ is $(x_j, y_j)$, the position range covered by the set of points $(p, q)$ that satisfy $x_j \le p \le x_i + w_i$ and $y_i + h_i \le q \le y_j$ can be taken as the division point identification range. After the range is determined, the number of division points in it is determined from the pixel value of each pixel in the range. For example, pixels whose pixel value is greater than the set threshold can be taken as division points.
Optionally, the number of cut-points between text lines can be determined from the pixel matrix of the picture obtained by edge detection. Illustratively, the segmentation distance between text line l_i and text line l_j is defined as: d_1(l_i, l_j) = |{(p, q) : x_j ≤ p ≤ x_i + w_i, y_i + h_i ≤ q ≤ y_j, I(p, q) > θ}|, that is, the number of pixels in the cut-point identification range whose pixel value exceeds θ.
Here, d_1(l_i, l_j) denotes the segmentation distance between text line l_i and text line l_j, (x_i, y_i) are the position coordinates of the top-left vertex of text line l_i, h_i is the height of text line l_i, w_i is the width of text line l_i, (x_j, y_j) are the position coordinates of the top-left vertex of text line l_j, I(p, q) is the pixel value of the pixel at coordinate (p, q) in the pixel matrix, and θ is a preset pixel value threshold. Optionally, the value of θ can be 50.
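The cut-point count described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: text lines are given as integer tuples `(x, y, w, h)`, the edge map is a list of rows indexed as `edge_map[q][p]`, text line l_i lies above text line l_j, and the range is clipped to the picture bounds. The function name `segmentation_distance` is hypothetical.

```python
def segmentation_distance(edge_map, line_i, line_j, theta=50):
    """Count pixels with value > theta in the cut-point identification range.

    The range is x_j <= p <= x_i + w_i and y_i + h_i <= q <= y_j,
    clipped to the picture (an assumption made here for robustness).
    """
    xi, yi, wi, hi = line_i
    xj, yj, _wj, _hj = line_j
    rows = len(edge_map)
    cols = len(edge_map[0]) if rows else 0

    p_lo, p_hi = max(xj, 0), min(xi + wi, cols - 1)
    q_lo, q_hi = max(yi + hi, 0), min(yj, rows - 1)

    count = 0
    for q in range(q_lo, q_hi + 1):      # empty range yields a count of 0
        for p in range(p_lo, p_hi + 1):
            if edge_map[q][p] > theta:
                count += 1
    return count
```

A horizontal cut-off rule that spans the gap between two vertically adjacent text lines contributes roughly one cut-point per column, so wider rules produce larger segmentation distances.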
S250, determine the target distance between text lines according to the spatial distance and the segmentation distance.
After the spatial distance and the segmentation distance between text lines are determined, the target distance between the text lines is calculated from them. Optionally, the spatial distance and the segmentation distance can be combined by weighted summation to obtain the target distance. In one embodiment, determining the target distance between text lines according to the spatial distance and the segmentation distance comprises: performing weighted summation of the spatial distance and the segmentation distance to obtain the target distance.
Optionally, the target distance calculation rule is defined as: d(l_i, l_j) = d_0(l_i, l_j) + λ·d_1(l_i, l_j), where d(l_i, l_j) is the target distance between text line l_i and text line l_j, d_0(l_i, l_j) is the spatial distance between text line l_i and text line l_j, d_1(l_i, l_j) is the segmentation distance between text line l_i and text line l_j, and λ is the weight of the segmentation distance. The value of λ can be adjusted according to the location parameters of the text lines.
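A short worked example may help to show the effect of the weighted sum. The sketch below is illustrative only: the function name `target_distance` and the value λ = 0.5 are assumptions, since the description does not fix a value for λ.

```python
def target_distance(d0, d1, lam):
    """d(l_i, l_j) = d_0(l_i, l_j) + lam * d_1(l_i, l_j)."""
    return d0 + lam * d1


# Two pairs of text lines with the same spatial distance.
# Only the second pair has a cut-off rule (9 cut-points) between its lines.
same_block = target_distance(d0=2.0, d1=0, lam=0.5)
across_rule = target_distance(d0=2.0, d1=9, lam=0.5)
```

With these hypothetical numbers the pair separated by a rule ends up more than three times as far apart (6.5 versus 2.0), which is what pushes the two text lines into different clusters.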
S260, cluster the text lines according to the target distance, and determine at least one text block of the to-be-blocked text according to the clustering result of the text lines.
The technical solution of this embodiment of the present disclosure specifies how the target distance between text lines is determined from the text line location information and the cut-off rule information: the spatial distance between text lines is determined from the text line location information, the segmentation distance between text lines is determined from the text line location information and the cut-off rule information, and the target distance is determined from the spatial distance and the segmentation distance. The target distance is thereby calculated accurately, which makes the text line clustering result based on the target distance more accurate.
Embodiment three
Fig. 3a is a flow chart of a text handling method provided by an embodiment of the present disclosure. On the basis of the above embodiments, this embodiment provides a preferred implementation. As shown in Fig. 3a, the method comprises:
S310, start.
S320, obtain a PDF document.
Obtain the PDF document input by the user.
S330, extract the text information of the PDF document.
Text information, including text coordinates, font size, width and height, is extracted from the PDF file information stream.
S340, generate text lines from the text information.
Text lines are determined according to the text location information. Fig. 3b is a schematic diagram of the extraction result in a text handling method provided by an embodiment of the present disclosure. As shown in Fig. 3b, the black regions in the figure are the extracted text lines.
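One way to read the text line generation step is sketched below in Python: characters with the same ordinate and contiguous abscissas are merged into one text line, whose location is the bounding box of its characters. The tuple layout `(x, y, w, h)`, the function names and the `max_gap` tolerance are assumptions made for this sketch and are not specified in the description.

```python
def build_text_lines(chars, max_gap=1.0):
    """Group character boxes (x, y, w, h) into text line boxes (x, y, w, h).

    A character joins the current line if it has the same ordinate as the
    previous character and starts within max_gap of where that one ends.
    """
    lines, current = [], []
    for ch in sorted(chars, key=lambda c: (c[1], c[0])):  # by row, then left to right
        if current:
            prev = current[-1]
            same_row = ch[1] == prev[1]
            contiguous = ch[0] - (prev[0] + prev[2]) <= max_gap
            if not (same_row and contiguous):
                lines.append(_bounding_box(current))
                current = []
        current.append(ch)
    if current:
        lines.append(_bounding_box(current))
    return lines


def _bounding_box(boxes):
    """Smallest box (x, y, w, h) covering all given boxes."""
    x0 = min(b[0] for b in boxes)
    y0 = min(b[1] for b in boxes)
    x1 = max(b[0] + b[2] for b in boxes)
    y1 = max(b[1] + b[3] for b in boxes)
    return (x0, y0, x1 - x0, y1 - y0)
```

In a two-column page, the gap between columns exceeds `max_gap`, so characters on the same baseline but in different columns form separate text lines.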
S350, convert the PDF document into a grayscale picture, and remove the text portions using bilinear interpolation.
To capture the other segmentation information in the document, the parts of the document other than the text are converted into a picture, the picture is converted to grayscale, and the text portions of the resulting grayscale picture are filled in using bilinear interpolation.
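The description does not spell out how the bilinear fill is performed. The Python sketch below shows one plausible reading, in which each text rectangle is overwritten with values interpolated bilinearly from the four pixels just outside its corners. The function name, the choice of corner pixels and the list-of-rows picture layout are all assumptions of this sketch.

```python
def fill_region_bilinear(img, x, y, w, h):
    """Overwrite the box (x, y, w, h) in img (list of rows) in place.

    Each pixel is interpolated from the four pixels diagonally adjacent to
    the box corners, clamped to the picture bounds.
    """
    rows, cols = len(img), len(img[0])
    x0, x1 = max(x - 1, 0), min(x + w, cols - 1)
    y0, y1 = max(y - 1, 0), min(y + h, rows - 1)
    # Read the corner values before any pixel is overwritten.
    tl, tr = img[y0][x0], img[y0][x1]
    bl, br = img[y1][x0], img[y1][x1]

    for q in range(max(y, 0), min(y + h, rows)):
        v = (q - y0) / (y1 - y0) if y1 > y0 else 0.0
        for p in range(max(x, 0), min(x + w, cols)):
            u = (p - x0) / (x1 - x0) if x1 > x0 else 0.0
            top = tl * (1 - u) + tr * u
            bottom = bl * (1 - u) + br * u
            img[q][p] = round(top * (1 - v) + bottom * v)
    return img
```

Filling the text regions with values close to the surrounding background keeps the glyph outlines from being reported as edges by the subsequent edge detection step.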
S360, perform edge detection on the picture to obtain a segmentation map.
Edge detection is performed on the picture using the Canny operator to obtain a segmentation map containing the cut-off rules. Fig. 3c is a schematic diagram of a segmentation map in a text handling method provided by an embodiment of the present disclosure. As shown in Fig. 3c, the white lines in the figure are the extracted cut-off rules.
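To make the edge detection step concrete without depending on an image library, the sketch below uses a plain Sobel gradient-magnitude threshold as a simplified stand-in for the Canny operator named above; it omits Canny's smoothing, non-maximum suppression and hysteresis. The function name and the threshold value are assumptions. In practice an existing implementation such as OpenCV's `cv2.Canny` would likely be used instead.

```python
def sobel_edge_map(gray, threshold=100):
    """Return a map with 255 at pixels whose Sobel gradient magnitude
    exceeds threshold and 0 elsewhere (border pixels are left at 0)."""
    rows, cols = len(gray), len(gray[0])
    edges = [[0] * cols for _ in range(rows)]
    for q in range(1, rows - 1):
        for p in range(1, cols - 1):
            # Horizontal gradient: right column minus left column.
            gx = (gray[q - 1][p + 1] + 2 * gray[q][p + 1] + gray[q + 1][p + 1]
                  - gray[q - 1][p - 1] - 2 * gray[q][p - 1] - gray[q + 1][p - 1])
            # Vertical gradient: row below minus row above.
            gy = (gray[q + 1][p - 1] + 2 * gray[q + 1][p] + gray[q + 1][p + 1]
                  - gray[q - 1][p - 1] - 2 * gray[q - 1][p] - gray[q - 1][p + 1])
            if (gx * gx + gy * gy) ** 0.5 > threshold:
                edges[q][p] = 255
    return edges
```

The output has the same shape as the input and uses bright pixels for edges, which matches the description of white cut-off rules in the segmentation map.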
S370, calculate the distances between text lines to obtain an adjacency matrix.
The segmentation distance between text lines is calculated from the segmentation map and combined with the spatial distance between text lines to obtain the adjacency matrix.
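The description does not state how distances are mapped to adjacency matrix entries. The sketch below assumes a Gaussian kernel, a common choice for spectral clustering, so that a small target distance gives an affinity close to 1. The function name, the `distance` callback and the `sigma` parameter are hypothetical.

```python
import math


def adjacency_matrix(lines, distance, sigma=1.0):
    """Build a symmetric affinity matrix from pairwise target distances.

    distance(a, b) is any callable returning the target distance between
    two text lines; the Gaussian kernel is an assumption of this sketch.
    """
    n = len(lines)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = distance(lines[i], lines[j])
            # Mirror the value so the matrix is symmetric by construction.
            adj[i][j] = adj[j][i] = math.exp(-d * d / (2 * sigma * sigma))
    return adj
```

Computing each pair once and mirroring it keeps the matrix symmetric even when the underlying distance depends on which text line is treated as the upper one.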
S380, obtain the text line clustering result, i.e. the text blocks, using spectral clustering.
The text lines are clustered with spectral clustering based on the adjacency matrix to obtain a clustering result, and the text blocking result is determined from the clustering result. Fig. 3d is a schematic diagram of a text line clustering result in a text handling method provided by an embodiment of the present disclosure. As shown in Fig. 3d, the text lines are clustered into three classes: cluster set 1 contains text line H; cluster set 2 contains text line A, text line B and text line C; cluster set 3 contains text line D, text line G, text line F and text line E. Text line H therefore constitutes text block 1; text line A, text line B and text line C constitute text block 2; and text line D, text line G, text line F and text line E constitute text block 3.
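A full spectral clustering implementation needs an eigendecomposition and is beyond a short example. The Python sketch below therefore substitutes a much simpler technique, connected components of the thresholded affinity graph, purely to show how cluster labels are produced from an adjacency matrix. It is not the spectral clustering used in the embodiment, and the function name and `min_affinity` threshold are assumptions. A library routine such as scikit-learn's `SpectralClustering` with `affinity="precomputed"` would be a closer match to the described step.

```python
def cluster_by_components(adj, min_affinity=0.5):
    """Label text lines so that lines linked by a chain of affinities
    >= min_affinity share a label. Returns one integer label per line."""
    n = len(adj)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue  # already assigned to an earlier component
        labels[start] = next_label
        stack = [start]
        while stack:  # depth-first traversal of one component
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and adj[i][j] >= min_affinity:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels
```

When the affinity graph splits cleanly into disconnected groups, spectral clustering recovers the same groups; the two approaches differ mainly when groups are weakly connected.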
S390, end.
Fig. 3e is a schematic diagram of a to-be-blocked text in a text handling method provided by an embodiment of the present disclosure. Fig. 3f is a schematic diagram of the blocking result for the to-be-blocked text in a text handling method provided by an embodiment of the present disclosure. Fig. 3e and Fig. 3f illustrate the blocking effect of text blocking performed with the text handling method provided by the embodiments of the present disclosure. As shown in Fig. 3f, the first text block 301f, the second text block 302f, the third text block 303f, the fourth text block 304f, the fifth text block 305f, the sixth text block 306f, the seventh text block 307f, the eighth text block 308f, the ninth text block 309f, the tenth text block 310f and the eleventh text block 311f are the text blocks obtained by applying the text handling method provided by the embodiments of the present disclosure to the to-be-blocked text shown in Fig. 3e. It can be seen that the text blocking result obtained with the text handling method provided by the embodiments of the present disclosure has high accuracy.
The embodiments of the present disclosure use the information in the PDF document and process text regions and non-text regions separately to avoid mutual interference. An edge detection algorithm extracts information such as cut-off rules and pictures for text segmentation, the spatial distance and the segmentation distance between text lines are considered together, and the text blocks are obtained automatically with a clustering algorithm. A document can thus be divided into text blocks accurately without a large number of rules or training data; only a small number of parameters need to be determined.
Example IV
Fig. 4 is a structural schematic diagram of a text processing apparatus provided by an embodiment of the present disclosure. This embodiment is applicable to performing text blocking on PDF text. The text processing apparatus can be implemented in software and/or hardware; for example, it can be configured in a terminal device. As shown in Fig. 4, the text processing apparatus comprises a text line determining module 410, a target distance determining module 420 and a text block determining module 430, wherein:
The text line determining module 410 is configured to obtain the text location information contained in the to-be-blocked text, and to determine at least one text line and the text line location information of the text line according to the text location information;
The target distance determining module 420 is configured to determine the cut-off rule information contained in the to-be-blocked text, and to determine the target distance between text lines according to the text line location information and the cut-off rule information;
The text block determining module 430 is configured to cluster the text lines according to the target distance, and to determine at least one text block of the to-be-blocked text according to the clustering result of the text lines.
In this embodiment of the present disclosure, the text line determining module obtains the text location information contained in the to-be-blocked text and determines at least one text line and its text line location information according to the text location information; the target distance determining module determines the cut-off rule information contained in the to-be-blocked text and determines the target distance between text lines according to the text line location information and the cut-off rule information; and the text block determining module clusters the text lines according to the target distance and determines at least one text block of the to-be-blocked text according to the clustering result. The to-be-blocked text is thus blocked according to the text location information and the cut-off rule information, which simplifies the text blocking process and improves the accuracy of the text blocking result.
Optionally, on the basis of the above technical solution, the target distance determining module 420 comprises:
A spatial distance determination unit, configured to determine the spatial distance between text lines according to the text line location information;
A segmentation distance determination unit, configured to determine the segmentation distance between text lines according to the text line location information and the cut-off rule information, the segmentation distance being the number of cut-points existing between the text lines;
A target distance determination unit, configured to determine the target distance between text lines according to the spatial distance and the segmentation distance.
Optionally, on the basis of the above technical solution, the segmentation distance determination unit is specifically configured to:
Determine a cut-point identification range according to the text line location information;
Obtain the pixel values of the pixels in the cut-point identification range, and take the number of pixels in the cut-point identification range whose pixel value is greater than a set threshold as the segmentation distance.
Optionally, on the basis of the above technical solution, the target distance determination unit is specifically configured to:
Perform weighted summation of the spatial distance and the segmentation distance to obtain the target distance.
Optionally, on the basis of the above technical solution, the target distance determining module 420 comprises a segmentation information detection unit, configured to:
Convert the regions of the to-be-blocked text other than the text lines into picture format, and convert the resulting picture to grayscale to obtain a grayscale picture;
Fill the pixel values of the pixels in the region of the grayscale picture that corresponds to the text location information of the to-be-blocked text, to obtain a picture to be detected;
Perform edge detection on the picture to be detected with an edge detection algorithm, and take the detected edge information as the cut-off rule information.
Optionally, on the basis of the above technical solution, the text block determining module 430 is specifically configured to:
Determine the adjacency matrix for text line clustering according to the target distances between text lines;
Cluster the text lines based on the adjacency matrix, and determine the text block location information corresponding to each category according to the location information of the text lines in that category.
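The last step above, deriving a text block's location from the text lines of one category, can be sketched as a bounding-box union. The function name `text_blocks`, the tuple layout `(x, y, w, h)` and the use of a bounding box as the block location are assumptions of this sketch.

```python
def text_blocks(lines, labels):
    """Map each cluster label to the bounding box (x, y, w, h) of its text lines."""
    corners = {}  # label -> (x0, y0, x1, y1)
    for (x, y, w, h), label in zip(lines, labels):
        if label in corners:
            x0, y0, x1, y1 = corners[label]
            corners[label] = (min(x0, x), min(y0, y), max(x1, x + w), max(y1, y + h))
        else:
            corners[label] = (x, y, x + w, y + h)
    # Convert corner form back to (x, y, w, h).
    return {
        label: (x0, y0, x1 - x0, y1 - y0)
        for label, (x0, y0, x1, y1) in corners.items()
    }
```

Each resulting box covers every text line assigned to that category, so the block location can be reported with the same four parameters used for text lines.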
Optionally, on the basis of the above technical solution, the text location information includes text coordinates, and the text line determining module 410 is specifically configured to:
Take characters whose abscissas are continuous and whose ordinates are identical as one text line, and determine the text line location information of the text line according to the text location information of the characters in the text line.
Text-processing side provided by the embodiment of the present disclosure can be performed in text processing apparatus provided by the embodiment of the present disclosure Method has the corresponding functional module of execution method and beneficial effect.
It is worth noting that, each unit included by above-mentioned apparatus and module are only divided according to function logic , but be not limited to the above division, as long as corresponding functions can be realized;In addition, the specific name of each functional unit Title is also only for convenience of distinguishing each other, and is not limited to the protection scope of the embodiment of the present disclosure.
Embodiment five
Referring now to Fig. 5, it shows a structural schematic diagram of a terminal device 500 suitable for implementing the embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, devices such as a mobile phone, a laptop, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable media player) and a vehicle-mounted terminal (such as a vehicle navigation terminal). The terminal device shown in Fig. 5 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in figure 5, terminal device 500 may include processing unit (such as central processing unit, graphics processor etc.) 501, random access can be loaded into according to the program being stored in read-only memory (ROM) 502 or from storage device 506 Program in memory (RAM) 503 and execute various movements appropriate and processing.In RAM 503, it is also stored with terminal device Various programs and data needed for 500 operations.Processing unit 501, ROM 502 and RAM 503 pass through the phase each other of bus 504 Even.Input/output (I/O) interface 505 is also connected to bus 504.
In general, following device can connect to I/O interface 505: including such as touch screen, touch tablet, keyboard, mouse, taking the photograph As the input unit 506 of head, microphone, accelerometer, gyroscope etc.;Including such as liquid crystal display (LCD), loudspeaker, vibration The output device 507 of dynamic device etc.;Storage device 506 including such as tape, hard disk etc.;And communication device 509.Communication device 509, which can permit terminal device 500, is wirelessly or non-wirelessly communicated with other equipment to exchange data.Although Fig. 5 shows tool There is the terminal device 500 of various devices, it should be understood that being not required for implementing or having all devices shown.It can be with Alternatively implement or have more or fewer devices.
Particularly, in accordance with an embodiment of the present disclosure, it may be implemented as computer above with reference to the process of flow chart description Software program.For example, embodiment of the disclosure includes a kind of computer program product comprising being carried on non-transient computer can The computer program on medium is read, which includes the program code for method shown in execution flow chart.At this In the embodiment of sample, which can be downloaded and installed from network by communication device 509, or be filled from storage It sets 506 to be mounted, or is mounted from ROM 502.When the computer program is executed by processing unit 501, the disclosure is executed The above-mentioned function of being limited in the method for embodiment.
It should be noted that the computer-readable medium of the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in connection with an instruction execution system, apparatus or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and it can send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device. Program code contained on a computer-readable medium may be transmitted with any suitable medium, including but not limited to: electric wire, optical cable, RF (radio frequency), etc., or any suitable combination of the above.
In some embodiments, the client and the server can communicate using any currently known or future-developed network protocol such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (for example, a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (for example, the Internet) and peer-to-peer networks (for example, ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
Above-mentioned computer-readable medium can be included in above-mentioned terminal device;It is also possible to individualism, and not It is fitted into the terminal device.
Above-mentioned computer-readable medium carries one or more program, when said one or multiple programs are by the end When end equipment executes, so that the terminal device:
The text location to include in piecemeal text is obtained, at least one text is determined according to the text location The line of text location information of current row and the line of text;
The cut-off rule information to include in piecemeal text is determined, according to the line of text location information and described point Secant information determines the target range between line of text;
The line of text is clustered according to the target range, according to the determination of the cluster result of the line of text At least one text block to piecemeal text.
Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Flow chart and block diagram in attached drawing, illustrate according to the method, apparatus of the various embodiments of the disclosure, terminal device and The architecture, function and operation in the cards of computer program product.In this regard, each side in flowchart or block diagram Frame can represent a part of a module, program segment or code, and a part of the module, program segment or code includes one Or multiple executable instructions for implementing the specified logical function.It should also be noted that in some implementations as replacements, side The function of being marked in frame can also occur in a different order than that indicated in the drawings.For example, two sides succeedingly indicated Frame can actually be basically executed in parallel, they can also be executed in the opposite order sometimes, this according to related function and It is fixed.It is also noted that the group of each box in block diagram and or flow chart and the box in block diagram and or flow chart It closes, can be realized with the dedicated hardware based system for executing defined functions or operations, or specialized hardware can be used Combination with computer instruction is realized.
Being described in the embodiment of the present disclosure involved module and unit can be realized by way of software, can also be with It is realized by way of hardware.Wherein, module or the title of unit do not constitute under certain conditions to the module or The restriction of unit itself, for example, line of text determining module is also described as " obtaining the text position to include in piecemeal text Confidence breath, determines the line of text location information of at least one line of text and the line of text according to the text location Module ".
Function described herein can be executed at least partly by one or more hardware logic components.Example Such as, without limitation, the hardware logic component for the exemplary type that can be used include: field programmable gate array (FPGA), specially With integrated circuit (ASIC), Application Specific Standard Product (ASSP), system on chip (SOC), complex programmable logic equipment (CPLD) etc. Deng.
In the context of the disclosure, machine readable media can be tangible medium, may include or is stored for The program that instruction execution system, device or equipment are used or is used in combination with instruction execution system, device or equipment.Machine can Reading medium can be machine-readable signal medium or machine-readable storage medium.Machine readable media can include but is not limited to electricity Son, magnetic, optical, electromagnetism, infrared or semiconductor system, device or equipment or above content any conjunction Suitable combination.The more specific example of machine readable storage medium will include the electrical connection of line based on one or more, portable meter Calculation machine disk, hard disk, random access memory (RAM), read-only memory (ROM), Erasable Programmable Read Only Memory EPROM (EPROM Or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage facilities or Any appropriate combination of above content.
According to one or more other embodiments of the present disclosure, example one provides a kind of text handling method, comprising:
The text location to include in piecemeal text is obtained, at least one text is determined according to the text location The line of text location information of current row and the line of text;
The cut-off rule information to include in piecemeal text is determined, according to the line of text location information and described point Secant information determines the target range between line of text;
The line of text is clustered according to the target range, according to the determination of the cluster result of the line of text At least one text block to piecemeal text.
According to one or more other embodiments of the present disclosure, example two provides a kind of text handling method, in example one On the basis of text handling method, it is described according to the line of text location information and the cut-off rule information determine line of text it Between target range, comprising:
The space length between line of text is determined according to the line of text location information;
The segmentation distance between line of text is determined according to the line of text location information and the cut-off rule information, it is described Segmentation distance existing cut-point quantity between the line of text;
The target range between line of text is determined according to the space length and the segmentation distance.
According to one or more other embodiments of the present disclosure, example three provides a kind of text handling method, in example two On the basis of text handling method, it is described according to the line of text location information and the cut-off rule information determine line of text it Between segmentation distance, comprising:
Cut-point identification range is determined according to the line of text location information;
The pixel value for obtaining pixel in the cut-point identification range, by pixel in the cut-point identification range Pixel value is greater than the pixel number of given threshold as the segmentation distance.
According to one or more other embodiments of the present disclosure, example four provides a kind of text handling method, in example two On the basis of text handling method, the target according to the space length and between the determining line of text of segmentation distance Distance, comprising:
The space length and the segmentation distance are weighted summation, obtain the target range.
According to one or more other embodiments of the present disclosure, example five provides a kind of text handling method, in example one Cut-off rule information on the basis of text handling method, described in the determination to include in piecemeal text, comprising:
Converting the regions of the to-be-blocked text other than the text lines into picture format, and converting the resulting picture to grayscale to obtain a grayscale picture;
Filling the pixel values of the pixels in the region of the grayscale picture that corresponds to the text location information of the to-be-blocked text, to obtain a picture to be detected;
Performing edge detection on the picture to be detected with an edge detection algorithm, and taking the detected edge information as the cut-off rule information.
According to one or more other embodiments of the present disclosure, example six provides a kind of text handling method, in example one It is described to be clustered the line of text according to the target range on the basis of text handling method, according to the line of text Cluster result determine described at least one text block to piecemeal text, comprising:
Determine that line of text clusters corresponding adjacency matrix according to the target range between line of text;
Line of text is clustered based on the adjacency matrix, according to the determination of same category of line of text location information The corresponding text block location information of classification.
According to one or more other embodiments of the present disclosure, example seven provides a kind of text handling method, in example one On the basis of text handling method, the text location includes text coordinate, described true according to the text location The line of text location information of at least one fixed line of text and the line of text, comprising:
Taking characters whose abscissas are continuous and whose ordinates are identical as one text line, and determining the text line location information of the text line according to the text location information of the characters in the text line.
According to one or more other embodiments of the present disclosure, example eight provides a kind of text processing apparatus, comprising:
Line of text determining module, for obtaining the text location to include in piecemeal text, according to the text position Confidence ceases the line of text location information for determining at least one line of text and the line of text;
Target range determining module, it is described to include in piecemeal text for being determined according to the line of text location information Cut-off rule information determines the target range between line of text according to the line of text location information and the cut-off rule information;
Text block determining module, for being clustered the line of text according to the target range, according to the text Capable cluster result determines described at least one text block to piecemeal text.
According to one or more other embodiments of the present disclosure, example nine provides a kind of terminal device, comprising:
One or more processing units;
Storage device, for storing one or more programs;
When one or more of programs are executed by one or more of processing units, so that one or more of places Manage text handling method of the device realization as described in any in example one to seven.
According to one or more other embodiments of the present disclosure, example ten provides a kind of computer readable storage medium, thereon It is stored with computer program, is realized when which is executed by processor at the text as described in any in example one to seven Reason method.
Above description is only the preferred embodiment of the disclosure and the explanation to institute's application technology principle.Those skilled in the art Member is it should be appreciated that the open scope involved in the disclosure, however it is not limited to technology made of the specific combination of above-mentioned technical characteristic Scheme, while should also cover in the case where not departing from design disclosed above, it is carried out by above-mentioned technical characteristic or its equivalent feature Any combination and the other technical solutions formed.Such as features described above has similar function with (but being not limited to) disclosed in the disclosure Can technical characteristic replaced mutually and the technical solution that is formed.
In addition, although the operations are depicted in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Similarly, although several specific implementation details are contained in the above discussion, these should not be construed as limitations on the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable sub-combination.
Although having used specific to this theme of the language description of structure feature and/or method logical action, answer When understanding that theme defined in the appended claims is not necessarily limited to special characteristic described above or movement.On on the contrary, Special characteristic described in face and movement are only to realize the exemplary forms of claims.

Claims (10)

1. A text processing method, characterized by comprising:
obtaining text position information of the text contained in a text to be partitioned, and determining at least one text line and text line position information of the text line according to the text position information;
determining dividing line information contained in the text to be partitioned, and determining a target distance between text lines according to the text line position information and the dividing line information;
clustering the text lines according to the target distance, and determining at least one text block of the text to be partitioned according to the clustering result of the text lines.
2. The method according to claim 1, characterized in that determining the target distance between text lines according to the text line position information and the dividing line information comprises:
determining a spatial distance between text lines according to the text line position information;
determining a segmentation distance between text lines according to the text line position information and the dividing line information, the segmentation distance being the number of dividing points present between the text lines;
determining the target distance between text lines according to the spatial distance and the segmentation distance.
3. The method according to claim 2, characterized in that determining the segmentation distance between text lines according to the text line position information and the dividing line information comprises:
determining a dividing point recognition range according to the text line position information;
obtaining the pixel values of the pixels within the dividing point recognition range, and taking the number of pixels within the dividing point recognition range whose pixel value is greater than a set threshold as the segmentation distance.
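The pixel count described in claim 3 can be sketched as follows. This is only an illustration under stated assumptions: the edge picture is a list of rows of 0-255 integers, boxes are `(x, y, w, h)` tuples, and the recognition range is taken to be the band between two vertically stacked lines. The names `gap_region` and `segmentation_distance` and the default threshold of 128 are hypothetical and do not come from the patent.

```python
def gap_region(upper, lower):
    """Hypothetical recognition range: the band between two stacked line boxes.

    Boxes are (x, y, w, h); the result is a half-open rectangle (x0, y0, x1, y1).
    """
    ux, uy, uw, uh = upper
    lx, ly, lw, lh = lower
    return (min(ux, lx), uy + uh, max(ux + uw, lx + lw), ly)


def segmentation_distance(edge_picture, region, threshold=128):
    """Count the pixels inside region whose value is greater than threshold."""
    x0, y0, x1, y1 = region
    return sum(
        1
        for row in edge_picture[y0:y1]
        for value in row[x0:x1]
        if value > threshold
    )


# A 6x8 picture whose fourth row (index 3) is a detected horizontal rule.
picture = [[0] * 8 for _ in range(6)]
picture[3] = [255] * 8
region = gap_region((0, 0, 8, 2), (0, 4, 8, 2))
print(segmentation_distance(picture, region))  # 8
```

Two lines separated by a ruled line therefore get a large segmentation distance, while two lines with only white space between them get zero.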
4. The method according to claim 2, characterized in that determining the target distance between text lines according to the spatial distance and the segmentation distance comprises:
performing a weighted summation of the spatial distance and the segmentation distance to obtain the target distance.
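A minimal sketch of the weighted summation in claim 4 follows. The claim does not say how the spatial distance is measured or what the weights are, so the centre-to-centre Euclidean distance and the weights `w_spatial` and `w_segmentation` below are assumptions made purely for illustration.

```python
import math


def spatial_distance(a, b):
    """Assumed metric: Euclidean distance between the centres of boxes (x, y, w, h)."""
    ax, ay = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx, by = b[0] + b[2] / 2, b[1] + b[3] / 2
    return math.hypot(ax - bx, ay - by)


def target_distance(spatial, segmentation, w_spatial=1.0, w_segmentation=0.5):
    """Weighted sum of the two distances; the weights are illustrative only."""
    return w_spatial * spatial + w_segmentation * segmentation


d_spatial = spatial_distance((0, 0, 10, 10), (0, 30, 10, 10))
print(d_spatial)                      # 30.0
print(target_distance(d_spatial, 8))  # 34.0
```

Raising `w_segmentation` makes a ruled line between two lines of text count for more than plain white space of the same height.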
5. The method according to claim 1, characterized in that determining the dividing line information contained in the text to be partitioned comprises:
converting the regions of the text to be partitioned other than the text lines into picture format, and converting the resulting picture to grayscale to obtain a grayscale picture;
filling the pixel values of the pixels in the region of the grayscale picture that corresponds to the position information of the text to be partitioned, to obtain a picture to be detected;
performing edge detection on the picture to be detected with an edge detection algorithm, and taking the detected edge information as the dividing line information.
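The masking and edge detection steps of claim 5 might look roughly like the sketch below. The claim does not name an edge detection algorithm, so a simple neighbour-difference detector stands in for it here. The function names, the fill value of 255 and the `min_step` threshold are all hypothetical, and the picture is again a list of rows of 0-255 integers.

```python
def mask_text(gray, text_boxes, fill=255):
    """Return a copy of gray with every text box (x, y, w, h) overwritten by fill."""
    out = [row[:] for row in gray]
    for x, y, w, h in text_boxes:
        for r in range(y, min(y + h, len(out))):
            for c in range(x, min(x + w, len(out[r]))):
                out[r][c] = fill
    return out


def detect_edges(gray, min_step=64):
    """Mark pixels that differ from their right or lower neighbour by at least min_step.

    This is a stand-in detector, not the algorithm used in the patent.
    """
    rows, cols = len(gray), len(gray[0])
    edges = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            right = abs(gray[r][c] - gray[r][c + 1]) if c + 1 < cols else 0
            down = abs(gray[r][c] - gray[r + 1][c]) if r + 1 < rows else 0
            if max(right, down) >= min_step:
                edges[r][c] = 255
    return edges


# White page with dark glyph pixels in row 0 and a ruled line in row 3.
page = [[255] * 4 for _ in range(5)]
page[0][0:3] = [0, 0, 0]
page[3] = [0] * 4
edges = detect_edges(mask_text(page, [(0, 0, 3, 1)]))
print(edges[0])  # [0, 0, 0, 0]
print(edges[3])  # [255, 255, 255, 255]
```

Filling the text regions first keeps glyph strokes from being reported as edges, so only rules, borders and similar separators survive as dividing line information.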
6. The method according to claim 1, characterized in that clustering the text lines according to the target distance and determining at least one text block of the text to be partitioned according to the clustering result of the text lines comprises:
determining an adjacency matrix for text line clustering according to the target distance between text lines;
clustering the text lines based on the adjacency matrix, and determining the text block position information corresponding to each category according to the text line position information of the text lines in that category.
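One way to picture claim 6 is the sketch below. It assumes the adjacency matrix is obtained by thresholding the pairwise distances and that clusters are the connected components of the resulting graph. The claim fixes neither choice, so `max_dist`, the function names and the connected-components clustering are assumptions rather than the patented method.

```python
def adjacency_matrix(dist, max_dist):
    """Link two text lines when their distance is at most max_dist (assumed rule)."""
    n = len(dist)
    return [
        [1 if i != j and dist[i][j] <= max_dist else 0 for j in range(n)]
        for i in range(n)
    ]


def connected_components(adj):
    """Label each node with the index of its connected component."""
    n = len(adj)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if adj[i][j] and labels[j] == -1:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


def block_boxes(lines, labels):
    """Bounding box (x0, y0, x1, y1) of each cluster of line boxes (x, y, w, h)."""
    boxes = {}
    for (x, y, w, h), label in zip(lines, labels):
        x0, y0, x1, y1 = boxes.get(label, (x, y, x + w, y + h))
        boxes[label] = (min(x0, x), min(y0, y), max(x1, x + w), max(y1, y + h))
    return boxes


lines = [(0, 0, 100, 10), (0, 12, 100, 10), (0, 60, 100, 10)]
dist = [[0, 12, 60], [12, 0, 48], [60, 48, 0]]
labels = connected_components(adjacency_matrix(dist, max_dist=20))
print(labels)                      # [0, 0, 1]
print(block_boxes(lines, labels))  # {0: (0, 0, 100, 22), 1: (0, 60, 100, 70)}
```

The block position is taken here as the bounding box of the member lines, which is one straightforward reading of deriving text block position information from the text lines of the same category.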
7. The method according to claim 1, characterized in that the text position information comprises text coordinates, and determining at least one text line and the text line position information of the text line according to the text position information comprises:
taking text whose abscissas are continuous and whose ordinates are identical as one text line, and determining the text line position information of the text line according to the text position information of the text in the text line.
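The grouping rule in claim 7 can be illustrated with a short sketch. It assumes each character is given as an `(x, y, w, h)` box and treats abscissas as continuous when the horizontal gap to the previous character is at most `gap_tol`. That tolerance and the function name are hypothetical.

```python
def group_into_lines(chars, gap_tol=1.0):
    """Merge character boxes with equal y and touching x ranges into line boxes."""
    lines = []
    for x, y, w, h in sorted(chars, key=lambda b: (b[1], b[0])):
        if lines:
            lx, ly, lw, lh = lines[-1]
            if ly == y and x - (lx + lw) <= gap_tol:
                # Extend the current line box to cover this character.
                right = max(lx + lw, x + w)
                lines[-1] = (lx, ly, right - lx, max(lh, h))
                continue
        lines.append((x, y, w, h))
    return lines


chars = [
    (0, 0, 10, 12), (10, 0, 10, 12), (20, 0, 10, 12),  # one continuous run
    (0, 20, 10, 12), (50, 20, 10, 12),                 # same y, but a wide gap
]
print(group_into_lines(chars))
# [(0, 0, 30, 12), (0, 20, 10, 12), (50, 20, 10, 12)]
```

The line box is the union of its character boxes, so the text line position information follows directly from the positions of the text it contains.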
8. A text processing apparatus, characterized by comprising:
a text line determining module, configured to obtain text position information of the text contained in a text to be partitioned, and to determine at least one text line and text line position information of the text line according to the text position information;
a target distance determining module, configured to determine, according to the text line position information, dividing line information contained in the text to be partitioned, and to determine a target distance between text lines according to the text line position information and the dividing line information;
a text block determining module, configured to cluster the text lines according to the target distance, and to determine at least one text block of the text to be partitioned according to the clustering result of the text lines.
9. A terminal device, characterized in that the terminal device comprises:
one or more processing devices;
a storage device for storing one or more programs;
wherein, when the one or more programs are executed by the one or more processing devices, the one or more processing devices implement the text processing method according to any one of claims 1-7.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the text processing method according to any one of claims 1-7.
CN201910734656.9A 2019-08-09 2019-08-09 Text processing method, device, equipment and storage medium Active CN110442719B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910734656.9A CN110442719B (en) 2019-08-09 2019-08-09 Text processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910734656.9A CN110442719B (en) 2019-08-09 2019-08-09 Text processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110442719A true CN110442719A (en) 2019-11-12
CN110442719B CN110442719B (en) 2022-03-04

Family

ID=68434244

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910734656.9A Active CN110442719B (en) 2019-08-09 2019-08-09 Text processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110442719B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120102388A1 (en) * 2010-10-26 2012-04-26 Jian Fan Text segmentation of a document
CN107832756A (en) * 2017-10-24 2018-03-23 讯飞智元信息科技有限公司 Express delivery list information extracting method and device, storage medium, electronic equipment
US20190205362A1 (en) * 2017-12-29 2019-07-04 Konica Minolta Laboratory U.S.A., Inc. Method for inferring blocks of text in electronic documents

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
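Zhang Chong et al.: "Chinese layout segmentation method based on minimum spanning tree clustering", Computer Engineering (《计算机工程》) *
Lu Songfeng et al.: "Web page blocking algorithm for mobile devices", Journal of Chinese Computer Systems (《小型微型计算机系统》) *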

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680491A (en) * 2020-05-27 2020-09-18 北京字节跳动科技有限公司 Document information extraction method and device and electronic equipment
CN111680491B (en) * 2020-05-27 2024-02-02 北京字跳网络技术有限公司 Method and device for extracting document information and electronic equipment
CN113177959A (en) * 2021-05-21 2021-07-27 广州普华灵动机器人技术有限公司 QR code real-time extraction algorithm in rapid movement process
CN113177959B (en) * 2021-05-21 2022-05-03 广州普华灵动机器人技术有限公司 QR code real-time extraction method in rapid movement process

Also Published As

Publication number Publication date
CN110442719B (en) 2022-03-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant