CN113886582A - Document processing method and device, and data extraction method and device for image - Google Patents

Document processing method and device, and data extraction method and device for image

Info

Publication number
CN113886582A
CN113886582A
Authority
CN
China
Prior art keywords
sub
pixel
generating
position information
surrounding frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111156200.2A
Other languages
Chinese (zh)
Inventor
黄海平 (Huang Haiping)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111156200.2A
Publication of CN113886582A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Character Input (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The disclosure provides a document processing method, and relates to the field of computer technology, in particular to the field of document processing technology. The specific implementation scheme is as follows: generating a plurality of first bounding boxes according to the position information of the line text images in a document page; generating a plurality of second bounding boxes according to the position information of the first bounding boxes, wherein each second bounding box is used for marking a text sparse region in the document page; performing a merging operation on the adjacent second bounding boxes to obtain a plurality of candidate bounding boxes; for the plurality of candidate bounding boxes, determining a plurality of partial images of the document page according to the position information of each candidate bounding box; and generating a target image according to the content in the plurality of partial images. The disclosure also provides a document processing device, a data extraction method and device for an image, an electronic device, and a storage medium.

Description

Document processing method and device, and data extraction method and device for image
Technical Field
The present disclosure relates to the field of computer technology, and more particularly, to the field of document processing technology. More specifically, the present disclosure provides a document processing method and apparatus, a data extraction method and apparatus for an image, an electronic device, and a storage medium.
Background
A document may contain one or more charts. The data of these charts may be unstructured data, such as pictures, background pictures, and the like. In the related art, charts in a document can be manually cropped, and feature points (e.g., coordinate axis origins, tick mark end points, etc.) and data values in the charts can be visually inspected so as to extract structured data from the charts.
Disclosure of Invention
The disclosure provides a document processing method and device, a data extraction method and device for an image, an electronic device and a storage medium.
According to a first aspect, there is provided a document processing method, the method comprising: generating a plurality of first bounding boxes according to the position information of the line text images in the document page; generating a plurality of second bounding boxes according to the position information of the plurality of first bounding boxes, wherein each second bounding box is used for marking a text sparse region in the document page; performing a merging operation on the adjacent second bounding boxes to obtain a plurality of candidate bounding boxes; for the plurality of candidate bounding boxes, determining a plurality of partial images of the document page according to the position information of each candidate bounding box; and generating a target image according to the contents in the plurality of partial images.
According to a second aspect, there is provided a data extraction method for an image, the method comprising: determining the coordinates of N marker points located on a coordinate axis in a target image according to the pixel value of each pixel in the target image; performing a dividing operation on the target image according to the coordinates of the N marker points to obtain N+1 sub-regions; and performing a text recognition operation on the ith sub-region of the N+1 sub-regions to obtain the ith group of data corresponding to the ith sub-region, where i = 1, ..., N+1; wherein the target image is generated according to the document processing method provided by the present disclosure.
According to a third aspect, there is provided a document processing apparatus, comprising: a first generating module, configured to generate a plurality of first bounding boxes according to the position information of the line text images in the document page; a second generating module, configured to generate a plurality of second bounding boxes according to the position information of the plurality of first bounding boxes, where each second bounding box is used to mark a text sparse region in the document page; a merging module, configured to perform a merging operation on the adjacent second bounding boxes to obtain a plurality of candidate bounding boxes; a first determining module, configured to determine, for the plurality of candidate bounding boxes, a plurality of partial images of the document page according to the position information of each candidate bounding box; and a third generating module, configured to generate a target image according to the contents in the plurality of partial images.
According to a fourth aspect, there is provided a data extraction apparatus for an image, the apparatus comprising: a second determining module, configured to determine the coordinates of N marker points located on a coordinate axis in a target image according to the pixel value of each pixel in the target image; a dividing module, configured to perform a dividing operation on the target image according to the coordinates of the N marker points to obtain N+1 sub-regions; and a text recognition module, configured to perform a text recognition operation on the ith sub-region of the N+1 sub-regions to obtain the ith group of data corresponding to the ith sub-region, where i = 1, ..., N+1; wherein the target image is generated by the document processing apparatus provided by the present disclosure.
According to a fifth aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method provided in accordance with the present disclosure.
According to a sixth aspect, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method provided in accordance with the present disclosure.
According to a seventh aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method provided according to the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of an exemplary system architecture to which document processing methods and apparatus and data extraction methods for images may be applied, according to one embodiment of the present disclosure;
FIG. 2 is a flow diagram of a document processing method according to one embodiment of the present disclosure;
FIG. 3A to FIG. 3B are schematic diagrams of a document processing method according to one embodiment of the present disclosure;
FIG. 4A to FIG. 4C are schematic diagrams of a document processing method according to one embodiment of the present disclosure;
FIG. 5 is a flow diagram of a data extraction method for an image according to one embodiment of the present disclosure;
FIG. 6A to FIG. 6B are schematic diagrams of a data extraction method for an image according to one embodiment of the present disclosure;
FIG. 7 is a block diagram of a document processing device according to one embodiment of the present disclosure;
FIG. 8 is a block diagram of a data extraction device for an image according to one embodiment of the present disclosure; and
FIG. 9 is a block diagram of an electronic device for a document processing method and/or a data extraction method for an image according to one embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the example method described above, the structured data of charts in a document is extracted manually; the cost is high and the accuracy of the extracted data is low.
FIG. 1 is a schematic diagram of an exemplary system architecture to which a document processing method and apparatus and/or a data extraction method and apparatus for images may be applied, according to one embodiment of the present disclosure. It should be noted that FIG. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied, intended to help those skilled in the art understand the technical content of the present disclosure; it does not imply that the embodiments of the present disclosure cannot be applied to other devices, systems, environments, or scenarios.
As shown in fig. 1, the system architecture 100 according to this embodiment may include a plurality of terminal devices 101, a network 102, and a server 103. Network 102 is the medium used to provide communication links between terminal devices 101 and server 103. Network 102 may include various connection types, such as wired and/or wireless communication links, and so forth.
A user may use terminal device 101 to interact with server 103 over network 102 to receive or send messages and the like. Terminal device 101 may be a variety of electronic devices including, but not limited to, a smart phone, a tablet computer, a laptop portable computer, and the like.
At least one of the document processing method and the data extraction method for the image provided by the embodiment of the present disclosure may be generally performed by the server 103. Accordingly, the document processing apparatus and the data extraction apparatus for an image provided by the embodiments of the present disclosure may be generally disposed in the server 103. The document processing method and the data extraction method for the image provided by the embodiment of the present disclosure may also be executed by a server or a server cluster that is different from the server 103 and is capable of communicating with the terminal device 101 and/or the server 103. Accordingly, the document processing apparatus and the data extraction apparatus for the image provided by the embodiment of the present disclosure may also be provided in a server or a server cluster different from the server 103 and capable of communicating with the terminal device 101 and/or the server 103.
FIG. 2 is a flow diagram of a document processing method according to one embodiment of the present disclosure.
As shown in FIG. 2, the document processing method 200 may include operations S210 to S250.
In operation S210, a plurality of first bounding boxes are generated according to the position information of the line text images in the document page.
In the embodiment of the present disclosure, the document may be a PDF (Portable Document Format) document.
For example, the document may be a searchable PDF document or a PDF document created by other editing class applications.
For example, the document may be a PDF document containing only images. In one example, a PDF document containing only images may be created by a scanning operation.
For example, the document may be a DOC document. In one example, each page of a DOC document may be represented graphically, for example as a result of data corruption.
In the embodiment of the present disclosure, the position information of a line text image can be obtained according to the position information of each text image in the line.
For example, a PDF document may be parsed by open source software such as XPDFReader to obtain the position information of each text image in the PDF document. In one example, a searchable PDF document, or a PDF document created by other editing-class applications, may be parsed by open source software such as XPDFReader to obtain the position information of each text image. In another example, OCR (optical character recognition) may be performed on a PDF document containing only images to create a text layer in that document, after which the position information of each text image may be obtained using open source software such as XPDFReader.
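For illustration only, per-character position information might be collected as follows. The disclosure names XPDFReader as the parsing tool; this sketch substitutes pdfminer.six as a comparable open-source parser, and the function name char_boxes is our own:

```python
# Sketch only: pdfminer.six stands in for XPDFReader; char_boxes is our name.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextBox, LTTextLine

def char_boxes(pdf_path):
    """Yield (page_index, character, (x0, y0, x1, y1)) for each text image.

    pdfminer reports coordinates from the bottom-left corner of the page;
    the disclosure measures from the top-left vertex, so y values may need
    to be flipped against the page height.
    """
    for page_index, page_layout in enumerate(extract_pages(pdf_path)):
        for element in page_layout:
            if not isinstance(element, LTTextBox):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    if isinstance(obj, LTChar):
                        yield page_index, obj.get_text(), obj.bbox
```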
For example, each text image in a PDF document has position information relative to the top-left vertex of the page on which it is located, such as coordinates (e.g., the top-left vertex coordinates of the text image), a height, and a width. Fourth sub-bounding boxes are generated according to the ordinate values of the text images, where the ordinate values of the text images within each fourth sub-bounding box are the same or similar. Then, according to the ordinate values of the fourth sub-bounding boxes (e.g., the average of the ordinate values of the text images in each box), the spacings (e.g., edge-to-edge distances) between the fourth sub-bounding boxes within a preset range are calculated. A fifth sub-bounding box is generated according to the spacings of the fourth sub-bounding boxes; the fifth sub-bounding box contains at least one fourth sub-bounding box, and the spacing between the contained fourth sub-bounding boxes is smaller than a preset distance threshold. The position information of the line text image is obtained according to the position information of the fifth sub-bounding box, that is, according to the position information of each text image in the line. A first bounding box is then generated according to the position information of the line text image. The first bounding box contains at least one fifth sub-bounding box, and the ordinate values of the text images within the first bounding box are the same or differ within a preset difference range.
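A minimal sketch of this line grouping, collapsing the staged fourth and fifth sub-bounding box construction into a single pass; the thresholds y_tol and gap_max stand in for the preset difference range and the preset distance threshold and are assumptions of this sketch:

```python
def line_boxes(char_boxes, y_tol=2.0, gap_max=6.0):
    """Group character boxes (x0, y0, x1, y1) into line boxes.

    Characters whose ordinates agree within y_tol are bucketed together
    (the fourth sub-bounding boxes); each bucket is then swept left to
    right, and fragments whose horizontal spacing is below gap_max are
    merged (the fifth sub-bounding boxes, i.e., the line boxes).
    """
    rows = {}
    for x0, y0, x1, y1 in char_boxes:
        rows.setdefault(round(y0 / y_tol), []).append((x0, y0, x1, y1))
    lines = []
    for boxes in rows.values():
        boxes.sort()                      # left to right
        cur = list(boxes[0])
        for x0, y0, x1, y1 in boxes[1:]:
            if x0 - cur[2] <= gap_max:    # small gap: same line fragment
                cur[1] = min(cur[1], y0)
                cur[2] = max(cur[2], x1)
                cur[3] = max(cur[3], y1)
            else:                         # large gap: start a new fragment
                lines.append(tuple(cur))
                cur = [x0, y0, x1, y1]
        lines.append(tuple(cur))
    return lines
```

In a two-column page, the large horizontal gap between the columns keeps the two fragments of a row separate, so each column yields its own first bounding boxes.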
In operation S220, a plurality of second bounding boxes, each for marking one text sparse region in a document page, are generated according to the position information of the plurality of first bounding boxes.
In the embodiment of the present disclosure, a text sparse region may be a blank region containing no text, or a region in which the amount of text is less than a preset number threshold.
For example, a text sparse region may be a region in a document where a chart is located.
In the embodiment of the present disclosure, the second bounding box includes a first sub-bounding box and a second sub-bounding box.
In the embodiment of the present disclosure, one first sub-bounding box is generated between any two vertically adjacent first bounding boxes. The width of the first sub-bounding box may be the width of the page.
For example, the first bounding boxes are generated based on the position information of the line text images, and blank regions remain between the line text images. Such a blank region may be a text sparse region and may be marked with a first sub-bounding box.
In the embodiment of the present disclosure, one second sub-bounding box is generated on the left side and/or the right side of each first bounding box. The width of the second sub-bounding box is the length from the edge of the first bounding box to the edge of the document page.
For example, a document may have margins, so blank regions remain on the left and right sides of a line text image. Such blank regions may be text sparse regions and may be marked with second sub-bounding boxes.
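A minimal sketch of generating the first and second sub-bounding boxes for one column of first bounding boxes; coordinates are (x0, y0, x1, y1) with y increasing downwards, and all names are illustrative:

```python
def second_boxes(first_boxes, page_w):
    """Generate the second bounding boxes for one column of line boxes.

    first_boxes must be sorted top to bottom. The gaps between vertically
    adjacent line boxes become page-wide first sub-bounding boxes; the
    margins beside each line box become second sub-bounding boxes.
    """
    subs = []
    for (ax0, ay0, ax1, ay1), (bx0, by0, bx1, by1) in zip(first_boxes, first_boxes[1:]):
        if by0 > ay1:                         # blank strip between two lines
            subs.append((0.0, ay1, page_w, by0))
    for x0, y0, x1, y1 in first_boxes:
        if x0 > 0:
            subs.append((0.0, y0, x0, y1))    # left margin strip
        if x1 < page_w:
            subs.append((x1, y0, page_w, y1)) # right margin strip
    return subs
```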
In the disclosed embodiment, the document page may contain at least two columns of text.
For example, in some papers, each document page contains two columns of text.
In the embodiment of the present disclosure, for each of the plurality of first bounding boxes, at least one overlapping region is determined according to the position information of the first bounding box and the position information of the second bounding boxes, where one overlapping region corresponds to at least one second bounding box.
For example, in a document containing two columns of text, a plurality of first bounding boxes and a plurality of second bounding boxes are generated for the line text images in each column. In such a document, the position information of the line text images in the two columns is not completely consistent, so a first bounding box may overlap one or more second bounding boxes.
In the embodiment of the present disclosure, for the at least one overlapping region, each overlapping region is removed from the at least one second bounding box corresponding to it, so as to obtain a plurality of adjusted second bounding boxes.
For example, each overlapping region may be removed from the second bounding box corresponding to that overlapping region. For another example, the overlapping region together with the regions above or below it may be removed from the at least one second bounding box corresponding to each overlapping region.
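A minimal sketch of the overlap removal; only the horizontal split is shown, which also discards the regions directly above and below the overlap, one of the variants the disclosure allows:

```python
def remove_overlap(second_box, first_box):
    """Remove from second_box its overlapping region with first_box.

    Boxes are (x0, y0, x1, y1) with y increasing downwards. The parts of
    second_box left and right of the overlap are kept; the vertical strip
    containing the overlap is discarded.
    """
    sx0, sy0, sx1, sy1 = second_box
    fx0, fy0, fx1, fy1 = first_box
    if fx1 <= sx0 or fx0 >= sx1 or fy1 <= sy0 or fy0 >= sy1:
        return [second_box]                   # no overlap: keep unchanged
    pieces = []
    if sx0 < fx0:
        pieces.append((sx0, sy0, fx0, sy1))   # part left of the overlap
    if fx1 < sx1:
        pieces.append((fx1, sy0, sx1, sy1))   # part right of the overlap
    return pieces
```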
In operation S230, a merging operation is performed on the adjacent second bounding boxes, resulting in a plurality of candidate bounding boxes.
In an embodiment of the present disclosure, the candidate bounding box is a rectangle.
For example, the width of a second sub-bounding box generated on the left side of a first bounding box may be the length from the left border of the first bounding box to the left edge of the document page.
For another example, the width of a second sub-bounding box generated on the right side of a first bounding box may be the length from the right border of the first bounding box to the right edge of the document page.
In the embodiment of the present disclosure, after the merging operation is performed on the adjacent second bounding boxes, the width of the obtained candidate bounding box may be the same as that of the second sub-bounding box, and the height of the obtained candidate bounding box may be greater than or equal to the height of any merged second bounding box.
In the embodiment of the present disclosure, a dividing operation is performed on each first sub-bounding box to obtain a plurality of third sub-bounding boxes equal in width to the second sub-bounding box.
For example, the width of the first sub-bounding box may be greater than the width of the second sub-bounding box, so the width of the first sub-bounding box needs to be adjusted before the merging operation can be performed. After the dividing operation is performed, a plurality of third sub-bounding boxes are obtained, together with the remaining part of the first sub-bounding box other than the third sub-bounding boxes.
In the embodiment of the present disclosure, a merging operation is performed on the third sub-bounding boxes and the second sub-bounding boxes to obtain candidate bounding boxes.
For example, when a plurality of second bounding boxes are adjacent, the width of the candidate bounding box may be the same as that of the second sub-bounding box, and the height of the candidate bounding box may be greater than or equal to the height of the adjacent second and third sub-bounding boxes.
For example, after the merging operation is performed, at least one third sub-bounding box is merged with at least one second sub-bounding box, and the resulting box may be used as a candidate bounding box.
For another example, after the merging operation is performed, the remaining part of a first sub-bounding box may be used as a candidate bounding box.
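A minimal sketch of the merging operation, assuming candidate bounding boxes are formed by stacking vertically adjacent boxes of nearly equal width; the tolerance eps stands in for the preset difference range:

```python
def merge_adjacent(boxes, eps=1.0):
    """Merge vertically adjacent boxes whose left/right edges agree within eps.

    Boxes are (x0, y0, x1, y1) with y increasing downwards; the result is a
    list of rectangular candidate bounding boxes.
    """
    boxes = sorted(boxes, key=lambda b: (b[0], b[1]))
    merged = []
    for box in boxes:
        if merged:
            mx0, my0, mx1, my1 = merged[-1]
            same_column = abs(box[0] - mx0) <= eps and abs(box[2] - mx1) <= eps
            touching = box[1] <= my1 + eps
            if same_column and touching:      # stack onto the previous box
                merged[-1] = (mx0, my0, mx1, max(my1, box[3]))
                continue
        merged.append(box)
    return merged
```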
In operation S240, for the plurality of candidate bounding boxes, a plurality of partial images of the document page are determined according to the position information of each candidate bounding box.
For example, each candidate bounding box determines a region on the document page, and the image within that region can then be extracted as a partial image.
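Assuming the document page has been rendered to an image whose pixel coordinates match the candidate bounding boxes, the partial images might be cropped as follows (a sketch using Pillow):

```python
from PIL import Image

def crop_partials(page_image_path, candidate_boxes):
    """Crop one partial image of the document page per candidate bounding box."""
    page = Image.open(page_image_path)
    return [page.crop(tuple(int(v) for v in box)) for box in candidate_boxes]
```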
In operation S250, a target image is generated according to contents in the plurality of partial images.
For example, the content of a partial image may be a chart, or it may be a blank region or a landscape picture. A partial image whose content is a chart may be taken as the target image.
For example, a white border may be removed from the target image according to the margin of the document page.
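The disclosure does not prescribe how a chart is told apart from a blank or pictorial partial image; the ink-ratio test below and the white-border trim are heuristics assumed by this sketch:

```python
import numpy as np

def is_chart(partial, ink_thresh=0.02):
    """Treat a partial image with enough non-white pixels as a chart
    (a heuristic of this sketch, not prescribed by the disclosure)."""
    gray = np.asarray(partial.convert("L"))
    return (gray < 200).mean() > ink_thresh

def trim_white(partial, white=250):
    """Remove the white border around the content of the target image."""
    gray = np.asarray(partial.convert("L"))
    mask = gray < white
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    if rows.size == 0:                        # entirely white: nothing to trim
        return partial
    return partial.crop((int(cols[0]), int(rows[0]),
                         int(cols[-1]) + 1, int(rows[-1]) + 1))
```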
According to the embodiments of the present disclosure, charts in a document page are typically located in text sparse regions, so the above operations can accurately extract charts from document pages, in particular from papers laid out in multiple columns of text. Charts can thus be extracted quickly and accurately from a large number of papers, saving substantial manual effort.
FIG. 3A to FIG. 3B are schematic diagrams according to one embodiment of the present disclosure.
As shown in FIG. 3A, the document page 301 contains three line text images, and three first bounding boxes can be generated according to the position information of each line text image. The three first bounding boxes include, for example, the first bounding box 3021 and the first bounding box 3022 in FIG. 3A, as well as a first bounding box corresponding to "this line of text is a second example".
From the position information of the three first bounding boxes, a plurality of second bounding boxes may be generated, for example the four second sub-bounding boxes 3032 in FIG. 3A. FIG. 3A also shows a first sub-bounding box 30311 and a first sub-bounding box 30312.
As shown in FIG. 3B, a merging operation may be performed on the adjacent second bounding boxes to obtain four candidate bounding boxes, such as the candidate bounding box 3041, the candidate bounding box 3042, and the candidate bounding box 3043 in FIG. 3B. The remaining part of the first sub-bounding box between the first bounding box 3021 and the first bounding box labeled "this line of text is only an example" may also serve as a candidate bounding box. After the merging operation is performed, the candidate bounding box 3043 is the remaining part of the first sub-bounding box 30312. For the four candidate bounding boxes, a plurality of partial images of the document page 301 may be determined according to the position information of each candidate bounding box. From the contents of the plurality of partial images, a target image may be determined; for example, the partial image determined from the candidate bounding box 3043 is determined as the target image.
FIG. 4A to FIG. 4C are schematic diagrams according to another embodiment of the present disclosure.
As shown in FIG. 4A, the document page 401 includes two columns of text. The column of text on the left side of the document page 401 includes four line text images, while the column of text on the right side includes one line text image and one chart 405.
Five first bounding boxes, for example the first bounding box 4021 and the first bounding box 4022 in FIG. 4A, may be generated from the position information of the line text images in the document page.
A plurality of second bounding boxes may be generated according to the position information of the five first bounding boxes. The second bounding boxes may include, for example, the first sub-bounding box 4031 between the "example title" and the "left-hand side column text example" shown in FIG. 4A, the second sub-bounding boxes 4033 on both sides of the "left-hand side column text example", and the second sub-bounding boxes 4032 on both sides of the "right-hand side column text example".
As shown in FIG. 4A, a third bounding box 406 may be generated above the first bounding box 4021 in the document page 401, or a third bounding box 406 may be generated below the first bounding box 4022 in the document page 401.
As shown in FIG. 4A, the first bounding box labeled "right-hand side column text example" has an overlapping region with both the first sub-bounding box 4031 and the second sub-bounding box 4033. The overlapping regions may be removed from the first sub-bounding box 4031 and the second sub-bounding box 4033, resulting, for example, in the adjusted first sub-bounding box 4031' and the adjusted second sub-bounding box 4033' in FIG. 4B. In a similar manner, each overlapping region may be removed from the at least one second bounding box corresponding to it, resulting in the other adjusted second bounding boxes.
A merging operation may be performed on the adjacent second bounding boxes, resulting in a plurality of candidate bounding boxes. During the merging operation, the second sub-bounding box 4035 and the adjusted second sub-bounding box 4033' are adjacent to the first sub-bounding box 4034. Adjacent second bounding boxes whose width difference is within a preset difference range may be merged; for example, the difference between the widths of the first sub-bounding box 4034 and the second sub-bounding box 4035 is small and within the preset difference range, so the first sub-bounding box 4034 and the second sub-bounding box 4035 may be merged.
In one example, a dividing operation is performed on the first sub-bounding box 4034, resulting in a third sub-bounding box equal in width to the second sub-bounding box 4035. In a similar manner, a dividing operation is performed on the first sub-bounding box 4036, followed by a merging operation, resulting, for example, in the candidate bounding box 4043 in FIG. 4C.
The plurality of candidate bounding boxes may include, for example, the candidate bounding box 4041, the candidate bounding box 4042, the candidate bounding box 4043, and the candidate bounding box 4044 in FIG. 4C. For the plurality of candidate bounding boxes, a plurality of partial images of the document page 401 may be determined according to the position information of each candidate bounding box. From the contents of the plurality of partial images, a target image may be determined; for example, the partial image determined from the candidate bounding box 4043 may be determined as the target image. The target image may be, for example, the line chart 405 in FIG. 4C.
It should be noted that the width of the second bounding boxes in FIG. 3A to FIG. 3B and FIG. 4A to FIG. 4C may be, for example, the length from the edge of a first bounding box to the edge of the document page; in the drawings, however, the width of the second bounding boxes is somewhat reduced so that the edges of the first bounding boxes, the edges of the second bounding boxes, and the edges of the document page can be clearly distinguished.
FIG. 5 is a flowchart of a data extraction method for an image according to one embodiment of the present disclosure.
As shown in FIG. 5, the data extraction method 500 for an image may include operations S510 to S530.
In operation S510, coordinates of N marker points located on a coordinate axis in the target image are determined according to a pixel value of each pixel within the target image.
For example, the target image may be the line chart 405 in FIG. 4C.
For example, the marker points may be the end points of the tick marks on a coordinate axis.
In the embodiment of the present disclosure, pixel analysis may be performed on the target image to obtain a pixel value of each pixel.
For example, from the result of the pixel analysis, runs of consecutive pixels having the same pixel value can be found in the target image, from which the horizontal and vertical line segments in the target image can be determined. The longer, mutually perpendicular line segments are taken as coordinate axes; for example, the longest horizontal line segment is taken as the horizontal axis and the longest vertical line segment as the vertical axis. The origin is then determined from the coordinate axes; in one example, the intersection of the horizontal axis and the vertical axis may be used as the origin.
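A minimal sketch of locating the axes from dark pixels; a simple row/column-count heuristic stands in for the full run-of-equal-pixels analysis, and the threshold 128 is an assumption:

```python
import numpy as np

def find_axes(gray, ink=128):
    """Locate the horizontal and vertical axes in a grayscale chart image.

    The row (column) containing the most dark pixels approximates the
    longest horizontal (vertical) line segment; their intersection can
    then serve as the origin.
    """
    dark = gray < ink
    h_axis_row = int(dark.sum(axis=1).argmax())
    v_axis_col = int(dark.sum(axis=0).argmax())
    return h_axis_row, v_axis_col
```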
In the disclosed embodiment, the coordinate axis includes M pixels.
For example, the horizontal axis includes M pixels.
In the embodiment of the present disclosure, the M pixels of each of the K rows of pixels closest to the coordinate axis may be obtained, where K ≥ 1.
For example, the K rows of pixels closest to the horizontal axis may lie above or below the horizontal axis. For another example, the K columns of pixels closest to the vertical axis may lie on its left or right side.
For example, with K = 2, the M pixels of the first row and the M pixels of the second row of the 2 rows of pixels closest to the horizontal axis may be acquired. The first row of pixels is the closest to the horizontal axis, and the second row of pixels is the second closest.
In the embodiment of the present disclosure, the similarity between the jth pixel of the coordinate axis and the jth pixel of the first row of pixels is calculated, and the similarity between the jth pixel of the coordinate axis and the jth pixel of the second row of pixels is calculated.
For example, in response to the similarity between the jth pixel on the coordinate axis and the jth pixel in each row of pixels being greater than a preset similarity threshold, the similarity between the (j-1)th pixel on the coordinate axis and the (j-1)th pixel in each row of pixels being less than the preset similarity threshold, and the similarity between the (j+λ)th pixel on the coordinate axis and the (j+λ)th pixel in each row of pixels being less than the preset similarity threshold, the jth pixel on the coordinate axis is determined to be a marker point, where j = 2, ..., M, and λ is a preset natural number. In one example, the coordinate axis may be the horizontal axis. In another example, λ may be 1. In another example, the preset similarity threshold may be 50%. In another example, when the tick marks are wider, λ may be 3.
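A sketch of the marker point test, reducing the pixel-value similarity comparison to a binary dark/light check; axis_row, K, and λ are parameters, and the K rows below the axis are used here, although rows above the axis work symmetrically:

```python
def tick_columns(dark, axis_row, k=2, lam=1):
    """Find the x-coordinates of tick-mark end points along the horizontal axis.

    dark is a 2-D boolean array (True = ink). Column j is a marker point
    when the K rows nearest the axis are dark at j (similar to the axis
    pixel) but light at j - 1 and j + lam, mirroring the disclosure's
    similarity conditions.
    """
    w = dark.shape[1]
    band = dark[axis_row + 1 : axis_row + 1 + k]   # K rows nearest the axis
    ticks = []
    for j in range(1, w - lam):
        if band[:, j].all() and not band[:, j - 1].any() and not band[:, j + lam].any():
            ticks.append(j)
    return ticks
```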
In operation S520, a dividing operation is performed on the target image according to the coordinates of the N marker points to obtain N +1 sub-regions.
For example, N vertical lines passing through N marking points on the horizontal axis are generated, resulting in N +1 sub-regions.
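The division step itself is a one-liner once the N tick x-coordinates are known (a sketch; img_w is the image width in pixels):

```python
def subregions(img_w, ticks):
    """Split [0, img_w) at the N tick x-coordinates into N + 1 (x0, x1) spans."""
    cuts = [0] + sorted(ticks) + [img_w]
    return list(zip(cuts[:-1], cuts[1:]))
```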
In operation S530, a text recognition operation is performed on the ith sub-region of the N +1 sub-regions, resulting in ith group of data corresponding to the ith sub-region.
In the embodiment of the present disclosure, i = 1, ..., N+1.
For example, with 5 tick marks on the horizontal axis, 6 sub-regions can be obtained. Each sub-region corresponds to a value on the horizontal axis. After the text recognition operation is performed, data can be extracted from the sub-regions, and a corresponding relationship between the numerical value on the horizontal axis and the data extracted from the sub-regions is established.
The text recognition operation may be an OCR (optical character recognition) operation, or it may be another operation. For example, pixel analysis may be used to determine an inflection point or end point of the line chart within a sub-region, and the distance from that point to the horizontal axis then determines the data value corresponding to it.
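A sketch of the per-sub-region text recognition; pytesseract is used as one possible OCR backend, which is an assumption of this sketch rather than a requirement of the disclosure:

```python
import pytesseract  # assumed OCR backend; any OCR engine would do

def group_data(chart_image, spans):
    """Run text recognition on each of the N + 1 sub-regions of the chart."""
    groups = []
    for x0, x1 in spans:
        sub = chart_image.crop((x0, 0, x1, chart_image.height))
        groups.append(pytesseract.image_to_string(sub).strip())
    return groups
```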
Through the embodiments of the present disclosure, data can be accurately extracted from bar charts and line charts, in particular from the bar charts and line charts of papers containing two columns of text.
FIG. 6A to FIG. 6B are schematic diagrams of a data extraction method for an image according to one embodiment of the present disclosure.
As shown in FIG. 6A, the coordinates of N marker points in the target image 601, for example the coordinates of 5 marker points, may be determined according to the pixel value of each pixel in the target image 601. One of the marker points is the end point of the first tick mark 605 on the horizontal axis 602.
For example, runs of consecutive pixels having the same pixel value can be found in the target image, from which the horizontal and vertical line segments in the target image can be determined; the longer, mutually perpendicular line segments are taken as coordinate axes. For example, the longest horizontal line segment may be taken as the horizontal axis 602, and the two longest vertical line segments as the first vertical axis 603 and the second vertical axis 604. As shown in FIG. 6A, the horizontal axis 602 is the longest of the horizontal line segments, and the first vertical axis 603 and the second vertical axis 604 are the longest of the vertical line segments. The origin is determined from the coordinate axes; the intersection of the horizontal axis 602 and the first vertical axis 603 may be used as the origin.
After determining the horizontal axis 602 from the target image 601, the scale lines on the horizontal axis may be determined by referring to the method described above with respect to operation S510. The end points of the graduation marks on the horizontal axis 602 may be used as mark points, resulting in 5 mark points.
As shown in FIG. 6B, a dividing operation is performed on the target image according to the coordinates of the 5 marker points, resulting in 6 sub-regions. The 6 sub-regions correspond to 6 values on the horizontal axis, which are 2014, 2015, 2016, 2017, 2018, and 2019, respectively.
Then, the text recognition operation is performed on the 6 sub-regions, respectively, to obtain 6 data, respectively 127, 163, 283, 113, 189, and 435.
Finally, 6 sets of structured data (2014, 127), (2015, 163), (2016, 283), (2017, 113), (2018, 189), (2019, 435) can be extracted as a result of text recognition.
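As a usage sketch under the same assumptions, pairing the horizontal-axis values with the recognized group data reproduces the structured tuples above:

```python
years = [2014, 2015, 2016, 2017, 2018, 2019]   # values on the horizontal axis
values = [127, 163, 283, 113, 189, 435]        # data recognized per sub-region
structured = list(zip(years, values))
# [(2014, 127), (2015, 163), (2016, 283), (2017, 113), (2018, 189), (2019, 435)]
```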
FIG. 7 is a block diagram of a document processing device according to one embodiment of the present disclosure.
As shown in FIG. 7, the apparatus 700 may include a first generation module 710, a second generation module 720, a merging module 730, a first determination module 740, and a third generation module 750.
The first generating module 710 is configured to generate a plurality of first bounding boxes according to the position information of the line text image in the document page. In some embodiments, the first generating module 710 may be configured to perform the operation S210, which is not described herein again.
A second generating module 720, configured to generate a plurality of second bounding boxes according to the location information of the plurality of first bounding boxes, where each second bounding box is used to mark a text sparse region in the document page. In some embodiments, the second generating module 720 may be configured to perform the operation S220, which is not described herein again.
And a merging module 730, configured to perform a merging operation on the adjacent second bounding boxes to obtain a plurality of candidate bounding boxes. In some embodiments, the merging module 730 may be configured to perform the operation S230, which is not described herein again.
The first determining module 740 is configured to determine, for the candidate bounding boxes, a plurality of partial images of the document page according to the position information of each candidate bounding box. In some embodiments, the first determining module 740 may be configured to perform the operation S240, which is not described herein again.
A third generating module 750, configured to generate a target image according to the content in the plurality of partial images. In some embodiments, the third generating module 750 may be configured to perform the operation S250, which is not described herein again.
In some embodiments, the second bounding box includes a first sub-bounding box and a second sub-bounding box, and the second generating module includes: a first generating unit, configured to generate a first sub-bounding box between any two vertically adjacent first bounding boxes; and a second generating unit, configured to generate a second sub-bounding box on the left side and/or the right side of each first bounding box.
In some embodiments, the candidate bounding box is a rectangle, and the width of the second sub-bounding box is the length from the edge of the first bounding box to the edge of the document page; the merging module includes: a dividing unit, configured to perform a dividing operation on each first sub-bounding box to obtain a plurality of third sub-bounding boxes equal in width to the second sub-bounding box; and a merging unit, configured to perform a merging operation on the third sub-bounding boxes and the second sub-bounding boxes to obtain candidate bounding boxes.
In some embodiments, the second generating module includes: a first determining unit, configured to determine, for each of the plurality of first bounding boxes, at least one overlapping region according to the position information of the first bounding box and the position information of the second bounding boxes, where one overlapping region corresponds to at least one second bounding box; and a removing unit, configured to remove, for the at least one overlapping region, each overlapping region from the at least one second bounding box corresponding to it, so as to obtain a plurality of adjusted second bounding boxes.
In some embodiments, the first generating module is further configured to generate a plurality of first bounding boxes for each line of text in the document page according to the position information of each text image in the document page and the height of each text image.
FIG. 8 is a block diagram of a data extraction apparatus for an image according to another embodiment of the present disclosure.
As shown in FIG. 8, the apparatus 800 may include a second determination module 810, a division module 820, and a text recognition module 830.
The second determining module 810 is configured to determine coordinates of N marker points located on a coordinate axis in the target image according to a pixel value of each pixel in the target image. In some embodiments, the second determining module 810 may be configured to perform the operation S510, which is not described herein again.
A dividing module 820, configured to perform a dividing operation on the target image according to the coordinates of the N marker points, so as to obtain N+1 sub-regions. In some embodiments, the dividing module 820 may be configured to perform the operation S520, which is not described herein again.
A text recognition module 830, configured to perform a text recognition operation on the ith sub-region of the N+1 sub-regions to obtain the ith group of data corresponding to the ith sub-region, where i = 1, ..., N+1. In some embodiments, the text recognition module 830 may be configured to perform the operation S530, which is not described herein again.
Wherein the above-mentioned target image is generated according to, for example, the document processing apparatus in fig. 7.
In some embodiments, the coordinate axis includes M pixels, and the second determining module includes: an acquisition unit, configured to acquire the M pixels of each of the K rows of pixels closest to the coordinate axis, where K ≥ 1; and a second determining unit, configured to determine the jth pixel on the coordinate axis to be a marker point in response to the similarity between the jth pixel on the coordinate axis and the jth pixel in each row of pixels being greater than a preset similarity threshold, the similarity between the (j-1)th pixel on the coordinate axis and the (j-1)th pixel in each row of pixels being less than the preset similarity threshold, and the similarity between the (j+λ)th pixel on the coordinate axis and the (j+λ)th pixel in each row of pixels being less than the preset similarity threshold, where j = 2, ..., M, and λ is a preset natural number.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in FIG. 9, the device 900 includes a computing unit 901, which can perform various appropriate actions and processes in accordance with a computer program stored in a read-only memory (ROM) 902 or a computer program loaded from a storage unit 908 into a random access memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 901 performs the respective methods and processes described above, such as a document processing method or a data extraction method for an image. For example, in some embodiments, the document processing method or the data extraction method for an image may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the above-described document processing method or data extraction method for an image may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the document processing method or the data extraction method for an image by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (17)

1. A document processing method, comprising:
generating a plurality of first bounding boxes according to the position information of the line text images in the document page;
generating a plurality of second bounding boxes according to the position information of the plurality of first bounding boxes, wherein each second bounding box is used for marking a text sparse region in the document page;
performing a merging operation on the adjacent second bounding boxes to obtain a plurality of candidate bounding boxes;
for the plurality of candidate bounding boxes, determining a plurality of partial images of the document page according to the position information of each candidate bounding box; and
generating a target image according to the contents in the plurality of partial images.
2. The method of claim 1, wherein the second bounding box comprises a first sub-bounding box and a second sub-bounding box;
the generating a plurality of second bounding boxes according to the position information of the plurality of first bounding boxes comprises:
generating a first sub-bounding box between any two vertically adjacent first bounding boxes; and
generating a second sub-bounding box on the left side and/or the right side of each of the first bounding boxes.
3. The method of claim 2, wherein the candidate bounding box is a rectangle, and the width of the second sub-bounding box is the length from the edge of the first bounding box to the edge of the document page;
the performing a merging operation on the adjacent second bounding boxes to obtain a plurality of candidate bounding boxes includes:
performing a dividing operation on each first sub-bounding box to obtain a plurality of third sub-bounding boxes equal in width to the second sub-bounding box; and
performing a merging operation on the third sub-bounding boxes and the second sub-bounding boxes to obtain candidate bounding boxes.
4. The method of any of claims 1 to 3, wherein the generating a plurality of second bounding boxes according to the location information of the plurality of first bounding boxes comprises:
determining, for each first bounding box in the plurality of first bounding boxes, at least one overlapping region according to the position information of the first bounding box and the position information of the second bounding boxes, wherein one overlapping region corresponds to at least one second bounding box; and
removing, for the at least one overlapping region, each overlapping region from the at least one second bounding box corresponding to it, so as to obtain a plurality of adjusted second bounding boxes.
5. The method of claim 1, wherein the generating a plurality of first bounding boxes according to the position information of the line text image in the document page comprises:
generating a plurality of first bounding boxes for each line of text in the document page according to the position information of each text image in the document page and the height of each text image.
6. A data extraction method for an image, comprising:
determining the coordinates of N marker points located on a coordinate axis in a target image according to the pixel value of each pixel in the target image;
performing a dividing operation on the target image according to the coordinates of the N marker points to obtain N+1 sub-regions; and
performing a text recognition operation on the ith sub-region of the N+1 sub-regions to obtain the ith group of data corresponding to the ith sub-region, wherein i = 1, ..., N+1;
wherein the target image is generated according to the document processing method of any one of claims 1 to 5.
7. The method of claim 6, wherein the coordinate axis comprises M pixels;
the determining coordinates of N marker points located on a coordinate axis in a target image according to a pixel value of each pixel in the target image comprises:
acquiring the M pixels of each row of pixels in the K rows of pixels closest to the coordinate axis, where K ≥ 1; and
determining that a j-th pixel on the coordinate axis is a marker point in response to a similarity between the j-th pixel on the coordinate axis and a j-th pixel in each row of pixels being greater than a preset similarity threshold, a similarity between a (j-1)-th pixel on the coordinate axis and a (j-1)-th pixel in each row of pixels being less than the preset similarity threshold, and a similarity between a (j+λ)-th pixel on the coordinate axis and a (j+λ)-th pixel in each row of pixels being less than the preset similarity threshold, where j = 2, ..., M, and λ is a preset natural number.
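A sketch of the marker detection of claim 7, assuming a grayscale image, a horizontal coordinate axis drawn on row axis_row (with axis_row ≥ K), and similarity taken as inverse normalized intensity difference; none of these choices is fixed by the claim. A pixel qualifies when all K rows nearest the axis agree with the axis at position j but not at j-1 and not at j+λ, which is the protrusion pattern of a tick mark on the axis:

import numpy as np

def marker_points(img: np.ndarray, axis_row: int, K: int = 3,
                  lam: int = 5, threshold: float = 0.9):
    # img is a 2-D grayscale array; the coordinate axis occupies axis_row.
    axis = img[axis_row].astype(float)
    rows = img[axis_row - K:axis_row].astype(float)  # K rows above the axis
    M = axis.shape[0]

    def sim(j):
        # smallest similarity between axis pixel j and pixel j of each row,
        # so the condition holds for "each row" as the claim requires
        return float(np.min(1.0 - np.abs(rows[:, j] - axis[j]) / 255.0))

    return [j for j in range(1, M - lam)
            if sim(j) > threshold and sim(j - 1) < threshold
            and sim(j + lam) < threshold]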
8. A document processing apparatus, comprising:
a first generating module configured to generate a plurality of first bounding boxes according to position information of line text images in a document page;
a second generating module configured to generate a plurality of second bounding boxes according to position information of the plurality of first bounding boxes, wherein each second bounding box marks a text-sparse region in the document page;
a merging module configured to perform a merging operation on adjacent second bounding boxes to obtain a plurality of candidate bounding boxes;
a first determining module configured to determine, for the plurality of candidate bounding boxes, a plurality of local images of the document page according to position information of each candidate bounding box; and
a third generating module configured to generate a target image according to content in the plurality of local images.
9. The apparatus of claim 8, wherein the second bounding box comprises a first sub-bounding box and a second sub-bounding box;
the second generating module comprises:
a first generating unit configured to generate a first sub-bounding box between any two vertically adjacent first bounding boxes; and
a second generating unit configured to generate a second sub-bounding box on a left side and/or a right side of each first bounding box.
10. The apparatus of claim 9, wherein the candidate bounding box is a rectangle, and a width of the second sub-bounding box is a distance from an edge of the first bounding box to an edge of the document page;
the merging module comprises:
a dividing unit configured to perform a dividing operation on each first sub-bounding box to obtain a plurality of third sub-bounding boxes each having a width equal to that of the second sub-bounding box; and
a merging unit configured to perform a merging operation on the third sub-bounding boxes and the second sub-bounding boxes to obtain the candidate bounding boxes.
11. The apparatus of any one of claims 8 to 10, wherein the second generating module comprises:
a first determining unit configured to determine, for each first bounding box in the plurality of first bounding boxes, at least one overlapping area according to position information of the first bounding box and position information of the second bounding box, wherein one overlapping area corresponds to at least one second bounding box; and
a removing unit configured to remove, for the at least one overlapping area, each overlapping area from the at least one second bounding box corresponding to that overlapping area, to obtain a plurality of adjusted second bounding boxes.
12. The apparatus of claim 8, wherein the first generating module is further configured to generate the plurality of first bounding boxes, one for each line of text in the document page, according to position information of each character image in the document page and a height of each character image.
13. A data extraction apparatus for an image, comprising:
a second determining module configured to determine coordinates of N marker points located on a coordinate axis in a target image according to a pixel value of each pixel in the target image;
a dividing module configured to perform a dividing operation on the target image according to the coordinates of the N marker points to obtain N+1 sub-areas; and
a text recognition module configured to perform a text recognition operation on an i-th sub-area of the N+1 sub-areas to obtain an i-th group of data corresponding to the i-th sub-area, where i = 1, ..., N+1;
wherein the target image is generated by the document processing apparatus according to any one of claims 8 to 12.
14. The apparatus of claim 13, wherein the coordinate axis comprises M pixels;
the second determining module comprises:
an acquiring unit configured to acquire the M pixels of each row of pixels in the K rows of pixels closest to the coordinate axis, where K ≥ 1; and
a second determining unit configured to determine that a j-th pixel on the coordinate axis is a marker point in response to a similarity between the j-th pixel on the coordinate axis and a j-th pixel in each row of pixels being greater than a preset similarity threshold, a similarity between a (j-1)-th pixel on the coordinate axis and a (j-1)-th pixel in each row of pixels being less than the preset similarity threshold, and a similarity between a (j+λ)-th pixel on the coordinate axis and a (j+λ)-th pixel in each row of pixels being less than the preset similarity threshold,
where j = 2, ..., M, and λ is a preset natural number.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 7.
16. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1 to 7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 7.
CN202111156200.2A 2021-09-29 2021-09-29 Document processing method and device, and data extraction method and device for image Pending CN113886582A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111156200.2A CN113886582A (en) 2021-09-29 2021-09-29 Document processing method and device, and data extraction method and device for image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111156200.2A CN113886582A (en) 2021-09-29 2021-09-29 Document processing method and device, and data extraction method and device for image

Publications (1)

Publication Number Publication Date
CN113886582A true CN113886582A (en) 2022-01-04

Family

ID=79004572

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111156200.2A Pending CN113886582A (en) 2021-09-29 2021-09-29 Document processing method and device, and data extraction method and device for image

Country Status (1)

Country Link
CN (1) CN113886582A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination