CN111414919B

CN111414919B - Method, device, equipment and storage medium for extracting text of printed body picture with table

Info

Publication number: CN111414919B
Application number: CN202010225345.2A
Authority: CN
Inventors: 李佳; 杨阳; 刘旭东
Original assignee: Guangzhou Juying Information Technology Co ltd
Current assignee: Guangzhou Juying Information Technology Co ltd
Priority date: 2020-03-26
Filing date: 2020-03-26
Publication date: 2023-12-12
Anticipated expiration: 2040-03-26
Also published as: CN111414919A

Abstract

The application discloses a method, an apparatus, a device and a storage medium for extracting text of a printed body picture with a table. The method comprises: removing, from the identified transverse lines and/or vertical lines, the transverse lines and/or vertical lines whose projection integral is smaller than a first preset interval threshold value, and obtaining a table of the binarized picture; and deleting the table of the binarized picture while retaining the text content in the binarized picture. According to the method, after the table is extracted from the picture, the interference lines in the table are removed, so that the text of the picture is accurately extracted based on the table with the interference lines removed.

Description

Method, device, equipment and storage medium for extracting text of printed body picture with table

Technical Field

The present application relates to the field of text recognition, and in particular, to a method, an apparatus, a device, and a storage medium for extracting text from a printed body picture with a table.

Background

In general, character recognition of printed matter begins with character extraction. A certain proportion of text pictures contain tables, and the tables must be removed before the characters can be extracted in preparation for the subsequent character recognition. Therefore, there is a need for an accurate and efficient technique for removing the table from a picture, thereby achieving text extraction.

In prior approaches to removing tables from text pictures, the basic flow is to convert the original picture to grayscale and binarize it, and then to extract the horizontal lines and the vertical lines by erosion and dilation algorithms (referred to herein as the corrosion algorithm and the expansion algorithm), thereby realizing table extraction. However, this prior art has two defects. First, the length of the structural element must be fixed during corrosion and expansion; this length usually has a default value and needs to be adjusted for different pictures to achieve the best effect. Second, strokes of characters are sometimes connected so that they form an interference short line rather than an actual horizontal or vertical line. Because of these two defects, the recognition accuracy of conventional character recognition technology is low.

Disclosure of Invention

The application aims to disclose a method, a device, equipment and a storage medium for extracting characters of a printed picture with a table, which are used for removing interference lines in the table picture after the table picture in the picture is extracted, so that the characters of the picture are accurately extracted based on the table with the interference lines removed.

The first aspect of the application discloses a method for extracting text of a printed body picture with a table, which comprises the following steps:

acquiring a picture to be processed, wherein the picture to be processed comprises a table;

calculating the gray value of each pixel point of the picture to be processed according to the RGB value of each pixel point;

sequentially comparing the gray value of each pixel point with a preset threshold value, and converting the gray value of each pixel point into 0 or 255 according to a comparison result so as to convert the picture to be processed into a binarized picture;

identifying a plurality of transverse lines and/or a plurality of vertical lines in the binarized picture according to structural elements, a corrosion algorithm and an expansion algorithm;

calculating the horizontal projection integral of each transverse line and/or the horizontal projection integral of each vertical line;

removing transverse lines and/or vertical lines with projection integral smaller than a first preset interval threshold value from the transverse lines and/or the vertical lines and obtaining a table of the binarized picture;

and deleting the table of the binarized picture and reserving the text content in the binarized picture.

The method of the first aspect of the application can remove the interference lines in the form pictures extracted from the pictures, thereby accurately extracting the characters of the pictures based on the form with the interference lines removed, and compared with the prior art, the method has better recognition accuracy.

As an alternative embodiment, after said calculating the horizontal projection integral of each of said transverse lines and/or the horizontal projection integral of each of said vertical lines, and before said removing, from said plurality of transverse lines and/or plurality of vertical lines, transverse lines having a projection integral smaller than a first preset interval threshold and/or vertical lines having a projection integral smaller than a second preset interval threshold and obtaining a table of said binarized picture, said method further comprises:

acquiring the horizontal projection integral of the text content of the binarized picture in the row direction and the horizontal projection integral of the text content in the column direction;

taking the magnitude of the horizontal projection integral of the text content in the row direction as the first preset interval threshold;

and taking the horizontal projection integral of the text content in the column direction as the second preset interval threshold value.
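The threshold derivation above can be illustrated with a minimal sketch. It assumes that the text content is available as a mask whose foreground pixels are 255, that a projection integral is the count of foreground pixels along a row or a column, and that the "magnitude" is reduced to a single number by taking the maximum. The text does not specify that reduction, so the maximum is purely an assumption; the names `projection_integrals` and `interval_thresholds` are hypothetical.

```python
def projection_integrals(mask):
    """Count foreground (255) pixels along every row and every column."""
    row_proj = [sum(1 for v in row if v == 255) for row in mask]
    col_proj = [sum(1 for row in mask if row[x] == 255)
                for x in range(len(mask[0]))]
    return row_proj, col_proj


def interval_thresholds(text_mask):
    """Derive the first (row) and second (column) interval thresholds.

    Assumption: the largest projection value is used as the threshold.
    """
    row_proj, col_proj = projection_integrals(text_mask)
    return max(row_proj), max(col_proj)


# Tiny hypothetical text mask: two strokes in the first row, one in the second.
text_mask = [
    [0, 255, 255, 0],
    [0, 255, 0, 0],
    [0, 0, 0, 0],
]
first_threshold, second_threshold = interval_thresholds(text_mask)
```

Under this reading, a candidate line survives only if it is at least as long as the densest row or column of text, which is one plausible way to separate real table lines from strokes that happen to be connected.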

As an alternative embodiment, the identifying a plurality of horizontal lines and/or a plurality of vertical lines in the binary image according to a structural element, a corrosion algorithm and an expansion algorithm includes:

determining the structural elements according to the number of rows and the number of columns of the binarized picture;

carrying out corrosion operation on the structural element and each pixel in the binarized picture to obtain a corrosion operation result;

performing expansion operation on the structural element and each pixel in the binarized picture to obtain an expansion operation result;

and identifying a plurality of transverse lines and/or a plurality of vertical lines in the binarized picture according to the corrosion operation result and the expansion operation result.
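As an illustration of the corrosion (erosion) and expansion (dilation) steps, the following plain-Python sketch applies a one-pixel-high structural element of length `s` along each row, then does the same along each column. It assumes that foreground pixels are 255 and that corrosion is followed by expansion (a morphological opening), which is the usual way to keep only runs at least as long as the structural element; the text does not fix the order, and all function names are hypothetical.

```python
def erode_row(row, s):
    """Keep position x only if the s pixels starting at x are all foreground."""
    n = len(row)
    return [255 if x + s <= n and all(v == 255 for v in row[x:x + s]) else 0
            for x in range(n)]


def dilate_row(row, s):
    """Paint s pixels starting at every surviving position."""
    n = len(row)
    out = [0] * n
    for x, v in enumerate(row):
        if v == 255:
            for i in range(x, min(n, x + s)):
                out[i] = 255
    return out


def transpose(img):
    return [list(col) for col in zip(*img)]


def open_rows(img, s):
    """Corrosion followed by expansion along every row."""
    return [dilate_row(erode_row(row, s), s) for row in img]


def find_lines(binary, s_h, s_v):
    """Return (transverse-line mask, vertical-line mask)."""
    horizontal = open_rows(binary, s_h)
    vertical = transpose(open_rows(transpose(binary), s_v))
    return horizontal, vertical


# Hypothetical 5x8 picture: one full-width line, one full-height line, one short stroke.
picture = [
    [0, 0, 0, 0, 0, 0, 255, 0],
    [255] * 8,
    [0, 255, 255, 0, 0, 0, 255, 0],
    [0, 0, 0, 0, 0, 0, 255, 0],
    [0, 0, 0, 0, 0, 0, 255, 0],
]
h_lines, v_lines = find_lines(picture, s_h=5, s_v=4)
```

The short two-pixel stroke in the third row is shorter than either structural element, so it appears in neither mask.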

As an optional implementation manner, the calculation formula for determining the structural element according to the number of rows and the number of columns of the binarized picture is as follows:

S = COLS // SCALE, or S = ROWS // SCALE;

wherein COLS represents the number of columns of the binarized picture, ROWS represents the number of rows of the binarized picture, and // denotes integer division, in which the remainder is discarded;

and SCALE = COLS // d_col or SCALE = ROWS // d_row, wherein d_col represents the column pitch of the binarized picture and d_row represents the row pitch of the binarized picture.
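The two integer divisions can be checked with a short worked example. The sketch below assumes that S is used as the length of the structural element, with COLS and d_col applying to transverse lines and ROWS and d_row to vertical lines; the function name and the sample sizes are hypothetical.

```python
def structuring_length(size, pitch):
    """Compute S = size // SCALE with SCALE = size // pitch.

    size  : COLS (or ROWS) of the binarized picture
    pitch : d_col (or d_row)
    """
    if pitch <= 0 or pitch > size:
        raise ValueError("pitch must be in the range 1..size")
    scale = size // pitch   # SCALE = COLS // d_col (or ROWS // d_row)
    return size // scale    # S = COLS // SCALE (or ROWS // SCALE)


# A 1000-column picture with a column pitch of 40 gives SCALE = 25 and S = 40.
s_horizontal = structuring_length(1000, 40)
# A 768-row picture with a row pitch of 50 gives SCALE = 15 and S = 51.
s_vertical = structuring_length(768, 50)
```

Because of the double rounding, S stays close to the pitch but adapts to the picture size, so the length of the structural element no longer has to be tuned by hand for each picture.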

As an optional implementation manner, the calculation formula for calculating the gray value of each pixel point according to the RGB value of each pixel point of the picture to be processed is as follows:

H = 0.3*R + 0.59*G + 0.11*B;

wherein H represents the gray value of the pixel point, and R, G and B are respectively the R value, the G value and the B value of the RGB value of each pixel point.
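A minimal sketch of the gray value formula, assuming the picture is given as rows of (R, G, B) tuples with 8-bit channels and that the result is rounded to the nearest integer (the rounding is an assumption, as is the name `to_gray`):

```python
def to_gray(rgb_image):
    """Apply H = 0.3*R + 0.59*G + 0.11*B to every pixel point."""
    return [[int(round(0.3 * r + 0.59 * g + 0.11 * b)) for (r, g, b) in row]
            for row in rgb_image]


# Hypothetical 1x3 picture: white, black and a dark bluish pixel.
gray = to_gray([[(255, 255, 255), (0, 0, 0), (10, 20, 30)]])
```

The three weights sum to 1, so a white pixel stays at 255 and a black pixel stays at 0.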

As an optional implementation manner, after calculating the gray value of each pixel point according to the RGB value of each pixel point of the picture to be processed, and before sequentially comparing the gray value of each pixel point with a preset threshold value and converting the gray value of each pixel point to 0 or 255 according to the comparison result so as to convert the picture to be processed into a binarized picture, the method further includes:

calculating the preset threshold according to the calculation formula T = B - C;

wherein T represents the preset threshold, B represents a weighted average of pixels in the r×r region around the pixel, and C represents a difference value of pixels in the r×r region around the pixel.
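The local threshold T = B - C can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: B is taken as an unweighted mean (a special case of a weighted average) over the r×r region, the region is clipped at the picture border, C is passed in as a fixed offset, and a pixel darker than T becomes 255 so that ink ends up as foreground. The name `adaptive_binarize` and the default values are hypothetical.

```python
def adaptive_binarize(gray, r=3, c=10):
    """Binarize with a per-pixel threshold T = B - C."""
    rows, cols = len(gray), len(gray[0])
    half = r // 2
    out = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            # Collect the r x r neighbourhood, clipped at the border.
            window = [gray[j][i]
                      for j in range(max(0, y - half), min(rows, y + half + 1))
                      for i in range(max(0, x - half), min(cols, x + half + 1))]
            b = sum(window) / len(window)   # B: local mean
            t = b - c                       # T = B - C
            # Assumed polarity: darker than the threshold -> foreground (255).
            out[y][x] = 255 if gray[y][x] < t else 0
    return out


# Hypothetical 5x5 gray picture: light paper with one dark pixel in the centre.
paper = [[200] * 5 for _ in range(5)]
paper[2][2] = 20
binary = adaptive_binarize(paper, r=3, c=10)
```

Because the threshold follows the local brightness, uneven lighting across the page has less influence than it would with a single global threshold.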

As an optional implementation manner, the deleting the table of the binary image and retaining the text content in the binary image includes:

and subtracting the table of the binarized picture from the binarized picture to obtain the text content in the binarized picture, wherein the binarized picture is the minuend and the table of the binarized picture is the subtrahend.
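The subtraction can be sketched pixel by pixel. The example assumes both pictures have the same size, use 255 for foreground, and that negative differences are clamped to 0; the clamping and the name `remove_table` are assumptions.

```python
def remove_table(binary, table):
    """Subtract the table (subtrahend) from the binarized picture (minuend)."""
    return [[max(p - t, 0) for p, t in zip(brow, trow)]
            for brow, trow in zip(binary, table)]


# Hypothetical 2x3 example: the first row is a table line, the centre pixel is text.
binary = [[255, 255, 255],
          [0, 255, 0]]
table = [[255, 255, 255],
         [0, 0, 0]]
text_only = remove_table(binary, table)
```

Only pixels that belong to the binarized picture but not to the table remain, which is the text content.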

The second aspect of the application discloses an apparatus for extracting text of a printed body picture with a table, which comprises:

an acquisition module, which is used for acquiring a picture to be processed, wherein the picture to be processed comprises a table;

the gray processing module is used for calculating the gray value of each pixel point of the picture to be processed according to the RGB value of each pixel point;

the binarization processing module is used for comparing the gray value of each pixel point with a preset threshold value in sequence and converting the gray value of each pixel point into 0 or 255 according to a comparison result so as to convert the picture to be processed into a binarization picture;

the identification module is used for identifying a plurality of transverse lines and/or a plurality of vertical lines in the binarization picture according to the structural elements, the corrosion algorithm and the expansion algorithm;

the calculating module is used for calculating the horizontal projection integral of each transverse line and/or the horizontal projection integral of each vertical line;

the screening module is used for removing transverse lines with projection integral smaller than a first preset interval threshold value and/or vertical lines with projection integral smaller than a second preset interval threshold value from the transverse lines and/or the vertical lines and obtaining a table of the binarized picture;

and the deleting module is used for deleting the table of the binarized picture and reserving the text content in the binarized picture.

The device of the second aspect of the present application can remove the interference lines in the form pictures extracted from the pictures by executing the method of the first aspect of the present application, so that the text of the pictures can be accurately extracted based on the form with the interference lines removed, and compared with the prior art, the device has better recognition accuracy.

The third aspect of the application discloses a device for extracting text of a printed body picture with a table, comprising:

a processor; and

a memory configured to store machine-readable instructions that when executed by the processor perform the tabulated print body picture text extraction method disclosed in the first aspect of the present application.

The device of the third aspect of the present application can remove the interference lines in the form pictures extracted from the pictures by executing the method of the first aspect of the present application, so that the text of the pictures can be accurately extracted based on the form with the interference lines removed, and compared with the prior art, the device has better recognition accuracy.

A fourth aspect of the present application discloses a storage medium storing a computer program which, when executed by a processor, performs the tabular print body picture text extraction method disclosed in the first aspect of the present application.

The storage medium of the fourth aspect of the present application can remove the interference lines in the form picture extracted from the picture by executing the method of the first aspect of the present application, so that the text of the picture can be accurately extracted based on the form from which the interference lines are removed, and the method has better recognition accuracy compared with the prior art.

Drawings

In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should not be regarded as limiting its scope; a person skilled in the art can obtain other related drawings from these drawings without inventive effort.

Fig. 1 is a schematic flow chart of a method for extracting text of a printed body picture with a table according to an embodiment of the present application;

Fig. 2 is a schematic structural diagram of an apparatus for extracting text of a printed body picture with a table according to a second embodiment of the present application;

Fig. 3 is a schematic structural diagram of a device for extracting text of a printed body picture with a table according to a third embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application.

It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.

Example 1

Referring to fig. 1, fig. 1 is a flow chart of a method for extracting text of a printed body picture with a table according to an embodiment of the application. As shown in fig. 1, the method comprises the steps of:

101. acquiring a picture to be processed, wherein the picture to be processed comprises a table;

102. calculating the gray value of each pixel point according to the RGB value of each pixel point of the picture to be processed;

103. sequentially comparing the gray value of each pixel point with a preset threshold value, and converting the gray value of each pixel point into 0 or 255 according to a comparison result so as to convert the picture to be processed into a binarized picture;

104. identifying a plurality of transverse lines and/or a plurality of vertical lines in the binarized picture according to the structural elements, the corrosion algorithm and the expansion algorithm;

105. calculating the horizontal projection integral of each horizontal line and/or the horizontal projection integral of each vertical line;

106. removing transverse lines and/or vertical lines with projection integral smaller than a first preset interval threshold value from the transverse lines and/or the vertical lines to obtain a table of a binarized picture;

107. and deleting the table of the binarized picture and reserving the text content in the binarized picture.

The method provided by the embodiment of the application can remove the interference lines in the table pictures extracted from the pictures, so that the characters of the pictures can be accurately extracted based on the table with the interference lines removed, and compared with the prior art, the method has better recognition accuracy.
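Steps 105 and 106 can be sketched as follows, assuming that the candidate lines are one pixel thick, that every run of consecutive foreground pixels in the line masks is one candidate line, and that its projection integral is therefore the run length. These are simplifying assumptions, and the names `drop_short_runs` and `build_table` are hypothetical.

```python
def drop_short_runs(row, threshold):
    """Erase every foreground run whose length is below the threshold."""
    n = len(row)
    out = [0] * n
    x = 0
    while x < n:
        if row[x] == 255:
            start = x
            while x < n and row[x] == 255:
                x += 1
            if x - start >= threshold:          # projection integral of the run
                out[start:x] = [255] * (x - start)
        else:
            x += 1
    return out


def build_table(h_lines, v_lines, first_threshold, second_threshold):
    """Filter transverse and vertical candidates, then merge them into the table."""
    kept_h = [drop_short_runs(row, first_threshold) for row in h_lines]
    columns = [list(col) for col in zip(*v_lines)]
    kept_columns = [drop_short_runs(col, second_threshold) for col in columns]
    kept_v = [list(row) for row in zip(*kept_columns)]
    return [[255 if a == 255 or b == 255 else 0 for a, b in zip(ra, rb)]
            for ra, rb in zip(kept_h, kept_v)]


# Hypothetical masks: one real transverse line, one interference short line,
# and one real vertical line in the last column.
h_lines = [[255] * 6,
           [255, 255, 0, 0, 0, 0],
           [0] * 6]
v_lines = [[0, 0, 0, 0, 0, 255],
           [0, 0, 0, 0, 0, 255],
           [0, 0, 0, 0, 0, 255]]
table = build_table(h_lines, v_lines, first_threshold=3, second_threshold=3)
```

The two-pixel run in the second row falls below the first threshold and is discarded as an interference line, while both long lines are kept and merged into the table.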

As an alternative embodiment, after calculating the horizontal projection integral of each horizontal line and/or the horizontal projection integral of each vertical line, before removing the horizontal lines with the projection integral smaller than the first preset interval threshold and/or the vertical lines smaller than the second preset interval threshold from the horizontal lines and/or the vertical lines and obtaining the table of the binarized picture, the method further includes:

acquiring the horizontal projection integral of the text content of the binarized picture in the row direction and the horizontal projection integral of the text content in the column direction;

taking the magnitude of the horizontal projection integral of the text content in the row direction as a first preset interval threshold;

and taking the horizontal projection integral of the text content in the column direction as a second preset interval threshold value.

As an alternative embodiment, identifying a number of horizontal lines and/or a number of vertical lines in the binary image according to the structural elements, the corrosion algorithm, and the expansion algorithm, comprises:

determining structural elements according to the number of rows and the number of columns of the binarized picture;

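carrying out corrosion operation on the structural elements and each pixel in the binarized picture to obtain a corrosion operation result;

performing expansion operation on the structural elements and each pixel in the binarized picture to obtain an expansion operation result;

and identifying a plurality of transverse lines and/or a plurality of vertical lines in the binarized picture according to the corrosion operation result and the expansion operation result.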

As an alternative implementation manner, the calculation formula for determining the structural element according to the number of rows and the number of columns of the binarized picture is as follows:

S = COLS // SCALE, or S = ROWS // SCALE;

wherein COLS represents the number of columns of the binarized picture, ROWS represents the number of rows of the binarized picture, and // denotes integer division, in which the remainder is discarded;

and SCALE = COLS // d_col or SCALE = ROWS // d_row, where d_col represents the column pitch of the binarized picture and d_row represents the row pitch of the binarized picture.

As an alternative implementation manner, the calculation formula for calculating the gray value of each pixel point according to the RGB value of each pixel point of the picture to be processed is as follows:

H = 0.3*R + 0.59*G + 0.11*B;

wherein H represents the gray value of the pixel point, and R, G and B are respectively the R value, the G value and the B value of the RGB value of each pixel point.

As an alternative embodiment, after calculating the gray value of each pixel point according to the RGB value of each pixel point of the picture to be processed, and before sequentially comparing the gray value of each pixel point with a preset threshold value and converting the gray value of each pixel point to 0 or 255 according to the comparison result so as to convert the picture to be processed into a binarized picture, the method further includes:

calculating the preset threshold according to the calculation formula T = B - C;

wherein T represents a preset threshold, B represents a weighted average of pixels in an r×r region around the pixel point, and C represents a difference value of pixels in the r×r region around the pixel point.

As an optional implementation manner, deleting the table of the binarized picture and retaining the text content in the binarized picture includes:

and subtracting the table of the binarized picture from the binarized picture to obtain the text content in the binarized picture, wherein the binarized picture is the minuend and the table of the binarized picture is the subtrahend.

Example 2

Referring to fig. 2, fig. 2 is a schematic structural diagram of a text extraction device with a form print according to an embodiment of the application. As shown in fig. 2, the apparatus includes:

an obtaining module 201, configured to obtain a to-be-processed picture, where the to-be-processed picture includes a table;

the gray processing module 202 is configured to calculate a gray value of each pixel point according to an RGB value of each pixel point of the image to be processed;

the binarization processing module 203 is configured to compare the gray value of each pixel with a preset threshold in sequence, and convert the gray value of each pixel to 0 or 255 according to the comparison result, so that the picture to be processed is converted into a binarized picture;

the identification module 204 is used for identifying a plurality of transverse lines and/or a plurality of vertical lines in the binarized picture according to the structural elements, the corrosion algorithm and the expansion algorithm;

a calculation module 205, configured to calculate a horizontal projection integral of each horizontal line and/or a horizontal projection integral of each vertical line;

a screening module 206, configured to remove, from the plurality of horizontal lines and/or the plurality of vertical lines, horizontal lines with a projection integral smaller than a first preset interval threshold and/or vertical lines with a projection integral smaller than a second preset interval threshold, and obtain a table of the binarized picture;

the deleting module 207 is configured to delete the table of the binarized picture and retain text content in the binarized picture.

The device of the embodiment of the application can remove the interference lines in the table pictures extracted from the pictures by executing the method disclosed by the embodiment of the application, so that the characters of the pictures can be accurately extracted based on the table with the interference lines removed, and compared with the prior art, the device has better recognition accuracy.
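The way modules 201 to 207 hand their results to one another can be sketched as a small class. Only the order of the modules comes from the description above; the class name, field names and call signatures are hypothetical, and each module is injected as a plain callable so that the sketch stays independent of any particular image library.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class TableTextExtractor:
    acquire: Callable[..., Any]     # obtaining module 201
    to_gray: Callable[..., Any]     # gray processing module 202
    binarize: Callable[..., Any]    # binarization processing module 203
    identify: Callable[..., Any]    # identification module 204
    calculate: Callable[..., Any]   # calculation module 205
    screen: Callable[..., Any]      # screening module 206
    delete: Callable[..., Any]      # deleting module 207

    def run(self, source):
        picture = self.acquire(source)
        gray = self.to_gray(picture)
        binary = self.binarize(gray)
        h_lines, v_lines = self.identify(binary)
        integrals = self.calculate(h_lines, v_lines)
        table = self.screen(h_lines, v_lines, integrals)
        return self.delete(binary, table)


# Stand-in modules that only record the order in which they are called.
calls = []

def stage(name, result):
    def module(*args):
        calls.append(name)
        return result
    return module

extractor = TableTextExtractor(
    stage("acquire", "picture"), stage("gray", "gray"), stage("binarize", "binary"),
    stage("identify", ("h", "v")), stage("calculate", "integrals"),
    stage("screen", "table"), stage("delete", "text"),
)
result = extractor.run("input.png")
```

Keeping each module behind a callable matches the statement that the modules may exist alone or be integrated into a single part.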

Example 3

Referring to fig. 3, fig. 3 is a schematic structural diagram of a text extraction device with a form print according to an embodiment of the present application. As shown in fig. 3, the apparatus includes:

a processor 302; and

a memory 301 configured to store machine-readable instructions that, when executed by the processor 302, perform the method for extracting text of a printed body picture with a table of Example 1 of the present application.

The device of the embodiment of the application can remove the interference lines in the table pictures extracted from the pictures by executing the method of Example 1 of the present application, so that the characters of the pictures can be accurately extracted based on the table with the interference lines removed, and compared with the prior art, the device has better recognition accuracy.

Example 4

The embodiment of the application discloses a storage medium which stores a computer program, and when the computer program is executed by a processor, the method for extracting the characters of the printed body picture with the table disclosed in the first aspect of the application is executed.

The storage medium of the embodiment of the application can remove the interference lines in the table pictures extracted from the pictures by executing the method disclosed by the embodiment I of the application, so that the characters of the pictures can be accurately extracted based on the table with the interference lines removed, and compared with the prior art, the storage medium has better recognition accuracy.

In the several embodiments disclosed herein, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, for example, of the flowcharts and block diagrams in the figures that illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored on a computer readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a positioning base station, or a network device, etc.) to perform all or part of the steps of the method of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

The above embodiments of the present application are only examples, and are not intended to limit the scope of the present application, and various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application. It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.

The foregoing is merely illustrative embodiments of the present application, but the scope of the present application is not limited thereto. Any variation or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises an element.

Claims

1. A method for extracting text of a tabular print, the method comprising:

the identifying a plurality of horizontal lines and/or a plurality of vertical lines in the binarized picture according to structural elements, a corrosion algorithm and an expansion algorithm comprises:

identifying a plurality of transverse lines and/or a plurality of vertical lines in the binarization picture according to the corrosion operation result and the expansion operation result;

acquiring the horizontal projection integral of the text content of the binarized picture in the row direction and the horizontal projection integral of the text content in the column direction; taking the magnitude of the horizontal projection integral of the text content in the row direction as a first preset interval threshold; taking the horizontal projection integral of the text content in the column direction as a second preset interval threshold;

removing transverse lines and/or vertical lines with projection integral smaller than the first preset interval threshold value from the transverse lines and/or the vertical lines and obtaining a table of the binarized picture;

2. The method of claim 1, wherein the calculation formula for determining the structural element according to the number of rows and columns of the binarized picture is:

S = COLS // SCALE₁ or S = ROWS // SCALE₂;

SCALE₁ = COLS // d_col or SCALE₂ = ROWS // d_row, where d_col represents the column pitch of the binarized picture and d_row represents the row pitch of the binarized picture.

3. The method of claim 1, wherein the calculating a gray value of each pixel of the picture to be processed according to the RGB value of each pixel is:

H = 0.3*R + 0.59*G + 0.11*B;

4. The method according to claim 1, wherein after said calculating the gray value of each of the pixel points according to the RGB value of each of the pixel points of the picture to be processed, and before said sequentially comparing the gray value of each of the pixel points with a preset threshold value and converting the gray value of each of the pixel points to 0 or 255 according to the comparison result so as to convert the picture to be processed into a binarized picture, the method further comprises:

calculating the preset threshold according to the calculation formula T = P - C;

wherein T represents the preset threshold, P represents a weighted average of pixels in an r×r region around the pixel point, and C represents a difference value of pixels in the r×r region around the pixel point.

5. The method of claim 1, wherein deleting the table of the binary picture and retaining text content in the binary picture comprises:

6. A tabular print text extraction apparatus, the apparatus comprising:

the screening module is used for removing transverse lines with projection integral smaller than the first preset interval threshold and/or vertical lines with projection integral smaller than the second preset interval threshold from the transverse lines and/or the vertical lines and obtaining a table of the binarized picture;

7. A tabulated print body picture text extraction apparatus, the apparatus comprising:

a processor; and

a memory configured to store machine-readable instructions that, when executed by the processor, perform the tabulated print body picture text extraction method of any one of claims 1-5.

8. A storage medium storing a computer program which, when executed by a processor, performs the tabulated print body picture text extraction method of any one of claims 1-5.