CN113343866A - Identification method and device of form information and electronic equipment - Google Patents

Identification method and device of form information and electronic equipment

Info

Publication number
CN113343866A
Authority
CN
China
Prior art keywords
image
information
matrix
binary image
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110660616.1A
Other languages
Chinese (zh)
Inventor
雷卓
Current Assignee
Hangzhou Dt Dream Technology Co Ltd
Original Assignee
Hangzhou Dt Dream Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Dt Dream Technology Co Ltd filed Critical Hangzhou Dt Dream Technology Co Ltd
Priority to CN202110660616.1A priority Critical patent/CN113343866A/en
Publication of CN113343866A publication Critical patent/CN113343866A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/20 Image enhancement or restoration using local operators
    • G06T 5/30 Erosion or dilatation, e.g. thinning
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10004 Still image; Photographic image
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20021 Dividing image into blocks, subimages or windows

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Character Discrimination (AREA)
  • Character Input (AREA)

Abstract

Embodiments of this application provide a method, an apparatus, and an electronic device for identifying table information. The method includes: identifying all table cells in a table image using a character recognition algorithm; generating a table layout matrix based on the position information of the table cells; mapping the table layout matrix onto the table image and obtaining the sub-image delimited by each mapped table cell; and recognizing the text information in each sub-image using an optical character recognition algorithm.

Description

Identification method and device of form information and electronic equipment
Technical Field
Embodiments of this application relate to the field of computer technology, and in particular to a method and an apparatus for identifying table information, and an electronic device.
Background
As the concepts of energy saving and environmental protection take hold, paper documents are gradually being replaced by electronic ones. Existing paper documents are converted into electronic documents by technical means; for example, a paper document is scanned by a scanning device to generate a picture-type electronic document.
Because a picture-type electronic document cannot be imported into an existing database directly, valid data must first be extracted from the picture by image recognition and then stored in the database.
However, picture-type electronic documents, and in particular picture-type table documents (hereinafter, table images), suffer from the problem that table information cannot be recognized effectively.
A table information identification scheme with high accuracy is therefore desirable.
Disclosure of Invention
The embodiment of the specification provides a method and a device for identifying table information, and an electronic device:
according to a first aspect of embodiments of the present specification, there is provided a method for identifying table information, the method including:
identifying all table cells in the table image by using a character identification algorithm;
generating a table layout matrix based on the position information of the table cells;
mapping the table layout matrix to the table image, and acquiring a sub-image defined by each table cell after being mapped to the table image;
and recognizing the text information in each sub-image by using an optical character recognition algorithm.
According to a second aspect of embodiments of the present specification, there is provided an apparatus for identifying table information, the apparatus including:
a structure identification unit, which identifies all table cells in the table image using a character recognition algorithm;
a matrix generation unit, which generates a table layout matrix based on the position information of the table cells;
an image acquisition unit, which maps the table layout matrix onto the table image and obtains the sub-image delimited by each mapped table cell;
and a text recognition unit, which recognizes the text information in each sub-image using an optical character recognition algorithm.
According to a third aspect of embodiments herein, there is provided an electronic apparatus including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform any one of the above methods for identifying table information.
Embodiments of this specification provide a scheme for identifying table information: table cells in a table image are identified with a character recognition algorithm originally intended for text detection; a table layout matrix built from those cells is mapped onto the table image to obtain the sub-image delimited by each table cell; finally, the text in each sub-image is recognized with a computationally inexpensive optical character recognition technique. Because the table cells are treated as independent units and a separate sub-image is cropped and recognized for each one, the mutual interference between adjacent cells is overcome and the accuracy of table information recognition improves; moreover, the total computation is smaller than recognizing text over the whole table image at once.
Drawings
Fig. 1 is a flowchart of a method for identifying table information according to an embodiment of the present disclosure;
FIG. 2a is a schematic diagram of a form image provided in one embodiment of the present description;
fig. 2b is a schematic diagram of a binary image provided in an embodiment of the present specification;
fig. 2c is a schematic diagram of a binary image with text removed according to an embodiment of the present disclosure;
fig. 2d is a schematic diagram of a dilated binary image after dilation transformation provided in an embodiment of the present specification;
FIG. 3 is a diagram illustrating a process for identifying rows and columns of a table according to an embodiment of the present disclosure;
fig. 4 is a hardware configuration diagram of an apparatus for identifying table information provided in an embodiment of the present specification;
fig. 5 is a block diagram of an apparatus for identifying table information according to an embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the specification, as detailed in the appended claims.
The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the description. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, etc. may be used herein to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as second information, and similarly the second information may also be referred to as first information, without departing from the scope of this specification. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
In the related art, when recognizing a table image, OCR (Optical Character Recognition) is generally applied directly to the entire table image. However, the table image contains not only text but also the borders of the table cells; the borders interfere with recognition of the text inside the cells, and adjacent cells interfere with each other, so recognition accuracy is low. The related art therefore cannot identify table information accurately, let alone output structured table data.
The following description will be made by taking a table information identification method shown in fig. 1 as an example. The method can be applied to a device side for performing table identification, and the device can be a client side or a server side. The method may comprise the steps of:
step 110: all the table cells in the table image are identified using a character recognition algorithm.
In this embodiment, a conventional character recognition algorithm is repurposed for recognizing the borders of the table cells in the table image. In practice, step 110 may be implemented with character recognition algorithms in OpenCV.
OpenCV is a BSD-licensed (open source) cross-platform computer vision and machine learning library that runs on the Linux, Windows, Android, and Mac OS operating systems. It is lightweight and efficient, consists of a series of C functions and a small number of C++ classes, provides interfaces for languages such as Python, Ruby, and MATLAB, and implements many general-purpose algorithms for image processing and computer vision.
Specifically, step 110 may include the following A1-A4:
step A1: and carrying out binarization processing on the table image by using a binarization algorithm to obtain a binary image.
The Binarization processing is also called Image Binarization, and is a process of setting the gray value of a pixel point on an Image to be 0 (black) or 255 (white), that is, the whole Image presents an obvious black-and-white effect. By the binarization processing, the data amount in the original table image can be reduced, and the outline of each table cell in the table image can be highlighted.
Specifically, the input form image may be subjected to binarization processing based on a binarization algorithm (such as threshold) in OpenCV, and a binary image may be output.
The binarization algorithm can traverse the pixel value of each pixel point in the table image, if the pixel value is greater than a preset pixel threshold value, the pixel point is set to be a gray value of 255, and if the pixel value is less than the preset pixel threshold value, the pixel point is set to be a gray value of 0.
Reference is now made to the schematic diagram of the form image shown in fig. 2a, and the schematic diagram of the binary image shown in fig. 2 b. Comparing the change process of the table image in fig. 2a with the change process of the binary image in fig. 2b, it can be seen that the pixel values of the table cells and the table characters in the table image are changed from 0 to 255, and the pixel values of the blank regions in the table image are changed from 255 to 0.
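As a rough illustration, the traversal described above can be sketched in NumPy (the toy image, the threshold of 127, and the inverse mapping of dark ink to white are assumptions for illustration; OpenCV's threshold with THRESH_BINARY_INV computes the same mapping):

```python
import numpy as np

# Toy grayscale "table image": dark ink (value 40) on a light page (value 220).
img = np.full((8, 8), 220, dtype=np.uint8)
img[2:6, 2:6] = 40  # a dark table region

# Inverse binarization: pixels above the threshold (background) become 0,
# pixels at or below it (ink) become 255, i.e. the black/white flip the
# text observes between fig. 2a and fig. 2b.
threshold = 127
binary = np.where(img > threshold, 0, 255).astype(np.uint8)

print(sorted(np.unique(binary).tolist()))  # [0, 255]
```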
Step A2: remove the text information from the binary image with a smearing (inpainting) transform.
Specifically: determine the regions of the binary image where the text information is located, and replace the pixel values of those regions with the pixel values of the adjacent regions.
Based on a smearing transform in OpenCV, the positions of the characters in the table are found and temporarily painted over. Erasing the characters from the binary image prevents them from interfering with the recognition of the table cell borders.
Referring now to the schematic diagram of fig. 2c of the binary image with text removed, the binary image in fig. 2c has removed text in the table compared to fig. 2 b.
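A drastically simplified sketch of this neighbour-replacement idea (the paint_over helper, its left-neighbour fill rule, and the toy mask are all illustrative assumptions; OpenCV's actual inpainting is far more sophisticated):

```python
import numpy as np

def paint_over(img, text_mask):
    """Replace pixels under text_mask with the nearest non-text pixel value
    to their left in the same row: a one-directional stand-in for the
    neighbour-region replacement described above."""
    out = img.copy()
    h, w = out.shape
    for y in range(h):
        for x in range(w):
            if text_mask[y, x] and x > 0:
                out[y, x] = out[y, x - 1]
    return out

# Binary table fragment: a vertical border line plus a stray "character" pixel.
img = np.zeros((5, 6), dtype=np.uint8)
img[:, 0] = 255          # a vertical border line
img[2, 3] = 255          # a leftover character pixel inside the cell
mask = np.zeros_like(img, dtype=bool)
mask[2, 3] = True        # region where text was located

clean = paint_over(img, mask)
print(int(clean[2, 3]))  # 0: the character pixel now matches its surroundings
```

The border line is untouched, so the subsequent line detection sees only the cell frame.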
In an embodiment, before step a2, the method further includes:
optimizing the binary image with a Gaussian blur algorithm and/or an adaptive threshold algorithm to obtain an optimized binary image;
correspondingly, in the step a2, removing text information in the binary image includes:
and removing the text information in the optimized binary image.
In this embodiment, the binary image can be optimized with the Gaussian blur (GaussianBlur) and adaptive threshold (adaptiveThreshold) algorithms in OpenCV. Optimization further reduces the data volume of the binary image and corrects abnormal binarization results.
Gaussian blur is a linear smoothing filter that is effective at removing Gaussian noise from an image.
Adaptive thresholding dynamically adjusts the pixel threshold during binarization to improve its accuracy.
Fixed-threshold binarization (threshold) has an inherent limitation: if an image contains regions of clearly different brightness, a darker region may turn entirely black after binarization, losing all detail. The adaptive threshold algorithm solves this problem.
The adaptiveThreshold method is described below:
adaptiveThreshold(src, maxValue, adaptiveMethod, thresholdType, blockSize, C, dst=None). The principle of the method is to take an N×N region centered on each pixel of the image and compute a threshold over that region, which determines whether the pixel becomes 0 or 255.
In the method, src is the input image;
maxValue is the gray value assigned to pixels that satisfy the condition;
adaptiveMethod is the adaptive thresholding algorithm: either ADAPTIVE_THRESH_MEAN_C (the threshold is the mean of the local neighborhood block) or ADAPTIVE_THRESH_GAUSSIAN_C (the threshold is a Gaussian-weighted sum of the local neighborhood block, in which the pixels around (x, y) are weighted by their distance from the center point according to a Gaussian function);
thresholdType is the binarization type in OpenCV: THRESH_BINARY or THRESH_BINARY_INV;
blockSize is the size of the region centered on each pixel, that is, the value of N, and is generally odd; the larger blockSize is, the larger the region participating in the threshold computation, so fine contours diminish and the overall contour becomes thicker and more prominent;
C is a constant that is subtracted from the threshold computed for each region to give that region's final threshold; it may be negative. The larger C is, the smaller the threshold computed from each pixel's N×N neighborhood, so the more likely the center pixel is to exceed its threshold and be set to 255, and the more white pixels the output image contains, and vice versa;
dst is the output image.
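The ADAPTIVE_THRESH_MEAN_C branch can be sketched in plain NumPy as follows (the border handling is simplified to a clipped neighbourhood, whereas OpenCV pads by replication; the gradient test image is an assumption for illustration):

```python
import numpy as np

def adaptive_mean_threshold(img, block_size=3, C=0, max_value=255):
    """NumPy sketch of ADAPTIVE_THRESH_MEAN_C with THRESH_BINARY:
    each pixel is compared against (mean of its block_size x block_size
    neighbourhood) - C. Border pixels use the neighbourhood clipped to
    the image, a simplification of OpenCV's replicated padding."""
    h, w = img.shape
    r = block_size // 2
    out = np.zeros_like(img)
    for y in range(h):
        for x in range(w):
            block = img[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1]
            if img[y, x] > block.mean() - C:
                out[y, x] = max_value
    return out

# A gradient image: a single global threshold would wipe out the dark half,
# but the adaptive threshold still compares each pixel with its neighbours.
img = np.tile(np.arange(0, 80, 10, dtype=np.uint8), (4, 1))
out = adaptive_mean_threshold(img, block_size=3, C=0)
print(sorted(np.unique(out).tolist()))  # [0, 255]
```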
Step A3: perform line detection on the binary image with the text removed, using a line detection algorithm, to determine the border lines of the table cells.
The border lines in the binary image are detected with the Hough line detection (HoughLines) algorithm in OpenCV; extending the detected lines makes the borders more prominent.
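For an axis-aligned table frame, the effect of this step can be approximated with a simple projection profile (an illustrative stand-in, not the Hough transform; the find_border_lines helper and the 0.9 fill fraction are assumptions):

```python
import numpy as np

def find_border_lines(binary, min_frac=0.9):
    """Simplified stand-in for Hough line detection on a table image:
    a row (or column) whose fraction of white pixels is at least min_frac
    is reported as a horizontal (or vertical) border line. HoughLines is
    far more general; for axis-aligned frames this projection suffices."""
    white = binary == 255
    h_lines = [y for y in range(binary.shape[0]) if white[y].mean() >= min_frac]
    v_lines = [x for x in range(binary.shape[1]) if white[:, x].mean() >= min_frac]
    return h_lines, v_lines

# A 7x7 frame with one middle row and column divider: a 2x2 table.
img = np.zeros((7, 7), dtype=np.uint8)
img[[0, 3, 6], :] = 255  # horizontal border lines
img[:, [0, 3, 6]] = 255  # vertical border lines

print(find_border_lines(img))  # ([0, 3, 6], [0, 3, 6])
```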
In an embodiment, before step a3, the method further includes:
performing a dilation transform on the binary image with the text removed, to obtain a binary image in which the borders of the table cells are thickened;
performing an erosion transform on the thickened-border binary image, to obtain a binary image with the original border thickness;
correspondingly, in step A3, performing line detection on the binary image with the text removed and determining the border lines of the table cells includes:
performing line detection on the erosion-transformed binary image and determining the border lines of the table cells.
In this embodiment, this can be implemented with the dilation transform (dilate) and erosion transform (erode) in OpenCV.
Referring to the schematic of the dilated binary image in fig. 2d: compared with fig. 2c, the borders of the table cells are thickened, and the thickened borders can cover noise pixels near the original borders, such as residual text pixels that may remain. The erosion transform then restores the borders to their original thickness.
Because the border thickness after erosion is essentially the same as before dilation, one dilation-erosion round trip removes noise pixels near the border lines, so the lines are identified more accurately during the line detection of step A3.
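The dilate-then-erode round trip (morphological closing) can be sketched in NumPy; here it is shown repairing a one-pixel break in a border line, one way this step makes the subsequent line detection more reliable (the 3x3 structuring element and the toy image are assumptions):

```python
import numpy as np

def dilate(binary):
    """One 3x3 dilation step: a pixel becomes white if any pixel in its
    3x3 neighbourhood is white (a NumPy sketch of cv2.dilate)."""
    p = np.pad(binary, 1)
    return np.max([p[dy:dy + binary.shape[0], dx:dx + binary.shape[1]]
                   for dy in range(3) for dx in range(3)], axis=0)

def erode(binary):
    """One 3x3 erosion step: a pixel stays white only if its whole 3x3
    neighbourhood is white (a NumPy sketch of cv2.erode)."""
    p = np.pad(binary, 1)
    return np.min([p[dy:dy + binary.shape[0], dx:dx + binary.shape[1]]
                   for dy in range(3) for dx in range(3)], axis=0)

# A 1-pixel border line with a 1-pixel gap left over from text removal.
img = np.zeros((7, 9), dtype=np.uint8)
img[3, :] = 255    # horizontal border line
img[3, 4] = 0      # small break in the line

opened = erode(dilate(img))  # dilate then erode, as the text describes
print(int(opened[3, 4]), int(opened[2, 4]))  # 255 0: gap closed, thickness restored
```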
Step A4: determine the position information of all table cells from the border lines with a contour detection algorithm; the position information is the coordinate information of the four border lines of a table cell.
Based on the contour detection (findContours) algorithm in OpenCV, the position information of all table cells is determined from the border lines.
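What contour detection plus bounding rectangles yield for a binary table image can be sketched with a simple 4-connected component search (the bounding_boxes helper is an illustrative substitute; OpenCV's findContours traces actual contours and is typically paired with boundingRect to get (x, y, w, h)):

```python
import numpy as np
from collections import deque

def bounding_boxes(binary):
    """Sketch of the result of contour detection on a table image:
    each 4-connected component of white pixels is reported as a
    bounding box (x, y, w, h)."""
    h, w = binary.shape
    seen = np.zeros((h, w), dtype=bool)
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if binary[sy, sx] != 255 or seen[sy, sx]:
                continue
            q = deque([(sy, sx)])     # breadth-first flood fill
            seen[sy, sx] = True
            ys, xs = [sy], [sx]
            while q:
                y, x = q.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and binary[ny, nx] == 255 and not seen[ny, nx]:
                        seen[ny, nx] = True
                        ys.append(ny); xs.append(nx)
                        q.append((ny, nx))
            boxes.append((min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1))
    return boxes

# Two separate white cell regions (filled blocks here, for brevity).
img = np.zeros((6, 10), dtype=np.uint8)
img[1:3, 1:4] = 255
img[3:5, 5:9] = 255
print(sorted(bounding_boxes(img)))  # [(1, 1, 3, 2), (5, 3, 4, 2)]
```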
Step 120: generate a table layout matrix based on the position information of the table cells.
Step 130: map the table layout matrix onto the table image, and obtain the sub-image delimited by each mapped table cell.
A table layout matrix matching the cell layout is generated from the coordinate information of the four border lines of each table cell identified in step 110.
The table layout matrix is then mapped onto the table image, so that the sub-image delimited by each mapped table cell can be cropped out.
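Once cell coordinates are known, mapping the layout matrix onto the image and cutting out the sub-images reduces to array slicing (the layout structure and the (x, y, w, h) convention below are assumptions for illustration):

```python
import numpy as np

# Hypothetical table layout matrix: one entry per cell, each holding the
# cell's border-line coordinates as (x, y, w, h) in table-image pixels.
layout = [[(0, 0, 5, 3), (5, 0, 5, 3)],
          [(0, 3, 5, 3), (5, 3, 5, 3)]]   # a 2x2 table, 5x3 px cells

table_img = np.arange(60, dtype=np.uint8).reshape(6, 10)  # stand-in image

# Mapping each cell back onto the image and cutting out its sub-image
# is one array slice per cell.
sub_images = [[table_img[y:y + h, x:x + w] for (x, y, w, h) in row]
              for row in layout]

print(sub_images[1][1].shape)  # (3, 5): the bottom-right cell's sub-image
```

Each sub-image can then be handed to the OCR step independently.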
Step 140: recognize the text information in each sub-image with an optical character recognition algorithm.
The text in each sub-image is recognized with a computationally inexpensive optical character recognition technique. Because the table cells are treated as independent units and a separate sub-image is cropped and recognized for each one, the mutual interference between adjacent cells is overcome and the accuracy of table information recognition improves; moreover, the total computation is smaller than recognizing text over the whole table image at once.
In an embodiment, before step 120, the method may further include:
identifying the position information of all rows and columns in the table image with an image description model;
Image description (image captioning) is a computer vision technique that automatically generates descriptive text for an input image. In this embodiment, the technique is applied to recognizing the row-and-column structure of a table image: given a table image as input, a sequence of hypertext markup language (HTML) tags is generated to represent the arrangement of rows and columns in the table.
In the schematic of the table row/column identification process shown in fig. 3, the image description model takes the table image on the left as input and computes the HTML tag sequence on the right as output.
In an embodiment, the identifying the position information of all rows and columns in the table image by using the image description model may include:
extracting the image features of the table image with a feature extraction algorithm in the image description model;
encoding a sequence of hypertext markup language tags for the table rows and columns in the table image with the encoder of the image description model;
and computing over the image features and the hypertext markup language tag sequence with the decoder of the image description model to generate the position information of the row-and-column structure.
The feature extraction algorithm may be a neural network, such as a convolutional neural network (CNN). The encoder, which encodes the HTML tag sequence, may be a gated recurrent unit (GRU); the decoder, which predicts the row-and-column structure from the outputs of the feature extractor and the encoder, may also be a GRU.
The training process of the image description model is as follows:
The original HTML tag sequence can be split into several recognition targets. A training sample consists of a table's HTML tag sequence together with the corresponding table image; the output label is the next token of the HTML sequence. The model uses cross entropy as the loss function, comparing the actual and predicted tokens.
At inference time, the table image is propagated forward through the CNN, and generation begins from a start token. At each step, the generated token is appended to the output sequence and fed back into the model as new input. This repeats until an end token is generated or the maximum sequence length is reached.
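The generate-append-repeat loop described above can be sketched as a greedy decoding loop (the decode_rows_cols helper and the canned stub standing in for the trained model are illustrative assumptions):

```python
def decode_rows_cols(step_fn, start="<table>", end="</table>", max_len=20):
    """Greedy autoregressive decoding: ask the model for the next HTML tag,
    append it to the sequence, and stop at the end token or length cap."""
    seq = [start]
    while len(seq) < max_len:
        nxt = step_fn(seq)   # the model predicts the next tag from the prefix
        seq.append(nxt)
        if nxt == end:
            break
    return seq

# Stub "model": always emits a fixed 1-row, 2-column table, then stops.
canned = ["<tr>", "<td>", "<td>", "</tr>", "</table>"]
stub = lambda seq: canned[len(seq) - 1]

print(decode_rows_cols(stub))
# ['<table>', '<tr>', '<td>', '<td>', '</tr>', '</table>']
```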
In implementation, after the table cells and the rows and columns have been identified, the table layout matrix can be generated from the position information of the table cells together with the position information of the rows and columns.
The cell information and the row/column structure information are fused as follows: a first table layout matrix is generated from the position information of the cells; a second table layout matrix is generated from the position information of the rows and columns; finally, the first and second table layout matrices are fused into the final table layout matrix.
The fusion may be a weighted average, that is, the first and second table layout matrices are averaged with weights to obtain the fused table layout matrix.
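A minimal sketch of the weighted-average fusion (the matrix contents and the equal weights are assumptions; the text specifies only that the two layout matrices are averaged with weights):

```python
import numpy as np

# Hypothetical layout matrices with confidence scores in [0, 1]:
# the first derived from detected cell borders, the second from the
# row/column structure model.
cells_matrix = np.array([[1.0, 0.75],
                         [0.25, 1.0]])
rowcol_matrix = np.array([[1.0, 0.25],
                          [0.75, 1.0]])

w = 0.5  # equal weights, an assumption for illustration
fused = w * cells_matrix + (1 - w) * rowcol_matrix
print(fused.tolist())  # [[1.0, 0.5], [0.5, 1.0]]
```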
As described above, the table layout matrix can then be mapped onto the table image, so that the sub-image delimited by each mapped table cell can be cropped out and the text information in each sub-image recognized with an optical character recognition algorithm.
Through this embodiment, the independence of individual table cells is exploited to overcome the interference between adjacent cells, and the strong integrity of the row-and-column structure is exploited to solve the problem of incompletely identified cells.
After the text in every sub-image has been recognized, structured data containing the recognized text of each sub-image can be output, for example as an excel file or a json file.
The table image in the embodiments of this specification may refer in particular to a bordered (framed) table image.
Corresponding to the embodiments of the method for identifying table information, this specification also provides embodiments of an apparatus for identifying table information. The apparatus embodiments may be implemented by software, by hardware, or by a combination of the two. Taking a software implementation as an example, as a logical device the apparatus is formed by the processor of the device it resides on reading the corresponding computer program instructions from nonvolatile storage into memory and running them. In terms of hardware, fig. 4 shows the hardware structure of the device hosting the apparatus for identifying table information; besides the processor, network interface, memory, and nonvolatile storage shown in fig. 4, the device may include other hardware according to the actual function of identifying table information, which is not described again here.
Referring to fig. 5, a block diagram of an apparatus for identifying table information provided in an embodiment of the present disclosure, the apparatus corresponding to the embodiment shown in fig. 1, the apparatus includes:
a structure identification unit, which identifies all table cells in the table image using a character recognition algorithm;
a matrix generation unit, which generates a table layout matrix based on the position information of the table cells;
an image acquisition unit, which maps the table layout matrix onto the table image and obtains the sub-image delimited by each mapped table cell;
and a text recognition unit, which recognizes the text information in each sub-image using an optical character recognition algorithm.
Optionally, the structure recognition unit includes:
a binarization subunit, which performs binarization processing on the form image by using a binarization algorithm to obtain a binary image;
the smearing subunit removes the character information in the binary image by using a smearing transformation algorithm;
the straight line detection subunit performs straight line detection on the binary image without the character information by using a straight line detection algorithm to determine a frame line of the table cell;
a contour detection subunit which determines the position information of all the table cells based on the frame lines by using a contour detection algorithm; the position information is coordinate information of four frame lines of the table cell.
Optionally, in the smearing subunit, removing the text information in the binary image includes:
and determining the area where the character information in the binary image is located, and replacing the pixel value of the area where the character information is located with the pixel value of the adjacent area.
Optionally, the apparatus further comprises:
the optimization processing subunit is used for optimizing the binary image by utilizing a Gaussian fuzzy algorithm and/or a self-adaptive threshold algorithm to obtain an optimized binary image;
correspondingly, in the smearing subunit, removing the text information in the binary image includes:
and removing the text information in the optimized binary image.
Optionally, the apparatus further comprises:
a dilation subunit, which performs a dilation transform on the binary image with the text removed to obtain a binary image in which the borders of the table cells are thickened;
an erosion subunit, which performs an erosion transform on the thickened-border binary image to obtain a binary image with the original border thickness;
correspondingly, in the line detection subunit, performing line detection on the binary image with the text removed and determining the border lines of the table cells includes:
performing line detection on the erosion-transformed binary image and determining the border lines of the table cells.
Optionally, the structure recognition unit further includes:
identifying the position information of all rows and columns in the table image with an image description model;
the matrix generation unit includes:
and generating a table layout matrix based on the position information of the table cells and the position information of the rows and the columns.
Optionally, the identifying, with an image description model, the position information of all rows and columns in the table image includes:
extracting the image features of the table image with a feature extraction algorithm (e.g. a CNN) in the image description model;
encoding a sequence of hypertext markup language tags for the table rows and columns in the table image with the encoder of the image description model;
and computing over the image features and the hypertext markup language tag sequence with the decoder of the image description model to generate the position information of the row-and-column structure.
Optionally, the generating a table layout matrix based on the position information of the table cells and the position information of the rows and the columns includes:
generating a first matrix of table layout based on the position information of the table cells;
generating a second matrix of table layout based on the position information of the rows and the columns;
and fusing the first matrix of the table layout and the second matrix of the table layout to obtain a final table layout matrix.
Optionally, the fusing the first matrix of table layout and the second matrix of table layout includes:
and carrying out weighted average on the first matrix of the table layout and the second matrix of the table layout.
Optionally, the apparatus further comprises:
and an output unit which outputs the structured data containing the recognized text information in each sub-image.
Optionally, the form image includes a frame form image.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the apparatus embodiments, since they substantially correspond to the method embodiments, reference may be made to the corresponding description of the method embodiments for relevant points. The apparatus embodiments described above are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this specification. A person of ordinary skill in the art can understand and implement the solution without inventive effort.
Fig. 5 above illustrates the internal functional modules and structure of the apparatus for identifying table information. The actual execution subject may be an electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method for identifying table information according to any of the preceding embodiments.
In the above embodiments of the electronic device, it should be understood that the Processor may be a Central Processing Unit (CPU), other general-purpose processors, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, and the aforementioned memory may be a read-only memory (ROM), a Random Access Memory (RAM), a flash memory, a hard disk, or a solid state disk. The steps of a method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in a processor.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the embodiment of the electronic device, since it is substantially similar to the embodiment of the method, the description is simple, and for the relevant points, reference may be made to part of the description of the embodiment of the method.
Other embodiments of the present disclosure will be apparent to those skilled in the art from consideration of the specification and practice of what is disclosed herein. This specification is intended to cover any variations, uses, or adaptations that follow the general principles of the specification and include such departures from the present disclosure as come within known or customary practice in the art to which the specification pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the specification being indicated by the following claims.
It will be understood that the present description is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present description is limited only by the appended claims.

Claims (23)

1. A method for identifying table information, the method comprising:
identifying all table cells in the table image by using a text recognition algorithm;
generating a table layout matrix based on the position information of the table cells;
mapping the table layout matrix onto the table image, and acquiring the sub-image delimited by each table cell after the mapping;
and recognizing the text information in each sub-image by using an optical character recognition algorithm.
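The last two steps of claim 1 (cropping each mapped cell and recognizing its text) can be sketched as follows. This is a minimal sketch with assumed conventions: cell boxes as (x, y, w, h) tuples, and the OCR engine passed in as a plain callable (e.g. pytesseract's `image_to_string`, if installed; not required here).

```python
import numpy as np

def recognize_cells(table_image: np.ndarray, cell_boxes, ocr):
    """Crop the sub-image delimited by each mapped table cell and run OCR.

    table_image: grayscale image as an H x W array
    cell_boxes:  iterable of (x, y, w, h) boxes, i.e. the table layout
                 matrix already mapped back onto the image
    ocr:         any callable mapping a sub-image to a text string
    """
    results = []
    for (x, y, w, h) in cell_boxes:
        sub_image = table_image[y:y + h, x:x + w]  # crop one cell
        results.append({"box": (x, y, w, h), "text": ocr(sub_image)})
    return results
```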
2. The method of claim 1, wherein identifying all table cells in the table image based on a text recognition algorithm comprises:
carrying out binarization processing on the form image by using a binarization algorithm to obtain a binary image;
removing the character information in the binary image by using a smearing transformation algorithm;
performing linear detection on the binary image without the character information by using a linear detection algorithm to determine a frame line of the table cell;
determining the position information of all the table cells based on the frame lines by utilizing a contour detection algorithm; the position information is coordinate information of four frame lines of the table cell.
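A toy rendering of claim 2's binarize-then-detect-lines steps, kept in pure NumPy so it stays self-contained. A real pipeline would more likely use OpenCV's adaptive thresholding, a Hough line transform, and contour detection; the fixed global threshold and the run-length line filter below are illustrative stand-ins for those algorithms.

```python
import numpy as np

def binarize(gray: np.ndarray, thresh: int = 128) -> np.ndarray:
    """Global threshold: ink (table lines, text) becomes 1, background 0.
    The claim leaves the binarization algorithm open; a fixed global
    threshold stands in for it here."""
    return (gray < thresh).astype(np.uint8)

def line_mask(binary: np.ndarray, min_run: int) -> np.ndarray:
    """Keep only pixels belonging to horizontal or vertical runs of at
    least `min_run` foreground pixels -- a crude stand-in for straight-line
    detection (a Hough transform in practice)."""
    h, w = binary.shape
    mask = np.zeros_like(binary)
    for r in range(h):                      # horizontal runs
        run = 0
        for c in range(w):
            run = run + 1 if binary[r, c] else 0
            if run >= min_run:
                mask[r, c - min_run + 1:c + 1] = 1
    for c in range(w):                      # vertical runs
        run = 0
        for r in range(h):
            run = run + 1 if binary[r, c] else 0
            if run >= min_run:
                mask[r - min_run + 1:r + 1, c] = 1
    return mask
```

On the resulting mask, the cell frame lines survive while short marks (text strokes) are filtered out, which is what the subsequent contour detection step relies on.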
3. The method of claim 2, wherein the removing of the text information in the binary image comprises:
determining the region of the character information in the binary image;
and replacing the pixel value of the area where the text information is located with the pixel value of the adjacent area.
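Claim 3's replace-with-neighboring-pixels step can be sketched as below. The box representation and the choice of sampling the column bordering each region are assumptions made for illustration; a production pipeline might use an inpainting routine such as OpenCV's `cv2.inpaint` instead.

```python
import numpy as np

def erase_text_regions(binary: np.ndarray, text_boxes) -> np.ndarray:
    """Blank out detected text regions by copying the pixel values of an
    adjacent (here: bordering) area over them -- the 'smearing' step.

    text_boxes: iterable of (x, y, w, h) regions where text was located.
    """
    out = binary.copy()
    for (x, y, w, h) in text_boxes:
        # sample the column just left of the box (or just right, at the edge)
        src_col = x - 1 if x > 0 else x + w
        out[y:y + h, x:x + w] = out[y:y + h, src_col:src_col + 1]
    return out
```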
4. The method of claim 2, further comprising:
optimizing the binary image by using a Gaussian blur algorithm and/or an adaptive threshold algorithm to obtain an optimized binary image;
correspondingly, the removing of the text information in the binary image includes:
and removing the text information in the optimized binary image.
5. The method of claim 2, further comprising:
performing dilation transformation on the binary image from which the text information has been removed, to obtain a binary image with thickened table cell frames;
performing erosion transformation on the binary image with the thickened frames, to obtain a binary image with the frames restored to their original thickness;
correspondingly, the performing the straight line detection on the binary image without the text information and determining the frame line of the table cell includes:
and performing straight line detection on the erosion-transformed binary image to determine the frame lines of the table cells.
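Claim 5's dilate-then-erode pass (a morphological closing, which reconnects broken frame lines) can be illustrated with hand-rolled NumPy morphology. OpenCV's `cv2.dilate` / `cv2.erode` would be the usual choice in practice; the square structuring element and radius k=1 here are illustrative assumptions.

```python
import numpy as np

def dilate(binary: np.ndarray, k: int = 1) -> np.ndarray:
    """Binary dilation with a (2k+1)x(2k+1) square element: thickens the
    cell frames so that small breaks in the lines are bridged."""
    h, w = binary.shape
    padded = np.pad(binary, k)               # pad with background (0)
    out = np.zeros_like(binary)
    for dy in range(-k, k + 1):
        for dx in range(-k, k + 1):
            out |= padded[k + dy:k + dy + h, k + dx:k + dx + w]
    return out

def erode(binary: np.ndarray, k: int = 1) -> np.ndarray:
    """Binary erosion with the same element: shrinks the thickened frames
    back to roughly their original thickness after dilation."""
    h, w = binary.shape
    padded = np.pad(binary, k, constant_values=1)  # pad with foreground
    out = np.ones_like(binary)
    for dy in range(-k, k + 1):
        for dx in range(-k, k + 1):
            out &= padded[k + dy:k + dy + h, k + dx:k + dx + w]
    return out
```

Applying `erode(dilate(img))` closes one-pixel gaps in a frame line while leaving the line's overall thickness unchanged, which is why the straight-line detection in the claim is run on the erosion-transformed image.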
6. The method of claim 1, further comprising:
identifying the position information of all rows and columns in the table image by using an image description model;
generating a table layout matrix based on the location information of the table cells, including:
and generating a table layout matrix based on the position information of the table cells and the position information of the rows and the columns.
7. The method of claim 6, wherein the identifying the position information of all rows and columns in the table image by using the image description model comprises:
extracting image features in the form image based on a feature extraction algorithm in the image description model;
generating a sequence of hypertext markup language tags for table rows and columns in the table image based on an encoder of the image description model;
and processing the image features and the hypertext markup language tag sequence with the decoder of the image description model to generate position information of the row and column structure.
8. The method of claim 6, wherein generating a table layout matrix based on the location information of the table cells and the location information of the rows and columns comprises:
generating a first matrix of table layout based on the position information of the table cells;
generating a second matrix of table layout based on the position information of the rows and the columns;
and fusing the first matrix of the table layout and the second matrix of the table layout to obtain a final table layout matrix.
9. The method of claim 8, wherein fusing the table layout first matrix and table layout second matrix comprises:
and carrying out weighted average on the first matrix of the table layout and the second matrix of the table layout.
10. The method of claim 1, further comprising:
outputting structured data containing the recognized text information in each sub-image.
11. The method of claim 1, wherein the form image comprises a boxed form image.
12. An apparatus for identifying table information, the apparatus comprising:
the structure identification unit identifies all table cells in the table image by using a text recognition algorithm;
a matrix generation unit that generates a table layout matrix based on the position information of the table cells;
the image acquisition unit maps the table layout matrix onto the table image and acquires the sub-image delimited by each table cell after the mapping;
and the character recognition unit recognizes the character information in each sub-image by using an optical character recognition algorithm.
13. The apparatus of claim 12, wherein the structure recognition unit comprises:
a binarization subunit, which performs binarization processing on the form image by using a binarization algorithm to obtain a binary image;
the smearing subunit removes the character information in the binary image by using a smearing transformation algorithm;
the straight line detection subunit performs straight line detection on the binary image without the character information by using a straight line detection algorithm to determine a frame line of the table cell;
a contour detection subunit which determines the position information of all the table cells based on the frame lines by using a contour detection algorithm; the position information is coordinate information of four frame lines of the table cell.
14. The apparatus according to claim 13, wherein in the smearing subunit, removing the text information in the binary image comprises:
determining the area where the text information is located in the binary image, and replacing the pixel values of that area with the pixel values of an adjacent area.
15. The apparatus of claim 13, further comprising:
the optimization processing subunit optimizes the binary image by using a Gaussian blur algorithm and/or an adaptive threshold algorithm to obtain an optimized binary image;
correspondingly, in the smearing subunit, removing the text information in the binary image includes:
and removing the text information in the optimized binary image.
16. The apparatus of claim 13, further comprising:
the dilation subunit performs dilation transformation on the binary image from which the text information has been removed, to obtain a binary image with thickened table cell frames;
the erosion subunit performs erosion transformation on the binary image with the thickened frames, to obtain a binary image with the frames restored to their original thickness;
correspondingly, in the line detection subunit, performing straight line detection on the binary image without the text information and determining the frame lines of the table cells comprises:
performing straight line detection on the erosion-transformed binary image to determine the frame lines of the table cells.
17. The apparatus of claim 12, wherein the structure recognition unit is further configured to:
identify the position information of all rows and columns in the table image by using an image description model;
the matrix generation unit generates the table layout matrix based on the position information of the table cells and the position information of the rows and the columns.
18. The apparatus of claim 17, wherein the identifying the position information of all rows and columns in the table image by using the image description model comprises:
extracting image features in the form image based on a feature extraction algorithm (CNN) in the image description model;
generating a sequence of hypertext markup language tags for table rows and columns in the table image based on an encoder of the image description model;
and processing the image features and the hypertext markup language tag sequence with the decoder of the image description model to generate position information of the row and column structure.
19. The apparatus of claim 17, wherein generating a table layout matrix based on the location information of the table cells and the location information of the rows and columns comprises:
generating a first matrix of table layout based on the position information of the table cells;
generating a second matrix of table layout based on the position information of the rows and the columns;
and fusing the first matrix of the table layout and the second matrix of the table layout to obtain a final table layout matrix.
20. The apparatus of claim 19, wherein the fusing the table layout first matrix and table layout second matrix comprises:
and carrying out weighted average on the first matrix of the table layout and the second matrix of the table layout.
21. The apparatus of claim 12, further comprising:
and an output unit which outputs the structured data containing the recognized text information in each sub-image.
22. The apparatus of claim 12, wherein the form image comprises a boxed form image.
23. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method of any one of claims 1-11.
CN202110660616.1A 2021-06-15 2021-06-15 Identification method and device of form information and electronic equipment Pending CN113343866A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110660616.1A CN113343866A (en) 2021-06-15 2021-06-15 Identification method and device of form information and electronic equipment


Publications (1)

Publication Number Publication Date
CN113343866A true CN113343866A (en) 2021-09-03

Family

ID=77477141

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110660616.1A Pending CN113343866A (en) 2021-06-15 2021-06-15 Identification method and device of form information and electronic equipment

Country Status (1)

Country Link
CN (1) CN113343866A (en)


Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102467378A (en) * 2010-11-11 2012-05-23 深圳市金蝶友商电子商务服务有限公司 HTML (Hypertext Markup Language) form processing method based on two-dimensional matrix and computer
CN109726628A (en) * 2018-11-05 2019-05-07 东北大学 A kind of recognition methods and system of form image
CN110033471A (en) * 2019-04-19 2019-07-19 福州大学 A kind of wire detection method based on connected domain analysis and morphological operation
CN110363095A (en) * 2019-06-20 2019-10-22 华南农业大学 A kind of recognition methods for table font
US20200042785A1 (en) * 2018-07-31 2020-02-06 International Business Machines Corporation Table Recognition in Portable Document Format Documents
CN111492370A (en) * 2020-03-19 2020-08-04 香港应用科技研究院有限公司 Device and method for recognizing text images of a structured layout
CN111626146A (en) * 2020-05-08 2020-09-04 西安工业大学 Merging cell table segmentation and identification method based on template matching
CN111640130A (en) * 2020-05-29 2020-09-08 深圳壹账通智能科技有限公司 Table reduction method and device
CN112115774A (en) * 2020-08-07 2020-12-22 北京来也网络科技有限公司 Character recognition method and device combining RPA and AI, electronic equipment and storage medium
CN112115884A (en) * 2020-09-22 2020-12-22 北京一览群智数据科技有限责任公司 Form recognition method and system
CN112528813A (en) * 2020-12-03 2021-03-19 上海云从企业发展有限公司 Table recognition method, device and computer readable storage medium
CN112712014A (en) * 2020-12-29 2021-04-27 平安健康保险股份有限公司 Table picture structure analysis method, system, equipment and readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li, M., Cui, L., Huang, S., et al.: "TableBank: A Benchmark Dataset for Table Detection and Recognition", arXiv *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116612487A (en) * 2023-07-21 2023-08-18 亚信科技(南京)有限公司 Table identification method and device, electronic equipment and storage medium
CN116612487B (en) * 2023-07-21 2023-10-13 亚信科技(南京)有限公司 Table identification method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US10552705B2 (en) Character segmentation method, apparatus and electronic device
CN108710866B (en) Chinese character model training method, chinese character recognition method, device, equipment and medium
CN110046529B (en) Two-dimensional code identification method, device and equipment
CN110647829A (en) Bill text recognition method and system
CN110569830A (en) Multi-language text recognition method and device, computer equipment and storage medium
CN110866529A (en) Character recognition method, character recognition device, electronic equipment and storage medium
CN111368638A (en) Spreadsheet creation method and device, computer equipment and storage medium
CN111488732B (en) Method, system and related equipment for detecting deformed keywords
WO2021190155A1 (en) Method and apparatus for identifying spaces in text lines, electronic device and storage medium
CN113486828A (en) Image processing method, device, equipment and storage medium
CN110210480B (en) Character recognition method and device, electronic equipment and computer readable storage medium
CN111414913B (en) Character recognition method, recognition device and electronic equipment
CN115620321B (en) Table identification method and device, electronic equipment and storage medium
CN112464845A (en) Bill recognition method, equipment and computer storage medium
CN112597918A (en) Text detection method and device, electronic equipment and storage medium
CN110533020B (en) Character information identification method and device and storage medium
CN115311666A (en) Image-text recognition method and device, computer equipment and storage medium
CN113673528B (en) Text processing method, text processing device, electronic equipment and readable storage medium
CN113343866A (en) Identification method and device of form information and electronic equipment
CN117496521A (en) Method, system and device for extracting key information of table and readable storage medium
CN113449726A (en) Character comparison and identification method and device
CN116030472A (en) Text coordinate determining method and device
CN115909356A (en) Method and device for determining paragraph of digital document, electronic equipment and storage medium
CN115527215A (en) Image processing method, system and storage medium containing text
CN114187445A (en) Method and device for recognizing text in image, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20210903