CN116958991A - Image recognition method and device, electronic equipment and storage medium - Google Patents

Image recognition method and device, electronic equipment and storage medium

Info

Publication number
CN116958991A
CN116958991A
Authority
CN
China
Prior art keywords
text
text box
image
information
box
Prior art date
Legal status
Pending
Application number
CN202310976322.9A
Other languages
Chinese (zh)
Inventor
黄依国
王欢
周骥
冯歆鹏
Current Assignee
NextVPU Shanghai Co Ltd
Original Assignee
NextVPU Shanghai Co Ltd
Priority date
Filing date
Publication date
Application filed by NextVPU Shanghai Co Ltd filed Critical NextVPU Shanghai Co Ltd
Priority to CN202310976322.9A priority Critical patent/CN116958991A/en
Publication of CN116958991A publication Critical patent/CN116958991A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/14: Image acquisition
    • G06V30/148: Segmentation of character regions
    • G06V30/153: Segmentation of character regions using recognition of characters or words
    • G06V30/40: Document-oriented image-based pattern recognition
    • G06V30/41: Analysis of document content

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Character Input (AREA)

Abstract

Provided are an image recognition method and apparatus, an electronic device, and a storage medium. The image recognition method comprises the following steps: acquiring an image to be processed; performing text detection on the image to be processed to obtain one or more text boxes and first position information of the one or more text boxes; performing text recognition on the one or more text boxes to obtain one or more pieces of text information contained in the one or more text boxes; determining, based on the one or more pieces of text information, whether a table area exists in the image to be processed; and in response to determining that a table area exists in the image to be processed: determining table row and column information for each of at least one first text box located within the table area, based on the first position information of the at least one first text box among the one or more text boxes; and outputting structured data corresponding to the table area of the image to be processed based on the table row and column information of each first text box and the text information contained in each first text box.

Description

Image recognition method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of image recognition, and in particular, to an image recognition method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
Background
In many application scenarios, it is often necessary to obtain the text information contained in an image. One common way is to manually extract and digitize the text information contained in the image. In addition, with the development of artificial intelligence technology, techniques for automatically recognizing images and extracting the text information in them have also developed rapidly, for example, recognizing, extracting, and converting text information into a digitized data format by means of image recognition technology.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, the problems mentioned in this section should not be considered as having been recognized in any prior art unless otherwise indicated.
Disclosure of Invention
According to one aspect of the present disclosure, an image recognition method is provided. The image recognition method comprises the following steps: acquiring an image to be processed; performing text detection on the image to be processed to obtain one or more text boxes and first position information of the one or more text boxes; performing text recognition on the one or more text boxes to obtain one or more pieces of text information contained in the one or more text boxes; determining, based on the one or more pieces of text information, whether a table area exists in the image to be processed; and in response to determining that a table area exists in the image to be processed: determining table row and column information for each of at least one first text box located within the table area, based on the first position information of the at least one first text box among the one or more text boxes; and outputting structured data corresponding to the table area of the image to be processed based on the table row and column information of each first text box and the text information contained in each first text box.
According to another aspect of the present disclosure, an image recognition apparatus is provided. The image recognition apparatus includes: an acquisition module configured to acquire an image to be processed; a text detection module configured to perform text detection on the image to be processed to obtain one or more text boxes and first position information of the one or more text boxes; a text recognition module configured to perform text recognition on the one or more text boxes to obtain one or more pieces of text information contained in the one or more text boxes; a table area determination module configured to determine, based on the one or more pieces of text information, whether a table area exists in the image to be processed; a table information determination module configured to, in response to determining that a table area exists in the image to be processed, determine table row and column information for each of at least one first text box located within the table area, based on the first position information of the at least one first text box among the one or more text boxes; and an output module configured to output structured data corresponding to the table area of the image to be processed based on the table row and column information of each first text box and the text information contained in each first text box.
According to another aspect of the present disclosure, there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory, wherein the processor is configured to execute the computer program to implement the steps of the above-mentioned image recognition method.
According to another aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the above-described image recognition method.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program, wherein the computer program when executed by a processor implements the steps of the above-described image recognition method.
Further features and advantages of the present disclosure will become apparent from the following description of exemplary embodiments, which is to be taken in conjunction with the accompanying drawings.
Drawings
The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.
FIG. 1 is a flowchart illustrating an image recognition method according to an exemplary embodiment;
FIG. 2 is a flowchart illustrating another image recognition method according to an exemplary embodiment;
FIG. 3 is a flowchart illustrating preprocessing of an image to be processed according to an exemplary embodiment;
FIG. 4 is a flowchart illustrating determining table row and column information for text boxes within a table area according to an exemplary embodiment;
FIG. 5 is a flowchart illustrating yet another image recognition method according to an exemplary embodiment;
FIG. 6 is a flowchart illustrating text correction of text information according to an exemplary embodiment;
FIG. 7 is a block diagram showing the structure of an image recognition apparatus according to an exemplary embodiment;
FIG. 8 is a block diagram of a computing device according to an exemplary embodiment of the present disclosure.
Detailed Description
In the present disclosure, the use of the terms "first," "second," and the like to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, unless otherwise indicated, and such terms are merely used to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.
The terminology used in the description of the various illustrated examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, there may be one or more of them. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items.
In many application scenarios, it is often necessary to obtain the text information contained in an image. One common way is to manually extract and digitize the text information contained in the image. In addition, with the development of artificial intelligence technology, techniques for automatically recognizing images and extracting the text information in them have also developed rapidly, for example, recognizing, extracting, and converting text information into a digitized data format by means of image recognition technology.
The inventors have found that manually extracting and digitizing the text information in an image incurs high costs and low digitization efficiency. When text information is recognized and extracted by common image recognition techniques, on the one hand, the accuracy of text recognition may not be high; on the other hand, because the text information in an image comes in many varieties, especially when the image to be processed contains a table, the data format of the recognized text may be disordered, and the structured data in the original table cannot be accurately preserved.
To solve the above technical problems, the present disclosure provides an image recognition method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product. As will be apparent from the following detailed description, the image recognition method according to embodiments of the present disclosure can perform recognition and analysis on different types of areas in an image to be processed. In contrast to the common image recognition techniques described above, particularly when the image to be processed includes a table area, the image recognition method according to the present disclosure can reproduce the structured relationships among the text information in the image while ensuring that the text information is output correctly. In addition, the image recognition method according to the present disclosure can also correct the text information in a table area according to attribute information of that area, thereby improving text recognition accuracy.
Exemplary embodiments of the image recognition method of the present disclosure will be further described below with reference to the accompanying drawings.
Fig. 1 shows a flowchart of an image recognition method 100 according to an exemplary embodiment of the present disclosure. As shown in fig. 1, the image recognition method 100 may include: step S110, acquiring an image to be processed; step S120, performing text detection on the image to be processed to obtain one or more text boxes and first position information of the one or more text boxes; step S130, performing text recognition on the one or more text boxes to obtain one or more pieces of text information contained in the one or more text boxes; step S140, determining, based on the one or more pieces of text information, whether a table area exists in the image to be processed; and step S150, in response to determining that a table area exists in the image to be processed: determining table row and column information corresponding to each of at least one first text box located within the table area, based on the first position information of the at least one first text box among the one or more text boxes; and outputting structured data corresponding to the table area of the image to be processed based on the table row and column information of each first text box and the text information contained in each first text box.
By identifying and analyzing the different types of areas in the image to be processed, and by determining the table row and column information of the text boxes when the image to be processed includes a table area, the text information in the image to be processed can be output more accurately, while the structured relationships among the text information in the table area are reproduced.
In step S110, the stored or cached image to be processed may be read from an appropriate storage device (local and/or remote). Alternatively, the image to be processed may be received from another external device via a wired or wireless communication link. The scope of the presently claimed subject matter is not limited in this respect.
The image to be processed may be captured by a camera. The camera may be a stand-alone device (e.g., camera, video camera, etc.) or may be included in various types of electronic equipment (e.g., mobile phone, computer, personal digital assistant, tablet, wearable device, etc.). The camera may be an infrared camera or a visible light camera. The image to be processed may comprise a picture, photograph or other suitable form of image.
In step S120, text detection may be performed on the image to be processed, so as to locate one or more text boxes containing text in the image. A text box refers to a boxed region that contains text, for example a rectangular text box.
Text detection of the image to be processed may be performed by a text detection model, wherein the text detection model may be any suitable deep learning model capable of locating text boxes containing text in the image. For example, a deep learning model may be pre-trained with a large amount of image data as input and corresponding text boxes and their location information as output, and then the image to be processed is input into the deep learning model to generate one or more text boxes and their location information.
An example of a text detection model is the Differentiable Binarization (DB) model, in which the binarization threshold can be set adaptively when performing binarization in the segmentation network, thereby simplifying the post-processing that converts the probability map into a binarized image and aggregates pixels into text boxes, and improving text detection performance. Other examples of text detection models include, but are not limited to, models based on the Connectionist Text Proposal Network (CTPN), the Efficient and Accurate Scene Text detection pipeline (EAST), and the like.
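As an illustrative aside (not part of the claimed method), exercising the detection step with an off-the-shelf toolkit whose default detector is a DB model might look like the sketch below; the PaddleOCR package name, constructor flags, and result layout are assumptions about that library and may differ across versions.

```python
# A minimal sketch, assuming the PaddleOCR package exposes this API
# (constructor flags and result layout are assumptions, not fixed by the patent).
from paddleocr import PaddleOCR

ocr = PaddleOCR(lang="ch")           # DB-based detector + CRNN-style recognizer by default
results = ocr.ocr("statement.jpg")   # "statement.jpg" is a hypothetical input image

for line in results[0]:
    box, (text, score) = line        # box: four vertex coordinates of the text box
    print(box, text, score)
```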
In some examples, the location information of the text boxes may refer to coordinate information of four vertices of each text box. In other examples, the location information of the text boxes may refer to a coordinate range of the area covered by each text box.
In step S130, text recognition may be performed on the text box in the image to be processed using a text recognition algorithm. Text recognition of a text box in an image to be processed may be performed, for example, by preprocessing the image, character segmentation, character recognition, and the like. For another example, text recognition of a text box in an image to be processed may be performed using a text recognition model, which may be any suitable deep learning model pre-trained using a large number of text samples.
According to some embodiments of the present disclosure, text recognition of the one or more text boxes may be based on a convolutional recurrent neural network (CRNN) model. The CRNN model is composed of a convolutional neural network (CNN) for extracting features from the image, a recurrent neural network (RNN) for processing text sequences of indefinite length (e.g., a bidirectional long short-term memory network, BiLSTM), and a Connectionist Temporal Classification (CTC) model for aligning text samples. Because the CRNN model can be trained end to end, and the CTC model can recognize and align text samples of different lengths without character segmentation, the accuracy of text recognition is improved. Furthermore, the CRNN model converges easily and is highly robust.
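For concreteness, a minimal CRNN skeleton in PyTorch is sketched below; the framework choice, layer sizes, and character-set size are assumptions, since the disclosure does not prescribe any particular implementation.

```python
# A minimal CRNN sketch in PyTorch (an assumption: the patent does not
# prescribe any particular framework or layer sizes).
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_classes: int, img_height: int = 32):
        super().__init__()
        # CNN: extracts a feature sequence from the input image.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1), (2, 1)),  # halve height only, keep width steps
        )
        feat_h = img_height // 8
        # RNN: a bidirectional LSTM over the width dimension (the sequence).
        self.rnn = nn.LSTM(256 * feat_h, 256, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 256, num_classes)  # per-step class scores

    def forward(self, x):                 # x: (B, 1, H, W)
        f = self.cnn(x)                   # (B, C, H', W')
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # (B, W', C*H')
        out, _ = self.rnn(f)              # (B, W', 512)
        return self.fc(out)               # logits for CTC decoding/loss

# Training pairs the per-step logits with nn.CTCLoss, which aligns
# variable-length label sequences without character segmentation.
model = CRNN(num_classes=6625)  # e.g., a Chinese charset + blank (size is illustrative)
logits = model(torch.randn(2, 1, 32, 160))
print(logits.shape)  # (2, 40, 6625)
```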
According to other embodiments of the present disclosure, the text recognition model may also include, but is not limited to: Sequence2Sequence-based text recognition models, rectification-based text recognition models, attention-based text recognition models, segmentation-based text recognition models, or Transformer-based methods.
It will be appreciated that the above examples are shown for illustrative purposes and that other text recognition methods or models are possible.
Fig. 2 shows a flowchart of another image recognition method 200 according to an exemplary embodiment. As shown in fig. 2, the image recognition method 200 may include: steps S210 to S250, similar to steps S110 to S150 in the image recognition method 100 described with reference to fig. 1; and step S260, preprocessing the image to be processed. Preprocessing the image to be processed in step S260 may include: step S261, performing target detection on the image to be processed; and step S262, cropping the image to be processed based on the result of the target detection.
According to some examples, target detection may be performed on the image to be processed based on a two-stage detection algorithm (e.g., R-CNN, SPP-Net, Fast R-CNN, R-FCN, etc.) or a one-stage detection algorithm (e.g., OverFeat, YOLO, SSD, etc.). Then, the image to be processed is cropped based on the result of the target detection (for example, a rectangular box containing the target and its coordinate information) to obtain an enlarged image to be processed that contains the target.
By performing the preprocessing operations of target detection and cropping on the image to be processed, areas irrelevant to the target can be removed, so that the image recognition method adapts to a wider variety of images to be processed (for example, images with lower brightness), and the accuracy of text detection and text recognition on the image is improved.
Fig. 3 shows a flowchart of preprocessing an image to be processed according to an exemplary embodiment. As shown in fig. 3, preprocessing the image to be processed in step S260 may include: steps S361-S362, similar to steps S261-S262 in the image recognition method 200 described with reference to fig. 2; step S363, determining an average length of the one or more text boxes based on the first position information of the one or more text boxes; step S364, selecting, from the one or more text boxes, at least one second text box whose length is greater than the average length; step S365, determining a rotation angle of each of the at least one second text box, such that after being rotated by the rotation angle, each second text box is rotated to a horizontal position; step S366, determining an average rotation angle of the at least one second text box based on the rotation angle of each second text box; and step S367, rotating the image to be processed based on the average rotation angle.
In step S363, the length of each text box may be determined according to the coordinate information of the text box, and then the average length is determined based on the length of each text box. For example, if the coordinate information of text box $i$ indicates that the coordinates of its four vertices are $[(x_{i,1}, y_{i,1}), (x_{i,2}, y_{i,2}), (x_{i,3}, y_{i,3}), (x_{i,4}, y_{i,4})]$, where the $x$-coordinate represents the vertical direction and the $y$-coordinate represents the horizontal direction, the length of text box $i$ can be calculated, for example, from its top edge:

$L_i = \sqrt{(x_{i,2} - x_{i,1})^2 + (y_{i,2} - y_{i,1})^2}$

or from its bottom edge:

$L_i = \sqrt{(x_{i,4} - x_{i,3})^2 + (y_{i,4} - y_{i,3})^2}$

or as the average of the two. Also, the average length of the one or more text boxes may be calculated as follows:

$L_{ave} = \frac{1}{N} \sum_{i=1}^{N} L_i$

where $N$ represents the total number of text boxes.
It will be appreciated that other suitable formulas or algorithms may be employed to calculate the length of the text box, and are not limited to the examples described above.
Then, in step S364, text boxes having a length greater than the average length (i.e., $L_i > L_{ave}$) may be selected, and in step S365, the rotation angle of each selected text box may be determined based on its coordinate information.

Continuing with the above example, if text box $i$ is one of the selected text boxes having a length greater than the average length, the corresponding rotation angle $\alpha_i$ can be calculated, for example, from the slope of its top edge:

$\alpha_i = \arctan \frac{x_{i,2} - x_{i,1}}{y_{i,2} - y_{i,1}}$

or from the slope of its bottom edge:

$\alpha_i = \arctan \frac{x_{i,4} - x_{i,3}}{y_{i,4} - y_{i,3}}$

or as the average of the two. Then, in step S366, the average rotation angle of the selected text boxes may be calculated as follows:

$\alpha_{ave} = \frac{1}{n} \sum_{i=1}^{n} \alpha_i$

where $n$ represents the total number of selected text boxes.
It will be appreciated that other suitable formulas or algorithms may be employed to calculate the angle of rotation of the text box, and are not limited to the examples described above.
Then, in step S367, the image to be processed is rotated based on the average rotation angle.
By rotating the image to be processed, the dependence of the image recognition method on the shooting angle, placement position, and the like is weakened, which helps the method adapt to a wider variety of images to be processed. Meanwhile, since relatively longer text boxes indicate the rotation angle more accurately (their direction is more stable), rotating the image to be processed based on the average rotation angle of the text boxes longer than the average length deskews the image to be processed more effectively.
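An illustrative reading of steps S363 to S367 is sketched below; the vertex ordering, angle convention, and the OpenCV rotation call are assumptions, not details fixed by the disclosure.

```python
# Minimal deskew sketch, assuming each box is given as four (x, y) vertices
# ordered top-left, top-right, bottom-left, bottom-right, with x vertical
# and y horizontal as in the description above.
import numpy as np
import cv2

def deskew(image: np.ndarray, boxes: list[np.ndarray]) -> np.ndarray:
    lengths = [np.hypot(b[1][0] - b[0][0], b[1][1] - b[0][1]) for b in boxes]
    avg_len = np.mean(lengths)
    # S364: keep only boxes longer than average; their direction is stable.
    long_boxes = [b for b, l in zip(boxes, lengths) if l > avg_len]
    # S365-S366: per-box rotation angle from the top edge, then the average.
    angles = [np.arctan2(b[1][0] - b[0][0], b[1][1] - b[0][1]) for b in long_boxes]
    avg_angle = float(np.degrees(np.mean(angles)))
    # S367: rotate the whole image by the average angle about its center.
    h, w = image.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), avg_angle, 1.0)
    return cv2.warpAffine(image, M, (w, h))
```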
According to other embodiments of the present disclosure, the second text box may also be selected from the one or more text boxes by: for each of the one or more text boxes, determining a number of characters in the text box based on the identified text information; determining an average number of characters for the one or more text boxes; and selecting a text box with the character number larger than the average character number from the one or more text boxes as a second text box.
According to further embodiments of the present disclosure, the second text box may also be selected from the one or more text boxes by: determining a length of each text box based on position information (e.g., coordinate information) of the text box; and in response to determining that the length of the text box exceeds a preset threshold (e.g., 3, 5, etc.), determining the text box as a second text box.
According to further embodiments of the present disclosure, the second text box may also be selected from the one or more text boxes by: determining the number of characters in each text box based on the identified text information of the text box; and in response to determining that the number of characters of the text box exceeds a preset threshold (e.g., 5, 10, etc.), determining the text box as a second text box.
According to some embodiments of the present disclosure, preprocessing the image to be processed in step S260 may further include: rotating the one or more text boxes in the image to be processed based on the average rotation angle; and updating the first position information based on the average rotation angle.
Taking the case where the position information of a text box is its four vertex coordinates $[(x_{i,1}, y_{i,1}), (x_{i,2}, y_{i,2}), (x_{i,3}, y_{i,3}), (x_{i,4}, y_{i,4})]$ as an example, the updated four vertex coordinates $[(x_{i,1}', y_{i,1}'), (x_{i,2}', y_{i,2}'), (x_{i,3}', y_{i,3}'), (x_{i,4}', y_{i,4}')]$ may, for example, satisfy the relationships $x_{i,1}' = x_{i,2}'$, $x_{i,3}' = x_{i,4}'$, $y_{i,1}' = y_{i,3}'$, and $y_{i,2}' = y_{i,4}'$.

It will be appreciated that the above examples are shown for illustrative purposes only, and that the relationships between the four updated vertex coordinates are not limited to them. For example, the difference between the $x$-coordinates of the upper left and upper right vertices of the rotated text box may be less than a preset threshold, i.e., $|x_{i,1}' - x_{i,2}'| < V_{thrs}$, or the difference between the $y$-coordinates of the upper left and lower left vertices of the rotated text box may be less than another preset threshold, i.e., $|y_{i,1}' - y_{i,3}'| < V_{thrs}'$, where $V_{thrs}$ and $V_{thrs}'$ may be the same or different and are, for example, 0.1, 0.25, 0.5, etc. In this case, the text box may be further rotated according to the coordinate differences until the updated four vertex coordinates satisfy $x_{i,1}' = x_{i,2}'$, $x_{i,3}' = x_{i,4}'$, $y_{i,1}' = y_{i,3}'$, and $y_{i,2}' = y_{i,4}'$.
By rotating each text box in the image to be processed by the average rotation angle, each text box is deskewed, which improves the accuracy of text recognition and allows the row and column information of each text box in the table area to be determined more faithfully, so that the structured relationships among the text information in the table area can be reproduced better.
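Rotating the box vertices themselves can be sketched in the same spirit; the data layout and the choice of rotating about the image center are assumptions.

```python
# Rotate each box's four (x, y) vertices about the image center by the
# average angle, then use the rotated vertices as the updated positions.
import numpy as np

def rotate_boxes(boxes: list[np.ndarray], angle_rad: float,
                 center: tuple[float, float]) -> list[np.ndarray]:
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    R = np.array([[c, -s], [s, c]])        # 2x2 rotation matrix
    ctr = np.asarray(center)
    return [(b - ctr) @ R.T + ctr for b in boxes]
```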
According to some embodiments of the present disclosure, determining in step S140 whether a table area exists in the image to be processed based on the one or more pieces of text information may include: determining, for each of the one or more pieces of text information, whether the text information matches preset header information; and, in response to determining that the text information matches the preset header information, determining that a table area exists in the image to be processed. By checking whether any text box in the image to be processed contains text information matching common header information, whether the image to be processed includes a table area can be determined fairly accurately, laying a foundation for preserving and reproducing the structured relationships among the text information in the table area.
According to some examples, the preset header information may be pre-stored header information, such as date, income, expense, and the like. Further, the preset header information can be continuously updated while the image recognition method is in use, so that it covers header information as comprehensively as possible and the table area in the image to be processed is determined more accurately.
According to some examples, a text matching algorithm may be used to determine whether the text information obtained by text recognition matches the preset header information; for example, after the overall semantics of the text information are extracted using a deep learning model, a similarity score against each entry of the preset header information is calculated, and whether the text information matches the preset header information is determined based on the similarity scores. The scope of the claims of the present disclosure is not limited in this respect.
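As a lightweight stand-in for the semantic matcher described above, character-level string similarity can be used; the header list and threshold below are illustrative assumptions.

```python
# Match recognized text against preset headers by string similarity.
from difflib import SequenceMatcher

PRESET_HEADERS = ["date", "income", "expense", "balance"]  # illustrative

def matches_header(text: str, threshold: float = 0.8) -> bool:
    scores = [SequenceMatcher(None, text.lower(), h).ratio() for h in PRESET_HEADERS]
    return max(scores) >= threshold

# A table area is assumed to exist if any recognized text matches a header.
has_table = any(matches_header(t) for t in ["Date", "Paid out", "Remarks"])
```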
In step S150, according to some embodiments of the present disclosure, the table row and column information may include: a row identifier and a column identifier indicating, respectively, the row and the column of each first text box in the table area. For example, the table row and column information may be two-dimensional tuples, equal in number to the total number of first text boxes; e.g., the row and column information (a, b) of first text box i indicates that first text box i is located in the a-th row and the b-th column of the table area. Thus, structured information within the table area can be quickly obtained for output.
Fig. 4 shows a flowchart of determining table row and column information for text boxes within a table area in this case, according to an embodiment of the present disclosure. As shown in fig. 4, determining the table row and column information of each of the at least one first text box based on the first position information of the at least one first text box located within the table area may include: step S410, sorting the at least one first text box by the vertical coordinates indicated by the first position information to obtain a sorted first text box set; step S420, determining, for each first text box, whether the difference between the vertical coordinates of the first text box and the previous first text box in the first text box set exceeds a first preset threshold; and step S430, in response to determining that the difference between the vertical coordinates of the first text box and the previous first text box in the first text box set exceeds the first preset threshold, incrementing the row identifier corresponding to the first text box by 1.
In step S410, the at least one first text box may be sorted in increasing order of vertical coordinate. Continuing the example in which the first position information is the four vertex coordinates $[(x_{i,1}, y_{i,1}), (x_{i,2}, y_{i,2}), (x_{i,3}, y_{i,3}), (x_{i,4}, y_{i,4})]$ of each text box, the sorting may be based on the upper left vertex coordinates $(x_{i,1}, y_{i,1})$, i.e., the at least one first text box is sorted by $x_{i,1}$ from small to large; it may also be based on the lower left vertex coordinates $(x_{i,3}, y_{i,3})$, i.e., sorted by $x_{i,3}$ from small to large.
In step S420, the first preset threshold may be any suitable value, such as 0.5, 1, 1.5, etc. Similarly, when calculating the difference between the vertical coordinates of a first text box and its previous first text box, the upper left vertex of each text box may be used as the reference. That is, for the $(i-1)$-th and $i$-th first text boxes, the difference in vertical coordinates can be expressed as $x_{i,1} - x_{i-1,1}$. It will be appreciated that the difference in vertical coordinates of each first text box and its previous first text box may also be calculated based on the lower left vertex of each text box.
In step S430, in response to determining that $x_{i,1} - x_{i-1,1} > V_{thrs1}$, it may be determined that the $(i-1)$-th and $i$-th first text boxes are located in different rows, and the row identifier of the $i$-th first text box is incremented by 1. In response to determining that $x_{i,1} - x_{i-1,1} \le V_{thrs1}$, it may be determined that the $(i-1)$-th and $i$-th first text boxes are located in the same row (have the same row identifier).
With continued reference to fig. 4, determining in step S150 the table row and column information of each of the at least one first text box based on the first position information of the at least one first text box located within the table area may further include: step S440, sorting the at least one first text box by the horizontal coordinates indicated by the first position information to obtain a sorted second text box set; step S450, determining, for each first text box, whether the difference between the horizontal coordinates of the first text box and the previous first text box in the second text box set exceeds a second preset threshold; and step S460, in response to determining that the difference between the horizontal coordinates of the first text box and the previous first text box in the second text box set exceeds the second preset threshold, determining the column identifier corresponding to the first text box based on the horizontal coordinates of the first text box and the horizontal coordinates of the previous first text box.
In step S440, the at least one first text box may be sorted in increasing order of horizontal coordinate. Continuing the example in which the first position information is the four vertex coordinates $[(x_{i,1}, y_{i,1}), (x_{i,2}, y_{i,2}), (x_{i,3}, y_{i,3}), (x_{i,4}, y_{i,4})]$ of each text box, the sorting may be based on the upper left vertex coordinates $(x_{i,1}, y_{i,1})$, i.e., the at least one first text box is sorted by $y_{i,1}$ from small to large; it may also be based on the lower left vertex coordinates $(x_{i,3}, y_{i,3})$, i.e., sorted by $y_{i,3}$ from small to large.
In step S450, the second preset threshold may be any suitable value (which may be the same as or different from the first preset threshold, for example, 1, 2, 3, etc.). Similarly, when calculating the difference between the horizontal coordinates of a first text box and its previous first text box, the upper left vertex of each text box may be used as the reference. That is, for the $(i-1)$-th and $i$-th first text boxes, the difference in horizontal coordinates can be expressed as $y_{i,1} - y_{i-1,1}$. It will be appreciated that the difference in horizontal coordinates of each first text box and its previous first text box may also be calculated based on the lower left vertex of each text box.
In step S460, in response to determining that $y_{i,1} - y_{i-1,1} > V_{thrs2}$, the column identifier corresponding to the $i$-th first text box may be determined based on the horizontal coordinates of the $(i-1)$-th first text box and of the $i$-th first text box.
According to some embodiments of the present disclosure, determining the column identifier corresponding to the first text box based on the horizontal coordinates of the first text box and of the previous first text box may include: calculating the average of the horizontal coordinate of the first text box and the horizontal coordinate of the previous first text box as a column separation coordinate; and determining the column identifier corresponding to the first text box based on the column separation coordinate. For example, the average of the horizontal coordinate $y_{i,1}$ of the $i$-th first text box and the horizontal coordinate $y_{i-1,2}$ of the $(i-1)$-th first text box may be calculated, i.e., $(y_{i,1} + y_{i-1,2})/2$, as the column separation coordinate.
From the horizontal and vertical coordinates of each text box within the table area, it is possible to quickly and accurately determine in which row and which column of the table area the text box is located. Since the row identifier and the column identifier indicating the row and column information in the table area are the key factors representing the structured relationships among the text information in the image to be processed, the arrangement of the text boxes in the table area can be determined, which helps restore the structured information associated with the table area in the image to be recognized. In addition, the structured data can be presented to the user more clearly and simply as output, and can be exported as an electronic document such as an Excel file for subsequent processing.
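The row and column assignment of steps S410 to S460 could look like the following sketch; the box layout and thresholds are assumptions, and a simple gap test stands in for the column-separation-coordinate bookkeeping described above.

```python
# Assign (row, column) identifiers to table text boxes from their
# upper-left vertex coordinates; x is vertical, y is horizontal.
def assign_rows_and_columns(boxes, v_thrs1=1.0, v_thrs2=2.0):
    # boxes: list of dicts like {"id": ..., "x": x_top_left, "y": y_top_left}
    rows = {}
    row_id = 0
    for i, b in enumerate(sorted(boxes, key=lambda b: b["x"])):
        if i > 0 and b["x"] - prev_x > v_thrs1:   # new row when the vertical gap is large
            row_id += 1
        rows[b["id"]] = row_id
        prev_x = b["x"]

    cols = {}
    col_id = 0
    for i, b in enumerate(sorted(boxes, key=lambda b: b["y"])):
        if i > 0 and b["y"] - prev_y > v_thrs2:   # new column when the horizontal gap is large
            col_id += 1
        cols[b["id"]] = col_id
        prev_y = b["y"]

    return {b["id"]: (rows[b["id"]], cols[b["id"]]) for b in boxes}
```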
According to some embodiments of the present disclosure, the image recognition method 100 or 200 may further include: sorting at least one third text box located outside the table area according to third position information of the at least one third text box; and outputting data corresponding to the non-table area of the image to be processed based on the sorting result and the text information contained in the at least one third text box.
In some examples, the table areas and non-table areas in the image to be processed may be determined in the manner described above. When no table area exists in the image to be processed, or for text boxes in regular areas outside the table area, the text boxes may be sorted in a preset order (for example, the general reading order: left to right, top to bottom), and the text information contained in the sorted text boxes may then be output in sequence.
In general, regular areas outside the table area may contain no obvious structured information, and the user may care more about the specific text content in these areas than about its exact position in the image. Sorting and outputting the text boxes in these areas in a preset order, such as the general reading order, therefore outputs text content that better matches the user's needs while simplifying the operation.
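A reading-order sort for text boxes outside the table area might look like this; grouping rows by a vertical tolerance is an assumption, since the disclosure only specifies the left-to-right, top-to-bottom order.

```python
# Sort non-table text boxes top-to-bottom, then left-to-right within a row.
def reading_order(boxes, row_tol=0.5):
    # boxes: list of dicts like {"x": x_top_left, "y": y_top_left, "text": ...}
    ordered = sorted(boxes, key=lambda b: b["x"])   # top to bottom first
    lines, current = [], []
    for b in ordered:
        if current and b["x"] - current[-1]["x"] > row_tol:
            lines.append(sorted(current, key=lambda b: b["y"]))  # left to right
            current = []
        current.append(b)
    if current:
        lines.append(sorted(current, key=lambda b: b["y"]))
    return [b["text"] for line in lines for b in line]
```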
Fig. 5 shows a flow chart of yet another image recognition method 500 according to an exemplary embodiment. As shown in fig. 5, the image recognition method 500 may include: steps S510-S550 similar to steps S110-S150 in the image recognition method 100 described with reference to fig. 1 or steps S210-S250 in the image recognition method 200 described with reference to fig. 2; and step S560, performing text correction on the text information contained in the at least one first text box.
In general, a paper document captured in an image to be processed may contain multiple text types (e.g., English, Chinese, numerals) and multiple formats (e.g., different table layouts); recognizing such text information with a conventional text recognition algorithm may therefore have a high error rate. By performing further text correction on the text information recognized in the text boxes, the accuracy of the output text can be improved.
It will be appreciated that although step S560 of the image recognition method 500 is shown in fig. 5 as being performed after step S550 outputs the structured data corresponding to the table area of the image to be processed, in alternative embodiments the at least one first text box located within the table area, together with its position information and table row and column information, may first be determined from the one or more text boxes, text correction may then be performed on the text information contained in these first text boxes, and the structured data corresponding to the table area of the image to be processed may be output after the text correction.
Fig. 6 shows a flowchart for text correction of text information according to an exemplary embodiment. As shown in fig. 6, performing text correction in step S560 on the text information contained in the at least one first text box may include: step S610, obtaining a header corpus; step S620, determining at least one fourth text box associated with header information among the at least one first text box; and step S630, determining whether to replace the text information contained in the at least one fourth text box with the corresponding header in the header corpus according to whether the text information contained in the at least one fourth text box is consistent with a header in the header corpus.
By obtaining a corpus of common headers in advance and comparing the text information contained in a text box identified as a header against the headers in the header corpus, header text can be corrected automatically to a certain extent, reducing the labor of manual review and improving the accuracy of the text output.
In step S610, the header corpus may include a plurality of pre-stored regular header texts, such as date, expense, income, name, gender, and the like. The regular header texts are not limited to Chinese text and may also be English text, such as Date, Payment type and details, Paid out, Paid in, Balance, Name, Gender, and the like. Further, the header corpus may be continuously updated while images are processed using the image recognition method herein. For example, if a later manual review finds that a header text in the table area of the image to be processed does not match any header in the header corpus, that header text may be added to the header corpus.
In step S620, for each text box in the table area of the image to be processed, the at least one fourth text box may be determined by determining, with a text matching algorithm, whether the text information contained in the text box matches a header in the obtained header corpus. It will be appreciated that "matching" does not require the text information contained in the text box to be completely identical to a header included in the header corpus. For example, if the header corpus includes the headers date, expense, income, name, and gender, then a text box containing text information such as "start date" or "end date" may be determined to match the header "date", and the text boxes containing the text information "start date" and "end date" are determined to be fourth text boxes.
With continued reference to fig. 6, in step S630, determining whether to replace the text information contained in the at least one fourth text box with the corresponding header in the header corpus according to whether that text information is consistent with a header in the header corpus may include: step S632, for each fourth text box of the at least one fourth text box, in response to determining that the text information contained in the fourth text box is inconsistent with all headers in the header corpus, calculating the string edit distance between the text information contained in the fourth text box and each header in the header corpus; step S634, selecting, from all headers in the header corpus, the header with the minimum string edit distance to the text information contained in the fourth text box; and step S636, in response to determining that the minimum string edit distance is greater than or equal to 1 and less than the length of the string of the text information contained in the fourth text box, replacing the text information contained in the fourth text box with the selected header.
In step S632, the string edit distance may represent the number of operations required to change the string contained in the text information into a header in the header corpus when the two are not identical. The operations may include, for example, adding a character, deleting a character, or replacing a character.
Continuing the example in which the headers included in the header corpus are date, expense, income, name, and gender, and the text information contained in the text box is "start date", the number of operations required to change the string "start date" into the string "date" is 2 (deleting the two leading characters of the original Chinese string), while the number of operations required to change "start date" into "expense" is 4 (replacing the first two characters and deleting the last two). The number of operations required to change the string contained in the text information into each header in the header corpus may be calculated in a similar manner.
Then, in step S634, the header having the minimum string edit distance to the text information contained in the text box may be selected. In step S636, if it is determined that the minimum string edit distance is greater than or equal to 1 and less than the length of the string of the text information contained in the text box, it may be determined that the header corpus contains a header corresponding to that text information (as in the example above, where the headers are date, expense, income, name, and gender, and the text box contains the text information "start date"). In this case, the text information contained in the text box may be replaced with the corresponding header in the header corpus (e.g., the text information "start date" is replaced with the header "date").
Alternatively, one or more headers corresponding to the text information contained in the text box may first be determined from the header corpus, and then the minimum number of operations required to change the string contained in the text information into the corresponding one or more headers may be calculated as the minimum string edit distance.
With continued reference to fig. 6, according to some embodiments of the present disclosure, determining in step S630 whether to replace the text information contained in the at least one fourth text box with the corresponding header in the header corpus according to whether that text information is consistent with a header in the header corpus may further include: step S638, in response to determining that the minimum string edit distance is greater than or equal to the length of the string of the text information contained in the fourth text box, retaining the text information.
In step S638, if it is determined that the minimum string edit distance is greater than or equal to the length of the string of the text information contained in the text box, it may be determined that the header corpus contains no header corresponding to that text information (for example, with the headers date, expense, income, name, and gender, text information reading "payment details" has no counterpart). In this case, the text information may be retained. Further, the text information can be added to the header corpus as a new header, so that the header corpus covers regular headers more comprehensively, which helps determine more accurately whether recognized text needs correction when the image recognition method is executed, thereby improving the accuracy of the output text.
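Steps S632 to S638 can be condensed into the following sketch; the Levenshtein implementation and the corpus contents are assumptions, while the decision rule is the one stated above.

```python
# Correct a recognized header using string edit distance (steps S632-S638).
def edit_distance(a: str, b: str) -> int:
    # Classic Levenshtein distance: insertions, deletions, substitutions.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[len(b)]

def correct_header(text: str, corpus: list[str]) -> str:
    if text in corpus:                       # already consistent: keep as-is
        return text
    best = min(corpus, key=lambda h: edit_distance(text, h))
    d = edit_distance(text, best)
    if 1 <= d < len(text):                   # S636: close enough to replace
        return best
    corpus.append(text)                      # S638: keep it and learn the new header
    return text

corpus = ["date", "expense", "income", "name", "gender"]  # illustrative
print(correct_header("start date", corpus))  # likely corrected toward "date"
```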
According to some embodiments of the present disclosure, performing text correction in step S560 on the text information contained in the at least one first text box may include: for each of one or more fifth text boxes, among the at least one first text box, that are not related to header information, determining the attribute of the fourth text box corresponding to the fifth text box, the attribute indicating the category of the text information contained in the fourth text box; and performing text correction on the text information contained in the fifth text box according to the attribute of the corresponding fourth text box, so that the corrected text information satisfies the attribute of the corresponding fourth text box.
According to some embodiments of the present disclosure, the attributes of a text box may include one or more of the following: date, number, and string language. For example, text boxes with the same column identifier may have the same attribute, indicated by the corresponding header.
In some examples, for text box i, if it is determined that its corresponding header indicates the "date" attribute, and the text information obtained by text recognition does not conform to a date format (for example, a date string containing a misrecognized character), it may be determined that the text box requires text correction, i.e., the recognized string is corrected into a well-formed date.
In other examples, for text box j, if it is determined that the attribute indicated by its corresponding header is "number", and the text information obtained by text recognition is "100O", it may be determined that the text box requires text correction, i.e., "100O" is modified to "1000".
In still other examples, for text box k, if it is determined that the attribute indicated by its corresponding header is "string language-english", and that the text information it contains is "Ba1ance" (i.e., the english letter "l" in "Balance" is erroneously recognized as the number "1") by text recognition, it may be determined that text correction is required for the text box, i.e., that "Ba1ance" is modified to "Balance".
In this way, the attributes of the text boxes in the table area can be fully utilized, and the text information they contain can be corrected based on those attributes. This not only improves the accuracy of the output structured data corresponding to the table area and corrects the text information automatically, but also greatly reduces the need for manual review, thereby saving labor costs.
It will be appreciated that the embodiments concerning attribute information are shown for illustrative purposes; in other embodiments, other attribute information may also be included and continuously updated.
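Attribute-driven correction can be sketched as below; the confusion tables and the date check are assumptions, since the disclosure only requires that corrected text satisfy the attribute of its column.

```python
# Correct recognized cell text according to its column attribute.
import re

DIGIT_FIXES = {"O": "0", "o": "0", "l": "1", "I": "1", "B": "8"}  # common confusions
LETTER_FIXES = {"0": "O", "1": "l", "5": "S"}

def correct_by_attribute(text: str, attribute: str) -> str:
    if attribute == "number":
        # Replace letters commonly misrecognized in place of digits: "100O" -> "1000".
        return "".join(DIGIT_FIXES.get(ch, ch) for ch in text)
    if attribute == "string-english":
        # Replace digits commonly misrecognized in place of letters: "Ba1ance" -> "Balance".
        return "".join(LETTER_FIXES.get(ch, ch) if ch.isdigit() else ch for ch in text)
    if attribute == "date":
        # Normalize digit confusions, then keep the fix only if a date shape emerges.
        fixed = "".join(DIGIT_FIXES.get(ch, ch) for ch in text)
        return fixed if re.search(r"\d{4}\D+\d{1,2}\D+\d{1,2}", fixed) else text
    return text

print(correct_by_attribute("100O", "number"))             # 1000
print(correct_by_attribute("Ba1ance", "string-english"))  # Balance
```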
Fig. 7 shows a block diagram of the structure of an image recognition apparatus 700 according to an exemplary embodiment. As shown in fig. 7, the apparatus 700 may include: an acquisition module 710 configured to acquire an image to be processed; a text detection module 720 configured to perform text detection on the image to be processed to obtain one or more text boxes and first position information of the one or more text boxes; a text recognition module 730 configured to perform text recognition on the one or more text boxes to obtain one or more pieces of text information contained in the one or more text boxes; a table area determination module 740 configured to determine whether a table area exists in the image to be processed based on the one or more pieces of text information; a table information determination module 750 configured to determine, in response to determining that a table area exists in the image to be processed, table row and column information for each of at least one first text box located within the table area, based on the first position information of the at least one first text box among the one or more text boxes; and an output module 760 configured to output structured data corresponding to the table area of the image to be processed based on the table row and column information of each first text box and the text information contained in each first text box.
It should be appreciated that the various modules 710-760 of the apparatus 700 shown in fig. 7 may correspond to the various steps S110-S150 in the method 100 described with reference to fig. 1, the various steps S210-S250 in the method 200 described with reference to fig. 2, or the various steps S510-S550 in the method 500 described with reference to fig. 5. Thus, the operations, features, and advantages described above with respect to methods 100, 200, and 500 apply equally to apparatus 700 and the modules comprised thereby. For brevity, certain operations, features and advantages are not described in detail herein.
It should also be appreciated that various techniques may be described herein in the general context of software and hardware elements or program modules. The various modules described above with respect to fig. 7 may be implemented in hardware or in hardware combined with software and/or firmware. For example, the modules may be implemented as computer program code/instructions configured to be executed in one or more processors and stored in a computer-readable storage medium. Alternatively, these modules may be implemented as hardware logic/circuitry. For example, in some embodiments, one or more of the acquisition module 710, the text detection module 720, the text recognition module 730, the table area determination module 740, the table information determination module 750, and the output module 760 may be implemented together in a System on Chip (SoC). The SoC may include an integrated circuit chip including one or more components of a processor (e.g., a Central Processing Unit (CPU), microcontroller, microprocessor, Digital Signal Processor (DSP), etc.), memory, one or more communication interfaces, and/or other circuitry, and may optionally execute received program code and/or include embedded firmware to perform its functions.
Another aspect of the present disclosure may include an electronic device that may include a memory, a processor, and a computer program stored on the memory, wherein the processor is configured to execute the computer program to implement the steps of the image recognition method described above.
Yet another aspect of the present disclosure may include a computer-readable storage medium having a computer program stored thereon, wherein the computer program when executed by a processor implements the steps of the image recognition method described above.
Yet another aspect of the present disclosure may include a computer program product comprising a computer program, wherein the computer program when executed by a processor implements the steps of the image recognition method described above.
With reference to fig. 8, a computing device 800 will now be described, which is an example of a hardware device that may be applied to aspects of the present disclosure. Computing device 800 may be any machine configured to perform processing and/or calculations and may be, but is not limited to, a workstation, a server, a desktop computer, a laptop computer, a tablet computer, a personal digital assistant, a smart phone, an in-vehicle computer, a door access system, an attendance device, or any combination thereof. The image recognition apparatus described above may be implemented, in whole or at least in part, by computing device 800 or a similar device or system. While computing device 800 represents one example of several types of computing platforms, it may include more or fewer elements and/or a different arrangement of elements than shown in fig. 8, and the scope of the claimed subject matter is not limited in these respects.
In some embodiments, computing device 800 may include elements that are connected to bus 802 or communicate with bus 802 (possibly via one or more interfaces). For example, computing device 800 may include a bus 802, one or more processors 804, one or more input devices 806, and one or more output devices 808. The one or more processors 804 may be any type of processor and may include, but are not limited to, one or more general purpose processors and/or one or more special purpose processors (e.g., special processing chips). Input device 806 may be any type of device capable of inputting information to computing device 800 and may include, but is not limited to, a mouse, keyboard, touch screen, microphone, and/or remote control. Output device 808 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. Computing device 800 may also include a non-transitory storage device 810 or be connected to non-transitory storage device 810. A non-transitory storage device may be any storage device that is non-transitory and that may enable data storage, and may include, but is not limited to, a magnetic disk drive, an optical storage device, a solid state memory, a floppy disk, a flexible disk, a hard disk, a magnetic tape, or any other magnetic medium, an optical disk or any other optical medium, a ROM (read only memory), a RAM (random access memory), a cache memory, and/or any other memory chip or cartridge, and/or any other medium from which a computer may read data, instructions, and/or code. The non-transitory storage device 810 may be detachable from the interface. The non-transitory storage device 810 embodies one or more non-transitory computer readable media having stored thereon a program comprising instructions that, when executed by one or more processors of the computing device 800, cause the computing device 800 to perform the image recognition methods 100, 200, 500 and variations thereof described above. Computing device 800 may also include communication device 812. Communication device 812 may be any type of device or system that enables communication with external devices and/or with a network, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication device, and/or a chipset, such as a bluetooth (TM) device, an 802.11 device, a WiFi device, a WiMax device, a cellular communication device, and/or the like.
In some embodiments, computing device 800 may also include a working memory 814, which may be any type of memory that may store programs (including instructions) and/or data useful for the operation of processor 804, and may include, but is not limited to, random access memory and/or read-only memory devices.
Software elements (programs) may reside in the working memory 814, including, but not limited to, an operating system 816, one or more application programs 818, drivers, and/or other data and code. Instructions for performing the above-described methods and steps may be included in the one or more application programs 818, and the modules of the above-described image recognition apparatus may be implemented by the processor 804 reading and executing the instructions of the one or more application programs 818. Executable code or source code of the instructions of the software elements (programs) may be stored in a non-transitory computer-readable storage medium (such as the storage device 810 described above) and, when executed, may be loaded into the working memory 814 (possibly after being compiled and/or installed). Executable code or source code of the instructions of the software elements (programs) may also be downloaded from a remote location.
It should also be understood that various modifications may be made according to specific requirements. For example, custom hardware may also be used, and/or particular elements may be implemented in hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. For example, some or all of the disclosed methods and apparatus may be implemented by programming hardware (e.g., programmable logic circuits including field programmable gate arrays (FPGAs) and/or programmable logic arrays (PLAs)) in an assembly language or a hardware programming language such as Verilog, VHDL, or C++, using logic and algorithms according to the present disclosure.
It should also be appreciated that the foregoing method may be implemented in a client-server mode. For example, a client may collect image data with a camera and send the image data to a server for subsequent processing. Alternatively, the client may perform a portion of the processing of the foregoing method and send the partially processed data to the server. The server may receive the data from the client, perform the foregoing method or the remaining part of the foregoing method, and return the execution result to the client. The client may receive the execution result from the server and may present it to the user, for example, via an output device.
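By way of illustration only, a minimal Python sketch of such a client-server split follows; the server URL, the /recognize endpoint, and the JSON response schema are assumptions made for this example and are not part of the present disclosure:

    # Hypothetical client-side sketch: the client uploads an image and the
    # server (not shown) runs the recognition method and returns structured
    # data. Endpoint and response schema are illustrative assumptions.
    import requests

    def recognize_remote(image_path: str,
                         server_url: str = "http://localhost:8000/recognize"):
        with open(image_path, "rb") as f:
            response = requests.post(server_url, files={"image": f}, timeout=30)
        response.raise_for_status()
        return response.json()  # assumed to contain the structured table data

    if __name__ == "__main__":
        print(recognize_remote("table_photo.jpg"))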
It should also be appreciated that components of computing device 800 may be distributed across a network. For example, some processes may be performed using one processor while other processes may be performed by another processor remote from the one processor. Other components of computing device 800 may also be similarly distributed. As such, computing device 800 may be understood as a distributed computing system executing processes at multiple locations on multiple processors.
Some exemplary aspects of the disclosure are described below.
Aspect 1. An image recognition method, comprising:
acquiring an image to be processed;
performing text detection on the image to be processed to obtain one or more text boxes and first position information of the one or more text boxes;
performing text recognition on the one or more text boxes to obtain one or more pieces of text information contained in the one or more text boxes;
determining, based on the one or more pieces of text information, whether a table area exists in the image to be processed; and
in response to determining that a table area exists in the image to be processed:
determining, based on the first position information of at least one first text box, among the one or more text boxes, that is located within the table area, table rank information corresponding to each of the at least one first text box; and
outputting, based on the table rank information of each first text box and the text information contained in each first text box, structured data corresponding to the table area of the image to be processed.
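By way of illustration only, a high-level Python sketch of the method of Aspect 1 follows. All helper callables (detect_text_boxes, recognize_text, assign_rows_and_columns) and the dictionary field names are hypothetical placeholders for concrete detection, recognition, and table-structuring components, not part of the present disclosure:

    # Illustrative end-to-end sketch of the method of Aspect 1.
    def image_recognition(image, detect_text_boxes, recognize_text,
                          preset_headers, assign_rows_and_columns):
        # Text detection: each box carries its first position information ("bbox").
        boxes = detect_text_boxes(image)
        # Text recognition on every detected text box.
        for box in boxes:
            box["text"] = recognize_text(image, box["bbox"])
        # A table area is taken to exist if any recognized text matches a
        # preset header (cf. Aspect 5).
        if not any(box["text"] in preset_headers for box in boxes):
            return None
        # Determine table rank information from the position information
        # (cf. Aspects 7-9), then emit structured (row, column) -> text data.
        assign_rows_and_columns(boxes)
        return {(b["row"], b["col"]): b["text"] for b in boxes}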
Aspect 2. The method according to aspect 1, further comprising preprocessing the image to be processed, wherein the preprocessing comprises:
performing target detection on the image to be processed; and
cropping the image to be processed based on a result of the target detection.
Aspect 3. The method of aspect 2, wherein the preprocessing further comprises:
determining an average length of the one or more text boxes based on the first location information of the one or more text boxes;
selecting at least one second text box having a text box length greater than the average length from the one or more text boxes;
determining a rotation angle of each second text box in the at least one second text box, wherein, after being rotated by the rotation angle, each second text box lies in a horizontal position;
determining an average rotation angle of the at least one second text box based on the rotation angle of each second text box; and
rotating the image to be processed based on the average rotation angle.
Aspect 4. The method of aspect 3, wherein the preprocessing further comprises:
rotating the one or more text boxes in the image to be processed based on the average rotation angle; and
updating the first position information based on the average rotation angle.
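By way of illustration only, a minimal sketch of the deskewing of Aspects 3-4 follows, assuming each detected text box exposes a length and the angle (in degrees) that would rotate it to horizontal; the field names are assumptions:

    import math

    def average_rotation_angle(boxes):
        # Average length of all detected text boxes.
        avg_len = sum(b["length"] for b in boxes) / len(boxes)
        # Longer boxes estimate the document skew more reliably than short ones.
        long_boxes = [b for b in boxes if b["length"] > avg_len]
        # Average of the angles that would bring each long box horizontal.
        return sum(b["angle"] for b in long_boxes) / len(long_boxes)

    def rotate_point(x, y, cx, cy, degrees):
        # Rotate point (x, y) about center (cx, cy); applying this to every
        # box corner updates the first position information (Aspect 4).
        rad = math.radians(degrees)
        dx, dy = x - cx, y - cy
        return (cx + dx * math.cos(rad) - dy * math.sin(rad),
                cy + dx * math.sin(rad) + dy * math.cos(rad))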
Aspect 5. The method of aspect 1, wherein determining, based on the one or more pieces of text information, whether a table area exists in the image to be processed comprises:
determining, for each piece of the one or more pieces of text information, whether the text information matches preset header information; and
determining, in response to determining that the text information matches the preset header information, that a table area exists in the image to be processed.
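By way of illustration only, a sketch of the table-area test of Aspect 5 follows, under the simplifying assumption that a match means exact equality with a preset header string; the header values shown are illustrative:

    PRESET_HEADERS = {"Name", "Date", "Quantity", "Amount"}  # illustrative only

    def has_table_area(text_infos, preset_headers=PRESET_HEADERS):
        # A table area is deemed present as soon as any recognized text
        # matches preset header information.
        return any(text in preset_headers for text in text_infos)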
Aspect 6. The method of aspect 1, wherein the table rank information comprises a row identifier and a column identifier indicating, respectively, the row and the column of each first text box in the table area.
Aspect 7. The method of aspect 6, wherein determining, based on the first position information of the at least one first text box located within the table area, the table rank information corresponding to each of the at least one first text box comprises:
sorting the at least one first text box according to vertical direction coordinates indicated by the first position information to obtain a sorted first text box set;
determining, for each first text box, whether a difference between the vertical direction coordinates of the first text box and those of a previous first text box in the first text box set exceeds a first preset threshold; and
incrementing, in response to determining that the difference between the vertical direction coordinates of the first text box and those of the previous first text box in the first text box set exceeds the first preset threshold, a row identifier corresponding to the first text box by 1.
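By way of illustration only, a sketch of the row-identifier logic of Aspect 7 follows, assuming each box exposes a vertical coordinate y (e.g., of its top edge); the threshold value is an arbitrary assumption:

    def assign_row_identifiers(boxes, row_threshold=10.0):
        # Sort by the vertical direction coordinate.
        boxes.sort(key=lambda b: b["y"])
        row = 1
        prev = None
        for box in boxes:
            # A vertical jump beyond the first preset threshold starts a new row.
            if prev is not None and box["y"] - prev["y"] > row_threshold:
                row += 1
            box["row"] = row
            prev = box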
Aspect 8. The method of aspect 7, wherein determining, based on the first position information of the at least one first text box located within the table area, the table rank information corresponding to each of the at least one first text box further comprises:
sorting the at least one first text box according to horizontal direction coordinates indicated by the first position information to obtain a sorted second text box set;
determining, for each first text box, whether a difference between the horizontal direction coordinates of the first text box and those of a previous first text box in the second text box set exceeds a second preset threshold; and
determining, in response to determining that the difference between the horizontal direction coordinates of the first text box and those of the previous first text box in the second text box set exceeds the second preset threshold, a column identifier corresponding to the first text box based on the horizontal direction coordinates of the first text box and of the previous first text box.
Aspect 9. The method of aspect 8, wherein determining the column identifier corresponding to the first text box based on the horizontal direction coordinates of the first text box and the horizontal direction coordinates of the previous first text box comprises:
calculating an average value of the horizontal direction coordinates of the first text box and of the previous first text box as a column separation coordinate; and
determining, based on the column separation coordinate, the column identifier corresponding to the first text box.
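By way of illustration only, a sketch of the column logic of Aspects 8-9 follows: after sorting by a horizontal coordinate x, each jump beyond the threshold yields a column separation coordinate (the average of the two coordinates around the jump), and a box's column identifier is derived from its position relative to those separators. Field and parameter names are assumptions:

    import bisect

    def assign_column_identifiers(boxes, col_threshold=20.0):
        ordered = sorted(boxes, key=lambda b: b["x"])
        # Column separation coordinates: average of the horizontal direction
        # coordinates on either side of each large horizontal jump (Aspect 9).
        separators = []
        for prev, cur in zip(ordered, ordered[1:]):
            if cur["x"] - prev["x"] > col_threshold:
                separators.append((prev["x"] + cur["x"]) / 2.0)
        # The column identifier is 1 plus the number of separators to the left.
        for box in boxes:
            box["col"] = bisect.bisect_left(separators, box["x"]) + 1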
Aspect 10. The method of aspect 1, further comprising:
sorting at least one third text box, among the one or more text boxes, that is located outside the table area according to third position information of the at least one third text box; and
outputting, based on a result of the sorting and text information contained in the at least one third text box, data corresponding to a non-table area of the image to be processed.
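By way of illustration only, a minimal sketch of Aspect 10 follows, assuming a top-to-bottom, left-to-right reading order derived from each third text box's position information:

    def output_non_table_text(third_boxes):
        # Sort by position (top-to-bottom, then left-to-right) and join the
        # contained text information in reading order.
        ordered = sorted(third_boxes, key=lambda b: (b["y"], b["x"]))
        return "\n".join(b["text"] for b in ordered)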
Aspect 11. The method of aspect 1, further comprising: performing text correction on the text information contained in the at least one first text box.
Aspect 12. The method of aspect 11, wherein performing text correction on the text information contained in the at least one first text box comprises:
acquiring a header corpus;
determining, in the at least one first text box, at least one fourth text box associated with header information; and
determining, according to whether the text information contained in the at least one fourth text box is consistent with a header in the header corpus, whether to replace the text information contained in the at least one fourth text box with the corresponding header in the header corpus.
Aspect 13. The method of aspect 12, wherein determining, according to whether the text information contained in the at least one fourth text box is consistent with a header in the header corpus, whether to replace the text information contained in the at least one fourth text box with the corresponding header in the header corpus comprises, for each fourth text box in the at least one fourth text box:
calculating, in response to determining that the text information contained in the fourth text box is inconsistent with all headers in the header corpus, string edit distances between the text information contained in the fourth text box and all headers in the header corpus;
selecting, from all headers in the header corpus, the header having the minimum string edit distance from the text information contained in the fourth text box; and
replacing, in response to determining that the minimum string edit distance is greater than or equal to 1 and less than the length of the string in the text information contained in the fourth text box, the text information contained in the fourth text box with the selected header.
Aspect 14. The method of aspect 13, wherein determining whether to replace the text information contained in the at least one fourth text box with the corresponding header in the header corpus further comprises:
retaining the text information in response to determining that the minimum string edit distance is greater than or equal to the length of the string in the text information contained in the fourth text box.
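By way of illustration only, a sketch of the header correction of Aspects 13-14 follows, using the standard Levenshtein (string edit) distance and the replacement thresholds stated above:

    def edit_distance(a: str, b: str) -> int:
        # Classic single-row dynamic-programming Levenshtein distance.
        dp = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, cb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                         dp[j - 1] + 1,      # insertion
                                         prev + (ca != cb))  # substitution
        return dp[-1]

    def correct_header(text: str, header_corpus) -> str:
        if text in header_corpus:  # already consistent with a header: keep it
            return text
        # Header with the minimum string edit distance (Aspect 13).
        best = min(header_corpus, key=lambda h: edit_distance(text, h))
        d = edit_distance(text, best)
        # Replace only when 1 <= distance < length of the text (Aspect 13);
        # otherwise retain the original text information (Aspect 14).
        return best if 1 <= d < len(text) else text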
Aspect 15. The method of any one of aspects 12-14, wherein performing text correction on the text information contained in the at least one first text box further comprises:
determining, for each fifth text box of one or more fifth text boxes, among the at least one first text box, that are not associated with header information, an attribute of a fourth text box corresponding to the fifth text box, wherein the attribute indicates a category of the text information contained in the fourth text box; and
performing text correction on the text information contained in the fifth text box according to the attribute of the fourth text box corresponding to the fifth text box, such that the corrected text information conforms to the attribute of the corresponding fourth text box.
Aspect 16. The method of aspect 15, wherein the attribute comprises one or more of the following: a date, a number, and a string language.
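By way of illustration only, a sketch of the attribute-driven correction of Aspects 15-16 follows; the normalization rules shown (character confusions, separator normalization) are illustrative assumptions, not the rules of the present disclosure:

    import re

    def correct_by_attribute(text: str, attribute: str) -> str:
        if attribute == "number":
            # Fix common OCR confusions in numeric cells, then strip anything
            # that cannot be part of a number.
            text = text.replace("O", "0").replace("l", "1")
            return re.sub(r"[^0-9.,-]", "", text)
        if attribute == "date":
            # Normalize date separators, e.g. "2023.08/03" -> "2023-08-03".
            return re.sub(r"[./]", "-", text)
        return text  # cells of other categories are left unchanged here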
Aspect 17. The method of any one of aspects 1-16, wherein the text recognition of the one or more text boxes is based on a convolutional recurrent neural network (CRNN) model.
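For illustration, a compact PyTorch sketch of a CRNN follows (the layer sizes are arbitrary and not the configuration of the present disclosure): convolutional layers turn the text-line image into a feature sequence, a bidirectional LSTM models left-to-right context, and a linear layer emits per-timestep character logits, typically trained with CTC loss:

    import torch
    import torch.nn as nn

    class CRNN(nn.Module):
        # Minimal convolutional recurrent network for text-line recognition.
        def __init__(self, num_classes: int, img_height: int = 32):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            )
            feat_h = img_height // 4  # height after two 2x2 poolings
            self.rnn = nn.LSTM(128 * feat_h, 256,
                               bidirectional=True, batch_first=True)
            self.fc = nn.Linear(512, num_classes)  # per-timestep logits

        def forward(self, x):  # x: (N, 1, H, W) grayscale text-line images
            f = self.cnn(x)    # (N, 128, H/4, W/4)
            n, c, h, w = f.shape
            seq = f.permute(0, 3, 1, 2).reshape(n, w, c * h)  # width = time
            out, _ = self.rnn(seq)
            return self.fc(out)  # (N, W/4, num_classes)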
Aspect 18. An image recognition apparatus, comprising:
an acquisition module configured to acquire an image to be processed;
a text detection module configured to perform text detection on the image to be processed to obtain one or more text boxes and first position information of the one or more text boxes;
a text recognition module configured to perform text recognition on the one or more text boxes to obtain one or more pieces of text information contained in the one or more text boxes;
a table area determining module configured to determine, based on the obtained one or more pieces of text information, whether a table area exists in the image to be processed;
a table information determining module configured to determine, in response to determining that a table area exists in the image to be processed, table rank information corresponding to each of at least one first text box, among the one or more text boxes, that is located within the table area, based on the first position information of the at least one first text box; and
an output module configured to output, based on the table rank information of each first text box and the text information contained in each first text box, structured data corresponding to the table area of the image to be processed.
Aspect 19. An electronic device, comprising:
a memory, a processor and a computer program stored on the memory,
wherein the processor is configured to execute the computer program to implement the steps of the method according to any one of aspects 1-17.
Aspect 20. A computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of aspects 1-17.
Aspect 21. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of aspects 1-17.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the foregoing methods, systems, and apparatus are merely exemplary embodiments or examples, and that the scope of the present disclosure is not limited by these embodiments or examples but only by the granted claims and their equivalents. Various elements of the embodiments or examples may be omitted or replaced with equivalents thereof. Furthermore, the steps may be performed in an order different from that described in the present disclosure. Further, various elements of the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.

Claims (10)

1. An image recognition method, comprising:
acquiring an image to be processed;
performing text detection on the image to be processed to obtain one or more text boxes and first position information of the one or more text boxes;
performing text recognition on the one or more text boxes to obtain one or more pieces of text information contained in the one or more text boxes;
determining, based on the one or more pieces of text information, whether a table area exists in the image to be processed; and
in response to determining that a table area exists in the image to be processed:
determining, based on the first position information of at least one first text box, among the one or more text boxes, that is located within the table area, table rank information corresponding to each of the at least one first text box; and
outputting, based on the table rank information of each first text box and the text information contained in each first text box, structured data corresponding to the table area of the image to be processed.
2. The method of claim 1, wherein the table rank information comprises a row identifier and a column identifier indicating, respectively, the row and the column of each first text box in the table area.
3. The method of claim 2, wherein determining, based on the first position information of the at least one first text box located within the table area, the table rank information corresponding to each of the at least one first text box comprises:
sorting the at least one first text box according to vertical direction coordinates indicated by the first position information to obtain a sorted first text box set;
determining, for each first text box, whether a difference between the vertical direction coordinates of the first text box and those of a previous first text box in the first text box set exceeds a first preset threshold; and
incrementing, in response to determining that the difference between the vertical direction coordinates of the first text box and those of the previous first text box in the first text box set exceeds the first preset threshold, a row identifier corresponding to the first text box by 1.
4. The method of claim 3, wherein determining, based on the first position information of the at least one first text box located within the table area, the table rank information corresponding to each of the at least one first text box further comprises:
sorting the at least one first text box according to horizontal direction coordinates indicated by the first position information to obtain a sorted second text box set;
determining, for each first text box, whether a difference between the horizontal direction coordinates of the first text box and those of a previous first text box in the second text box set exceeds a second preset threshold; and
determining, in response to determining that the difference between the horizontal direction coordinates of the first text box and those of the previous first text box in the second text box set exceeds the second preset threshold, a column identifier corresponding to the first text box based on the horizontal direction coordinates of the first text box and of the previous first text box.
5. The method of claim 4, wherein determining a column identifier corresponding to the first text box based on the horizontal direction coordinates of the first text box and the horizontal direction coordinates of the previous first text box comprises:
calculating an average value of the horizontal direction coordinates of the first text box and of the previous first text box as a column separation coordinate; and
determining, based on the column separation coordinate, the column identifier corresponding to the first text box.
6. The method of claim 1, further comprising:
sorting at least one third text box, among the one or more text boxes, that is located outside the table area according to third position information of the at least one third text box; and
outputting, based on a result of the sorting and text information contained in the at least one third text box, data corresponding to a non-table area of the image to be processed.
7. An image recognition apparatus comprising:
an acquisition module configured to acquire an image to be processed;
a text detection module configured to perform text detection on the image to be processed to obtain one or more text boxes and first position information of the one or more text boxes;
a text recognition module configured to perform text recognition on the one or more text boxes to obtain one or more pieces of text information contained in the one or more text boxes;
a table area determining module configured to determine, based on the obtained one or more pieces of text information, whether a table area exists in the image to be processed;
a table information determining module configured to determine, in response to determining that a table area exists in the image to be processed, table rank information corresponding to each of at least one first text box, among the one or more text boxes, that is located within the table area, based on the first position information of the at least one first text box; and
an output module configured to output, based on the table rank information of each first text box and the text information contained in each first text box, structured data corresponding to the table area of the image to be processed.
8. An electronic device, comprising:
a memory, a processor and a computer program stored on the memory,
wherein the processor is configured to execute the computer program to implement the steps of the method according to any of claims 1-6.
9. A computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1-6.
10. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1-6.
CN202310976322.9A 2023-08-03 2023-08-03 Image recognition method and device, electronic equipment and storage medium Pending CN116958991A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310976322.9A CN116958991A (en) 2023-08-03 2023-08-03 Image recognition method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN116958991A true CN116958991A (en) 2023-10-27

Family

ID=88452876

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310976322.9A Pending CN116958991A (en) 2023-08-03 2023-08-03 Image recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116958991A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination