CN111259889A

CN111259889A - Image text recognition method and device, computer equipment and computer storage medium

Info

Publication number: CN111259889A
Application number: CN202010051370.3A
Authority: CN
Inventors: 刘舒萍
Original assignee: Ping An Medical and Healthcare Management Co Ltd
Current assignee: Shenzhen Ping An Medical Health Technology Service Co Ltd
Priority date: 2020-01-17
Filing date: 2020-01-17
Publication date: 2020-06-09

Abstract

The application discloses an image text recognition method, an image text recognition device and a computer storage medium, relates to the technical field of computers, and aims at images with complex scenes to perform structured processing on text recognition results and improve the accuracy of image text recognition. The method comprises the following steps: acquiring an image to be recognized, and preprocessing the image to be recognized to obtain a target recognition image; determining the position information of a text region in the target recognition image based on a pre-trained text region detection model; inputting the target recognition image and the position information of the text region in the target recognition image into a pre-trained text recognition model to obtain text information in the text region; and structuring the text information in the text area to obtain a text field with a mapping relation.

Description

Image text recognition method and device, computer equipment and computer storage medium

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for recognizing image texts, a computer device, and a computer storage medium.

Background

With the development of science and technology, images play a great role in information dissemination. To better promote, more and more images are being added with text, for example, in connection with medical transaction platforms, medical institutions require users to upload invoice images to check invoices based on the text content in the uploaded invoice images. Therefore, since the texts in the images usually contain relatively rich information, extracting and identifying the texts in the images is of great significance for the analysis, understanding, information retrieval and other aspects of the image contents.

The existing image text recognition method generally detects a text information box in an image at first, then recognizes the detected text information box, and finally returns the result obtained by recognition, so that automatic recognition is achieved, and the labor input cost is saved.

However, in an image of an actual application scene, there are complex and various image contents, and texts in a invoice image or many natural images are usually affected by irregular image background contents, so that there are many situations of missing detection and false detection in the existing image text recognition method, and the recognition accuracy is low, thereby causing the problems that the final text recognition result is incomplete, and the field obtained by recognition cannot correspond to the required field, and seriously affecting the use of subsequent texts.

Disclosure of Invention

In view of the above, the present invention provides an image text recognition method, an image text recognition device, a computer device, and a computer storage medium, and mainly aims to solve the problem that the accuracy of image text recognition for a complex scene is low at present.

According to an aspect of the present invention, there is provided an image text recognition method, including:

acquiring an image to be recognized, and preprocessing the image to be recognized to obtain a target recognition image;

determining the position information of a text region in the target recognition image based on a pre-trained text region detection model;

inputting the target recognition image and the position information of the text region in the target recognition image into a pre-trained text recognition model to obtain text information in the text region;

and structuring the text information in the text area to obtain a text field with a mapping relation.

Further, the structuring the text information in the text region to obtain a text field with a mapping relationship specifically includes:

selecting a preset field from the text information in the text area as a key field, and acquiring the position information of the text area corresponding to the key field;

determining a fuzzy region having a mapping relation with the key field according to the position information of the text region corresponding to the key field;

and detecting and inquiring the text information identified in the fuzzy area, and confirming the text information having a mapping relation with the key field.

Further, the determining a fuzzy region having a mapping relationship with the key field according to the position information of the text region corresponding to the key field specifically includes:

moving the text region corresponding to the key field by a preset distance along the horizontal and vertical directions, and acquiring the position information of the moved text region according to the position information of the text region corresponding to the key field;

and based on the position information of the moved text region, carrying out amplification processing on the text region after the preset distance is moved, and determining a fuzzy region having a mapping relation with the key field.

Further, the detecting and querying the text information identified in the fuzzy area and confirming the text information having a mapping relationship with the key field specifically include:

detecting the position information of all text areas in the fuzzy area, and extracting the text information of all text areas in the fuzzy area;

and traversing the text information of each text region in the fuzzy region in a regular matching mode, and confirming the text information which has a mapping relation with the key field.

Further, the traversing the text information of each text region in the fuzzy region in a regular matching manner to confirm the text information having a mapping relationship with the key field specifically includes:

constructing a regular expression matched with the key field by acquiring the mode character suitable for the key field;

and according to the regular expression matched with the key field, checking the text information of each text area in the fuzzy area, and confirming the text information having a mapping relation with the key field.

Further, before the determining the position information of the text region in the target recognition image based on the pre-trained text region detection model, the method further includes:

carrying out text region labeling on the collected image sample data, and inputting the image sample data into a network model for training to obtain a text region detection model;

the network model comprises a multilayer structure, the collected image sample data is input into the network model for training after text region labeling is carried out on the collected image sample data, and a text region detection model is obtained, and the method specifically comprises the following steps:

extracting image area features corresponding to the image sample data through the convolution layer of the network model;

generating horizontal text sequence characteristics according to image region characteristics corresponding to image sample data through a decoding layer of the network model;

and determining a text region in the image sample data according to the horizontal text sequence characteristics through a prediction layer of the network model, and processing the text region to obtain a candidate text line.

Further, the determining, by the prediction layer of the network model, a text region in the image sample data according to the horizontal text sequence feature and processing the text region to obtain a candidate text line includes:

classifying each region in the image sample data according to the horizontal text sequence features through a classification part of a prediction layer of the network model, and determining a text region in the image sample data;

and performing frame regression processing on the text region in the image text data through a regression part of a prediction layer of the network model to obtain candidate text lines.

According to another aspect of the present invention, there is provided an image text recognition apparatus, the apparatus including:

the device comprises an acquisition unit, a pre-processing unit and a processing unit, wherein the acquisition unit is used for acquiring an image to be identified and pre-processing the image to be identified to obtain a target identification image;

a determining unit configured to determine position information of a text region in the target recognition image based on a pre-trained text region detection model;

the recognition unit is used for inputting the target recognition image and the position information of the text region in the target recognition image into a pre-trained text recognition model to obtain text information in the text region;

and the processing unit is used for carrying out structuralization processing on the text information in the text area to obtain a text field with a mapping relation.

Further, the processing unit includes:

the selection module is used for selecting a preset field from the text information in the text area as a key field and acquiring the position information of the text area corresponding to the key field;

the determining module is used for determining a fuzzy region which has a mapping relation with the key field according to the position information of the text region corresponding to the key field;

and the detection module is used for detecting and inquiring the text information identified in the fuzzy area and confirming the text information which has a mapping relation with the key field.

Further, the determining module includes:

the obtaining sub-module is used for moving the text region corresponding to the key field by a preset distance along the horizontal direction and the vertical direction, and obtaining the position information of the moved text region according to the position information of the text region corresponding to the key field;

and the determining submodule is used for amplifying the post-text region after the preset distance is moved based on the position information of the post-text region after the movement, and determining a fuzzy region which has a mapping relation with the key field.

Further, the detection module includes:

the extraction submodule is used for detecting the position information of all the text areas in the fuzzy area and extracting the text information of all the text areas in the fuzzy area;

and the confirmation sub-module is used for traversing the text information of each text region in the fuzzy region in a regular matching mode and confirming the text information which has a mapping relation with the key field.

Further, the confirmation submodule is specifically configured to construct a regular expression matched with the key field by obtaining the mode character applicable to the key field;

the confirmation sub-module is specifically further configured to verify the text information of each text region in the fuzzy region according to the regular expression matched with the key field, and confirm the text information having a mapping relationship with the key field.

Further, the apparatus further comprises:

the training unit is used for performing text region labeling on collected image sample data and inputting the image sample data into a network model for training to obtain a text region detection model before determining the position information of a text region in the target recognition image based on the pre-trained text region detection model;

the network model comprises a multilayer structure, and the training unit comprises:

the extraction module is used for extracting image area characteristics corresponding to the image sample data through the convolution layer of the network model;

the generating module is used for generating horizontal text sequence characteristics according to image region characteristics corresponding to image sample data through a decoding layer of the network model;

and the prediction module is used for determining a text region in the image sample data according to the horizontal text sequence characteristics through a prediction layer of the network model, and processing the text region to obtain a candidate text line.

Further, the prediction layer of the network model includes a classification portion and a regression portion, the prediction module includes:

the classification submodule is used for classifying each region in the image sample data according to the horizontal text sequence characteristics through a classification part of a prediction layer of the network model, and determining a text region in the image sample data;

and the processing submodule is used for carrying out frame regression processing on the text area in the image text data through a regression part of a prediction layer of the network model to obtain candidate text lines.

According to yet another aspect of the present invention, there is provided a computer device comprising a memory storing a computer program and a processor implementing the steps of the image text recognition method when the processor executes the computer program.

According to a further aspect of the invention, a computer storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the image text recognition method.

By means of the technical scheme, the image text recognition method and the image text recognition device are characterized in that an image to be recognized is obtained, the image to be recognized is preprocessed to obtain a target recognition image, the position information of a text region in the target recognition image is determined based on a pre-trained text region detection model, after the target recognition image and the position information of the text region in the target recognition image are input into the pre-trained text recognition model, the text information in the text region is recognized, and the text information in the text region is structurally processed into a text field with a mapping relation. Compared with the image text recognition method in the prior art, the method has the advantages that the recognized text information is subjected to structural processing, so that the interference information in the image can be effectively removed, the text information of the image is accurately reserved, the image detected in the text area and recognized by the text information is not interfered by the background, the corresponding relation of different fields is output and realized, and the accuracy of image text recognition is improved.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

fig. 1 is a schematic flow chart illustrating an image text recognition method according to an embodiment of the present invention;

FIG. 2 is a flow chart of another image text recognition method provided by the embodiment of the invention;

fig. 3 is a schematic structural diagram illustrating an image text recognition apparatus according to an embodiment of the present invention;

fig. 4 is a schematic structural diagram illustrating another image text recognition apparatus according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

The embodiment of the invention provides an image text recognition method, which can perform safety authentication on a service calling party and ensure the safety of service calling, and as shown in figure 1, the method comprises the following steps:

101. and acquiring an image to be recognized, and preprocessing the image to be recognized to obtain a target recognition image.

The image to be identified can be an invoice image, an advertisement image, a commodity image and the like. The preprocessing process for the image to be recognized can include, but is not limited to, correcting the image rotation and removing background interference from the image.

It can be understood that the user may not consider the direction of the image and the shooting angle when uploading the image to be recognized, for example, some users are used to horizontal shooting and some users are used to vertical shooting. In order to facilitate subsequent image processing, the image needs to be angle-normalized, here, a 4-class model may be trained in advance by using a Resnet152 network based on a neural network, and after the image to be recognized is input by the class model, a prediction angle may be given: and performing rotation correction on the image to be recognized at 0 degree, 90 degrees, 180 degrees and 270 degrees according to the prediction angle output by the classification model.

The implementation process of the training process of the specific classification model can be as follows: firstly, 4 kinds of images are prepared as training data, namely 0-degree, 90-degree, 180-degree and 270-degree images and corresponding angle labels; and then inputting training data into a Resnet152 network, extracting the characteristics of each image by the network, predicting the corresponding angle, performing backward propagation based on the deviation between the predicted value and the actual value of the angle label as loss, updating parameters in the network, after training of the classification model is completed, keeping forward propagation, performing angle prediction on the image to be recognized to obtain a predicted angle, rotationally correcting the image to be recognized with the predicted angle different from 0 degree to 0 degree, so that the predicted angles corresponding to the image to be recognized are unified to 0 degree, and thus obtaining the image to be recognized with the same angle.

It can be understood that the image to be recognized uploaded by the user usually contains complex background information, some background textures are very similar to text textures, some even contain texts, the backgrounds greatly interfere recognition and detection, and image background interference needs to be removed, where the image to be recognized is subjected to background and foreground segmentation based on a network model constructed by deep lab-v3 in combination with an image segmentation technology, a minimum external rectangle of the foreground is calculated, and the background is cut off along a minimum external matrix.

Specifically, the implementation process of performing image background and foreground segmentation on an image to be recognized based on a network model constructed based on deep lab-v3 in combination with an image segmentation technology may be as follows: firstly, preparing training data, wherein an image data set and corresponding labels are prepared, the size of a pixel corresponding to the label of each image is consistent with that of an original image, the pixel value corresponding to the label is 0 to represent an image background, the pixel value corresponding to the label is 255 to represent an image foreground, a deep Lab-v3 network is built, the training data is input, each pixel is subjected to binary prediction, loss between a predicted value and a real pixel value of the label is calculated, then, the prediction is carried out in a reverse direction, parameters in the network are updated until an accuracy evaluation index mean-IOU of a network model reaches a preset value. Identifying pixels in an image to be identified based on the constructed network model to obtain marks of a foreground and a background in the image, namely identifying the marks as an image background with a pixel value of 0, identifying the foreground marked as the image with a pixel value of 255, segmenting the image to be identified based on the marks of the background, and removing the background of the image to be identified.

The image to be recognized after the preprocessing can better express the image characteristics, so that the image to be recognized can be used as a target image to further recognize texts in the image.

102. And determining the position information of the text region in the target recognition image based on a pre-trained text region detection model.

The pre-trained Text box detection model can use an open source detection Text in natural image with connection Text forward Network (CTPN) frame, each image identification image has a corresponding output file through the Text region detection model, the output file stores position information obtained by Detecting all Text regions, and the position information of the Text region in the target identification image can be determined through the output file.

The specific process of training the text region detection model may be as follows: firstly, training data, namely an image and a label file corresponding to the image are prepared, coordinate information of a text region in the image is stored in the label file, in order to facilitate the subsequent detection of the text region in the image, before each training data is input into a CTPN network, the coordinate information marked on the text region needs to be converted into a small anchor with the width of 8, the CTPN network structure adopts the form of CNN + BLSTM + RPN, the CNN is used for extracting the spatial characteristics of a receptive field in the image, the receptive field is a region of a certain node which outputs a feature map (convolved by a convolution kernel) and corresponds to the response of the input image, the BLSTM can generate horizontal text sequence characteristics based on the spatial characteristics of the receptive field, the RPN comprises two parts, the anchor classification and bounding box regression, after the anchor classification, whether each region is a text region or not can be determined, and after the bounding box regression processing is carried out, a set of vertically striped candidate text lines is obtained.

It should be noted that, the text region output by the pre-trained text region detection model is not directly the text region in the target recognition image, but is a set of candidate text lines in vertical stripes forming the text region in the target recognition image, and the text region in the target recognition image and the position information of the text region may be determined by connecting the set of candidate text lines in vertical stripes to form the text region by using a text line construction algorithm.

103. And inputting the target recognition image and the position information of the text region in the target recognition image into a pre-trained text recognition model to obtain the text information in the text region.

The Text Recognition model can be trained by adopting An End-to-End variable Neural Network for imaged-based Sequence Recognition and Its Application to Scene Text Recognition (CRNN) algorithm, and the position information of the Text region in the target Recognition image and the target Recognition image is output as a Text Recognition result corresponding to each Text region after passing through the Text Recognition model.

The process of specifically training the CRNN model may be as follows: firstly, training data are stored in a mode of labeling text information of an image and a text region in the image, a CRNN structure adopts a mode of CNN + RNN + CTC, the CNN is used for extracting spatial characteristics of a receptive field in the image, the RNN can predict label distribution of each frame in the image based on the spatial characteristics of the receptive field, and the CTC can integrate the label distribution of each frame and the like to form a final label sequence. For example, the size of the input picture resize to W × 32, and the predicted value output by the text recognition model represents text information corresponding to a text region in the target recognition image.

104. And structuring the text information in the text area to obtain a text field with a mapping relation.

There may be various types included in the text information in the text region due to the recognition model, for example. The text type, the number type, the special character type and the like, and mapping relations may exist between the text type, the number type, the special character type and the like, for example, a text region of 37.5 yuan corresponding to a text region of 'payment amount' and a text region of 'woman' corresponding to a text box of 'gender', and the mapping relations between the text information can be clearly seen by performing structured output on the text information.

Specifically, text information identified by a specific text region in the image can be used as a key field, a fuzzy region corresponding to the text field with a mapping relationship with the key field can be located based on the key field, and then the field with the mapping relationship with the key field is locked in a region range, and then the text information in the region range is verified, so that the text information with the mapping relationship with the key field is confirmed.

The image text recognition method provided by the embodiment of the invention obtains an image to be recognized, preprocesses the image to be recognized to obtain a target recognition image, determines the position information of a text region in the target recognition image based on a pre-trained text region detection model, recognizes and obtains the text information in the text region after inputting the target recognition image and the position information of the text region in the target recognition image into the pre-trained text recognition model, and structures the text information in the text region into a text field with a mapping relation. Compared with the image text recognition method in the prior art, the method has the advantages that the recognized text information is subjected to structural processing, so that the interference information in the image can be effectively removed, the text information of the image is accurately reserved, the image detected in the text area and recognized by the text information is not interfered by the background, the corresponding relation of different fields is output and realized, and the accuracy of image text recognition is improved.

The embodiment of the invention provides another image text recognition method, which can perform security authentication on a service calling party and ensure the security of service calling, and as shown in fig. 2, the method comprises the following steps:

201. and acquiring an image to be recognized, and preprocessing the image to be recognized to obtain a target recognition image.

For the embodiment of the present invention, the process of specifically acquiring the image to be recognized, and preprocessing the image to be recognized to obtain the target recognition image may refer to the content described in step 101, which is not described herein again.

202. And performing text region labeling on the collected image sample data, and inputting the image sample data into a network model for training to obtain a text region detection model.

The image sample data may be images collected from different scenes, and may reflect image features in different scenes, for example, image features in the medical field are relatively simple or gradually changed, an image background scene in an industrial field is relatively complex and a text region is relatively small, an image background in a natural scene is relatively highly influenced by natural factors, and the background complexity is difficult to predict.

It can be understood that in the general target detection of any image in any scene, each target has a definite closed boundary, and since a text line or word in the image is composed of many individual characters or strokes, such a definite boundary may not exist, a text region in the image needs to be detected first, and specifically, a text region included in each image in image sample data may be labeled, and an image sample data after the labeling is trained to construct a text region detection model, and the text region in the image is detected by using the text region detection model, so as to identify a text in the image.

For the embodiment of the invention, the network model can adopt a CTPN network frame and comprises a 3-layer structure, the first layer is a convolution structure, namely a CNN structure, and spatial information of a receptive field can be learned by extracting image region characteristics corresponding to image sample data through convolution layers; the second layer is a decoding layer, namely a BLSTM structure, and generates horizontal text sequence characteristics according to image region characteristics corresponding to image sample data through the decoding layer, so that the sequence characteristics of horizontal texts can be better dealt with; and the third layer is a prediction layer, namely an RPN structure, determines a text region in the image sample data according to the horizontal text sequence characteristics through the prediction layer, and processes the text region to obtain candidate text lines.

Specifically, the prediction layer of the network model comprises a classification part and a regression part, and in the process of determining a text region in the image sample data according to the horizontal text sequence characteristics through the prediction layer of the network model and processing the text region to obtain candidate text lines, the classification part of the prediction layer of the network model can classify each region in the image sample data according to the horizontal text sequence characteristics to determine the text region in the image sample data; and performing frame regression processing on the text region in the image text data through a regression part of a prediction layer of the network model to obtain candidate text lines.

In the specific implementation process, in the convolutional layer part, the CTPN may select feature maps of conv5 in the VGG model as the final feature of the image, where the size of the feature maps is H × W × C; then, due to the sequence relation among texts, a 3 × 3 area around each point on the feature maps can be extracted by adopting a 3 × 3 sliding window at the decoding layer to be used as the feature vector representation of the point, at the moment, the size of the image is changed into H × W × 9C, then each line is used as the length of the sequence, the height is used as the batch _ size, a 128-dimensional Bi-LSTM is transmitted in, and the output of the decoding layer is W × H × 256; and finally, outputting a decoding layer and accessing the decoding layer into a prediction layer, wherein the prediction layer comprises two parts, namely anchor classification and bounding box regression, whether each region in the image is a text region can be determined through the anchor classification, and a group of vertical strip-shaped candidate text lines can be obtained after the bounding box regression processing and carry a label of whether the candidate text lines are the text regions.

Further, in order to ensure the accuracy of the prediction of the trained text region detection model, the preset loss function can perform parameter adjustment on the multilayer structure in the text region detection model based on the deviation between the result output by the text region detection model and the data labeled by the real text region. For the embodiment of the invention, the pre-trained loss function comprises 3 parts, wherein the first part is a loss function for detecting whether the Anchor is a text region; the second part is a loss function for detecting regression of the anchor's y-coordinate offset; the third part is the loss function of the x-coordinate offset regression used to detect Anchor.

203. And determining the position information of the text region in the target recognition image based on a pre-trained text region detection model.

It can be understood that each image has a corresponding output file through the text region detection model, the output file stores the position information of all candidate text lines in the image and whether the candidate text lines are labels of the text regions, the candidate text lines are equivalent to vertical strip lines split from the text regions, and the position information of the text regions in the target image can be determined by seeing the connection of the candidate text lines.

204. And inputting the target recognition image and the position information of the text region in the target recognition image into a pre-trained text recognition model to obtain the text information in the text region.

It can be understood that the trained text recognition model has the capability of recognizing the text information in the text region, and since the sample image and the position information labels of the text region in the sample image are used in the process of training the text recognition model, the parameters of the text recognition model are continuously adjusted through forward propagation and reverse deviation correction, so that the text information in the text region can be accurately recognized through the image of the text recognition model.

205. And selecting a preset field from the text information in the text area as a key field, and acquiring the position information of the text area corresponding to the key field.

Because the key field is usually a field with reference value in the image, for the image of the invoice class, fields such as invoice number, total amount, medical insurance type, quantity, date and time and the like can be selected as preset fields and used as the key field, and the coordinate information corresponding to the key field can be obtained through a text region detection model.

206. And determining a fuzzy region having a mapping relation with the key field according to the position information of the text region corresponding to the key field.

For the embodiment of the present invention, in order to accurately locate the text information mapped to the key field, since the text information mapped to the key field is usually located at one side of the key field, the text information mapped to the key field can be located within a fuzzy area by determining a fuzzy area mapped to the key field, for example, the coordinate information of the text area corresponding to the key field can be [ (Xmin, Ymin), (Xmax, Ymax) ], where (Xmin, Ymin) is the coordinate of the upper left corner and (Xmax, Ymax) is the coordinate of the lower right corner, and then the coordinate information of the fuzzy area mapped to the key field can be [ (Xmin, Ymin +2/3(Ymax-Ymin)), (Xmin +1/2(Xmax-Xmin), Ymax) ], where (Xmin, Ymin +2/3(Ymax-Ymin)) is the coordinate of the upper left corner, (Xmin +1/2(Xmax-Xmin), Ymax) is the lower right corner coordinate.

Specifically, the position information of the moved text region, for example, the distance moved by 1/2(Xmax-Xmin) in the horizontal direction and the distance moved by 2/3(Ymax-Ymin) in the vertical direction, may be obtained according to the position information of the text region corresponding to the key field by moving the text region corresponding to the key field by preset distances in the horizontal and vertical directions, and since the moved text region may not cover the text information having a mapping relationship with the key field, the text region after the preset distance is moved is subjected to an amplification process based on the position information of the moved text region, and a fuzzy region having a mapping relationship with the key field is determined.

207. And detecting and inquiring the text information identified in the fuzzy area, and confirming the text information having a mapping relation with the key field.

The method specifically includes detecting position information of all text regions located in the fuzzy region, extracting text information of all text regions in the fuzzy region, and further traversing the text information of each text region in the fuzzy region in a regular matching mode to confirm the text information having a mapping relation with the key field because the text information of all text regions in the fuzzy region is not the text information having a mapping relation with the key field.

In the embodiment of the present invention, in the process of traversing the text information of each text region in the fuzzy region and confirming the text information having a mapping relationship with the key field in a regular seven-match manner, the regular expression matching the key field may be constructed, specifically, by acquiring the pattern character applicable to the key field, for example, in the case of the total amount as the key field, the pattern character applicable to the total amount may include whole zero, one, two, three, four, five, land, seven, eight, ten, hundreds, three, etc., and the regular expression applicable to the total amount may be, but not be, "([ whole zero two three four seven {1,2} ten) {0,1} ([ whole zero two four seven } four) {0,1} ([ whole zero three four seven } four {0,1, 2} four } (" four {1, 1 }),1 },two two four [ two four } four {0,2} ten {0,1} zero {0,1} ([ Whole zero two three four land seven Ba ] {0,2} [ Yuyuan ]) {0,1} ([ Whole zero two three land seven Ba ] {1,2} angle) {0,1} ([ Whole zero two three land four seven Ba ] {1,2} fraction) {0,1} whole {0,1} ". And according to the regular expression matched with the key fields, checking the text information of each text region in the fuzzy region, and confirming the text information having a mapping relation with the key fields, so that the text information meeting the requirement of the regular expression is the text information having the mapping relation with the key fields.

It should be noted that, because field formats and types in different text information are different, some fields are structured and confirmed in a manner suitable for regular matching, and structured and confirmed in other fields may adopt a coordinate calculation manner, specifically, based on the coordinates of the lower left corner of a key field, the height of a text region of the key field is used as a calculation unit of point movement, a text region closest to this point is calculated as a text region of the next key field, and a text field having a mapping relationship with the key field is further confirmed based on the text region of the next key field.

Further, as a specific implementation of the method shown in fig. 1, an embodiment of the present invention provides an image text recognition apparatus, and as shown in fig. 3, the apparatus includes: an acquisition unit 31, a determination unit 32, a recognition unit 33, and a processing unit 34.

The acquiring unit 31 may be configured to acquire an image to be identified, and pre-process the image to be identified to obtain a target identification image;

a determining unit 32, configured to determine location information of a text region in the target recognition image based on a pre-trained text region detection model;

the recognition unit 33 may be configured to input the target recognition image and the position information of the text region in the target recognition image into a pre-trained text recognition model, so as to obtain text information in the text region;

the processing unit 34 may be configured to perform structuring processing on the text information in the text region to obtain a text field with a mapping relationship.

The image text recognition device provided by the embodiment of the invention obtains the image to be recognized, preprocesses the image to be recognized to obtain the target recognition image, determines the position information of the text region in the target recognition image based on the pre-trained text region detection model, recognizes the text information in the text region after inputting the target recognition image and the position information of the text region in the target recognition image into the pre-trained text recognition model, and structures the text information in the text region into the text field with the mapping relation. Compared with the image text recognition method in the prior art, the method has the advantages that the recognized text information is subjected to structural processing, so that the interference information in the image can be effectively removed, the text information of the image is accurately reserved, the image detected in the text area and recognized by the text information is not interfered by the background, the corresponding relation of different fields is output and realized, and the accuracy of image text recognition is improved.

As a further description of the image text recognition apparatus shown in fig. 3, fig. 4 is a schematic structural diagram of another image text recognition apparatus according to an embodiment of the present invention, and as shown in fig. 4, the processing unit 34 includes:

a selecting module 341, configured to select a preset field from the text information in the text region as a key field, and obtain location information of the text region corresponding to the key field;

the determining module 342 may be configured to determine, according to the position information of the text region corresponding to the key field, a fuzzy region having a mapping relationship with the key field;

the detecting module 343 may be configured to detect and query the text information identified in the fuzzy area, and confirm the text information having a mapping relationship with the key field.

Further, the determining module 342 comprises:

the obtaining sub-module 3421 may be configured to move the text region corresponding to the key field by a preset distance in the horizontal and vertical directions, and obtain the position information of the moved text region according to the position information of the text region corresponding to the key field;

the determining sub-module 3422 may be configured to perform amplification processing on the post-text region after the preset distance is moved based on the position information of the post-text region after the movement, and determine a fuzzy region having a mapping relationship with the key field.

Further, the detection module 343 includes:

an extracting sub-module 3431, configured to detect position information of all text regions located in the blurred region, and extract text information of all text regions in the blurred region;

the confirming sub-module 3432 may be configured to traverse the text information of each text region in the fuzzy region in a regular matching manner, and confirm the text information having a mapping relationship with the key field.

Further, the confirmation sub-module 3432 may be specifically configured to construct a regular expression matched with the key field by obtaining the mode character applicable to the key field;

the determining sub-module 3432 may be further configured to verify the text information of each text region in the fuzzy region according to the regular expression matched with the key field, and determine the text information having a mapping relationship with the key field.

Further, the apparatus further comprises:

the training unit 35 may be configured to, before determining the position information of the text region in the target recognition image based on the pre-trained text region detection model, perform text region labeling on collected image sample data, and then input the image sample data into a network model for training to obtain a text region detection model;

the network model includes a multi-layer structure, and the training unit 35 includes:

the extraction module 351 may be configured to extract image region features corresponding to image sample data through a convolution layer of the network model;

a generating module 352, configured to generate, by a decoding layer of the network model, a horizontal text sequence feature according to an image region feature corresponding to image sample data;

the prediction module 353 may be configured to determine a text region in the image sample data according to the horizontal text sequence feature through a prediction layer of the network model, and process the text region to obtain a candidate text line.

Further, the prediction layer of the network model comprises a classification part and a regression part, and the prediction module 353 comprises:

a classification sub-module 3531, configured to classify, by a classification portion of a prediction layer of the network model, each region in the image sample data according to the horizontal text sequence feature, and determine a text region in the image sample data;

the processing sub-module 3532 may be configured to perform border regression processing on a text region in the image text data through a regression portion of a prediction layer of the network model, so as to obtain candidate text lines.

It should be noted that other corresponding descriptions of the functional units related to the image text recognition apparatus provided in this embodiment may refer to the corresponding descriptions in fig. 1 and fig. 2, and are not repeated herein.

Based on the methods shown in fig. 1 and fig. 2, correspondingly, the present embodiment further provides a storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the image text recognition method shown in fig. 1 and fig. 2.

Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the implementation scenarios of the present application.

Based on the method shown in fig. 1 and fig. 2 and the virtual device embodiment shown in fig. 3 and fig. 4, in order to achieve the above object, an embodiment of the present application further provides a computer device, which may specifically be a personal computer, a server, a network device, and the like, where the entity device includes a storage medium and a processor; a storage medium for storing a computer program; a processor for executing a computer program to implement the image text recognition method shown in fig. 1 and 2.

Optionally, the computer device may also include a user interface, a network interface, a camera, Radio Frequency (RF) circuitry, sensors, audio circuitry, a WI-FI module, and so forth. The user interface may include a Display screen (Display), an input unit such as a keypad (Keyboard), etc., and the optional user interface may also include a USB interface, a card reader interface, etc. The network interface may optionally include a standard wired interface, a wireless interface (e.g., a bluetooth interface, WI-FI interface), etc.

Those skilled in the art will appreciate that the physical device structure of the image text recognition apparatus provided in the present embodiment does not constitute a limitation to the physical device, and may include more or less components, or combine some components, or arrange different components.

The storage medium may further include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the computer device described above, supporting the operation of information handling programs and other software and/or programs. The network communication module is used for realizing communication among components in the storage medium and other hardware and software in the entity device.

Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus a necessary general hardware platform, and can also be implemented by hardware. Compared with the prior art, the technical scheme has the advantages that the text information after identification is subjected to structural processing, so that the interference information in the image can be effectively removed, the text information of the image is accurately reserved, the image detected in the text area and identified by the text information is not interfered by the background, the corresponding relation of different fields is output, and the accuracy of image text identification is improved.

Those skilled in the art will appreciate that the figures are merely schematic representations of one preferred implementation scenario and that the blocks or flow diagrams in the figures are not necessarily required to practice the present application. Those skilled in the art will appreciate that the modules in the devices in the implementation scenario may be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.

The above application serial numbers are for description purposes only and do not represent the superiority or inferiority of the implementation scenarios. The above disclosure is only a few specific implementation scenarios of the present application, but the present application is not limited thereto, and any variations that can be made by those skilled in the art are intended to fall within the scope of the present application.

Claims

1. An image text recognition method, characterized in that the method comprises:

2. The method according to claim 1, wherein the structuring the text information in the text area to obtain a text field with a mapping relationship specifically includes:

3. The method according to claim 2, wherein the determining, according to the position information of the text region corresponding to the key field, the fuzzy region having a mapping relationship with the key field specifically includes:

4. The method according to claim 2, wherein the detecting and querying the text information identified in the fuzzy area and confirming the text information having a mapping relationship with the key field specifically comprises:

5. The method according to claim 4, wherein traversing the text information of each text region in the fuzzy region in a regular matching manner to confirm the text information having a mapping relationship with the key field specifically includes:

6. The method according to any one of claims 1-5, wherein before determining the location information of the text region in the target recognition image based on the pre-trained text region detection model, the method further comprises:

7. The method according to claim 6, wherein the prediction layer of the network model includes a classification part and a regression part, and the determining, by the prediction layer of the network model, the text region in the image sample data according to the horizontal text sequence feature and processing the text region to obtain candidate text lines includes:

8. An image text recognition apparatus, characterized in that the apparatus comprises:

9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.

10. A computer storage medium on which a computer program is stored, characterized in that the computer program, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.