Disclosure of Invention
In view of the above, the present invention provides an image text recognition method, an image text recognition device, a computer device, and a computer storage medium, and mainly aims to solve the problem that a field recognized in an image at present cannot be matched to the required field, which makes the recognized text information inconvenient to use subsequently.
According to an aspect of the present invention, there is provided an image text recognition method, including:
acquiring an image to be recognized, and preprocessing the image to be recognized to obtain a target recognition image;
determining the position information of the text region and the classification label of the text region in the target recognition image based on a pre-trained text region detection model;
inputting the target recognition image and the position information of the text region in the target recognition image into a pre-trained text recognition model to obtain the text information in the text region under each classification label to form a text field with classification identification;
and checking the text fields with the classification identifications by using a pre-constructed text checking library corresponding to each classification identification.
Further, before the text verification library corresponding to each pre-constructed classification identifier is used to verify the text field with the classification identifier, the method further includes:
converting the collected dictionary samples into code expression, inputting the code expression into a first network model for training to obtain a text mapping model;
the first network model comprises a multilayer structure, the collected dictionary samples are converted into coded representations and then input into the first network model for training to obtain a text mapping model, and the method specifically comprises the following steps:
performing probability prediction on the text field samples represented by the codes through an input layer of the first network model to generate probability distribution of each text field sample;
training each text field sample as the weight of an output field according to the probability distribution of each text field sample through a hidden layer of the first network model to obtain a mapping matrix of the text field;
and adjusting the weight in the mapping matrix of the text field through the full connection layer of the first network model to obtain a text mapping model.
Furthermore, the representation form of the text field under each classification identifier in the text verification library corresponding to each pre-constructed classification identifier is a vector value; before the text verification library corresponding to each pre-constructed classification identifier is used for verifying the text field with the classification identifier, the method further includes:
and coding and representing the text field with the classification identification by using a pre-trained text mapping model to obtain a vector value of the text field with the classification identification.
Further, the verifying the text field with the classification identifier by using a text verification library corresponding to each pre-constructed classification identifier specifically includes:
similarity matching is carried out on the vector value of the text field with the classification identification and the vector value corresponding to the text field under the corresponding classification identification in the text check library;
and checking the text field with the classification identification according to the value obtained by the similarity matching.
Further, the checking the text field with the classification identifier according to the value obtained by the similarity matching specifically includes:
if the value obtained by the similarity matching is larger than a preset threshold value, outputting the text field with the classification identification as a text recognition result;
and if the value obtained by the similarity matching is smaller than or equal to a preset threshold value, outputting the text field with the classification identification as a text verification result.
Further, before determining the location information of the text region and the classification label of the text region in the target recognition image based on the pre-trained text region detection model, the method further includes:
carrying out text region labeling and classification labeling on the collected image sample data, and inputting the image sample data into a second network model for training to obtain a text region detection model;
the second network model comprises a multilayer structure, the collected image sample data is input into the second network model for training after text region labeling and classification labeling are carried out on the collected image sample data, and a text region detection model is obtained, and the method specifically comprises the following steps:
extracting image area features corresponding to the image sample data through the convolution layer of the second network model;
determining a text region in the image sample data by using a multi-scale candidate text box through a prediction layer of the second network model to predict a boundary box of image region characteristics corresponding to the image sample data;
classifying categories of the text regions in the image sample data according to the classification labels of the text regions through a logistic regression layer of the second network model to obtain position information of the text regions and the classification labels of the text regions.
Further, after the text region labeling and classification labeling are performed on the collected image sample data and then the collected image sample data is input into a second network model for training, so as to obtain a text region detection model, the method further comprises the following steps:
and performing parameter adjustment on the multilayer structure in the text region detection model by adopting a preset loss function and the labeling data obtained by performing text region labeling and classification labeling on the image sample data.
According to another aspect of the present invention, there is provided an image text recognition apparatus, the apparatus including:
an acquisition unit, configured to acquire an image to be recognized and preprocess the image to be recognized to obtain a target recognition image;
a determining unit, configured to determine, based on a pre-trained text region detection model, position information of a text region in the target recognition image and a classification label of the text region;
the processing unit is used for inputting the target recognition image and the position information of the text region in the target recognition image into a pre-trained text recognition model to obtain the text information in the text region under each classification label and form a text field with the classification label;
and the checking unit is used for checking the text fields with the classification identifications by using a pre-constructed text checking library corresponding to each classification identification.
Further, the apparatus further comprises:
the first training unit is used for converting collected dictionary samples into code representations and inputting the code representations into a first network model for training to obtain a text mapping model before the text fields with the classification identifications are verified by using the pre-constructed text verification libraries corresponding to the classification identifications;
the first training unit includes:
the generating module is used for carrying out probability prediction on the text field samples represented by the codes through an input layer of the first network model to generate probability distribution of each text field sample;
the training module is used for training each text field sample as the weight of an output field according to the probability distribution of each text field sample through the hidden layer of the first network model to obtain a mapping matrix of the text field;
and the adjusting module is used for adjusting the weight in the mapping matrix of the text field through the full connection layer of the first network model to obtain a text mapping model.
Further, the representation form of the text field under each classification identifier in the text check library corresponding to each pre-constructed classification identifier is a vector value, and the apparatus further includes:
and the coding unit is used for coding and representing the text field with the classification identification by using a pre-trained text mapping model before checking the text field with the classification identification by using the pre-constructed text checking library corresponding to each classification identification to obtain the vector value of the text field with the classification identification.
Further, the verification unit includes:
the matching module is used for matching the similarity of the vector value of the text field with the classification identifier with the vector value corresponding to the text field under the corresponding classification identifier in the text check library;
and the checking module is used for checking the text field with the classification identification according to the value obtained by the similarity matching.
Further, the verification module is specifically configured to output the text field with the classification identifier as a text recognition result if the value obtained by the similarity matching is greater than a preset threshold;
the verification module is specifically configured to output the text field with the classification identifier as a text verification result if the value obtained by the similarity matching is smaller than or equal to a preset threshold.
Further, the apparatus further comprises:
the second training unit is used for performing text region labeling and classification labeling on the collected image sample data before determining the position information of the text region and the classification label of the text region in the target recognition image based on the pre-trained text region detection model, and inputting the image sample data into the second network model for training to obtain the text region detection model;
the second training unit comprises:
the extraction module is used for extracting image area characteristics corresponding to the image sample data through the convolution layer of the second network model;
the prediction module is used for predicting a boundary box of image region characteristics corresponding to the image sample data by using the multi-scale candidate text box through a prediction layer of the second network model to determine a text region in the image sample data;
and the classification module is used for classifying the categories of the text regions in the image sample data according to the classification labels of the text regions through the logistic regression layer of the second network model to obtain the position information of the text regions and the classification labels of the text regions.
Further, the apparatus further comprises:
and an adjusting unit, configured to, after the collected image sample data is subjected to text region labeling and classification labeling and input into the second network model for training to obtain the text region detection model, perform parameter adjustment on the multilayer structure in the text region detection model by adopting a preset loss function and the labeling data obtained by performing text region labeling and classification labeling on the image sample data.
According to yet another aspect of the present invention, there is provided a computer device comprising a memory storing a computer program and a processor implementing the steps of the image text recognition method when the processor executes the computer program.
According to a further aspect of the invention, a computer storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the image text recognition method.
By means of the technical scheme, the invention provides an image text recognition method and device, position information of a text region in a target recognition image and a classification label of the text region are determined based on a pre-trained text region detection model, the text information in the text region under each classification label is obtained by inputting the target recognition image and the position information of the text region in the target recognition image into the pre-trained text recognition model, text fields with classification labels are formed, a text check library corresponding to each classification label is pre-constructed, and the text fields with the classification labels are checked, so that the fields obtained by recognition correspond to required fields, and the use of the text information in the image is facilitated. Compared with the image text recognition method in the prior art, the text information recognized in the text region is verified by pre-constructing the text verification library corresponding to each classification identifier, and because the text verification library comprises all text fields under the classification identifiers, wrong fields in the text information are corrected, so that the utilization rate of the text information in the image is improved.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiment of the invention provides an image text recognition method, which can verify a text recognition result and improve the utilization rate of image text information, and as shown in figure 1, the method comprises the following steps:
101. and acquiring an image to be recognized, and preprocessing the image to be recognized to obtain a target recognition image.
The image to be recognized can be an invoice image, an advertisement image, a commodity image, and the like. The preprocessing performed on the image to be recognized may include, but is not limited to, graying the image, extracting the region of interest of the image, and correcting the direction of the characters in the image.
It can be understood that the color information does not reflect the morphological features of the image. In order to simplify the image information in the image to be recognized, the image to be recognized may be subjected to gray processing; specifically, the RGB channels of the image to be recognized may be converted to gray, turning a 24-bit true color image into an 8-bit gray image and greatly reducing the dimensionality of the image.
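As an illustrative sketch of the conversion above (the disclosure does not specify the luminance weights, so the common ITU-R BT.601 coefficients are assumed here), the per-pixel reduction from 24-bit RGB to 8-bit gray can look like:

```python
def rgb_to_gray(pixel):
    """Convert one 24-bit RGB pixel (three 8-bit channels) to an 8-bit
    gray value using the common ITU-R BT.601 luminance weights
    (an assumption; the patent does not name the weights)."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)

def grayscale(image):
    """Apply the per-pixel conversion to a 2-D list of RGB triples,
    reducing three channels to one and shrinking the data accordingly."""
    return [[rgb_to_gray(p) for p in row] for row in image]

# A tiny 1x2 "image": pure red and black.
gray = grayscale([[(255, 0, 0), (0, 0, 0)]])
```

A real pipeline would operate on array data rather than nested lists, but the arithmetic is the same.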
It can be understood that, because the image to be recognized usually contains interference such as background, and these regions may not be the regions of interest of the user, in order to extract the user's regions of interest in the image to be recognized, the regions of interest are specifically recognized by a region recognition model trained with the Faster R-CNN algorithm, so as to segment the regions of interest from the image to be recognized.
The network model constructed by the Faster R-CNN algorithm specifically comprises a convolutional layer, an RPN network, ROI pooling, and classification and regression, and the Faster R-CNN algorithm is realized by the following process: firstly, image features are extracted through the convolutional layer of the network model, with the whole image as input and the extracted feature maps as output; then candidate regions are recommended through the RPN network of the network model, with the feature maps as input and a plurality of candidate regions as output; the ROI pooling of the network model converts the input candidate regions of different sizes into fixed-length outputs; and finally, the classification and regression of the network model output the class to which each candidate region belongs and the precise position coordinates of the candidate region in the image.
Specifically, in the process of training the region recognition model by using the Faster R-CNN algorithm, a large amount of image sample data can be collected, the regions of interest in the image sample data are manually marked according to a preset marking mode, and the image sample data is input into the network model; the network model forward propagates and outputs the four box coordinate positions, the error is continuously back-propagated by calculating a loss function, and the training parameters of each network layer are adjusted until the training is finished.
It can be understood that the user may not consider the direction of the image, the shooting angle, and the like when uploading the image to be recognized, so the character direction needs to be corrected. Specifically, the picture can be corrected by adopting the Hough transform. The principle is to exploit the mapping between the space where the image is located and the Hough space: a curve or straight line of a given shape in the rectangular coordinate system of the image to be recognized is mapped to a point in the Hough space, forming a peak, so that the problem of detecting an arbitrary shape is converted into the problem of finding a peak. That is, a straight line in the rectangular coordinate system of the image to be recognized is converted into a point in the Hough space formed by the intersection of a plurality of curves, and the statistical peak value is the number of curves intersecting at that point.
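The voting idea described above can be sketched in a few lines of pure Python (the (theta, rho) discretization granularity and the subsequent deskewing step are illustrative assumptions, not the patent's exact implementation):

```python
import math

def hough_peak_angle(points, thetas):
    """Vote each edge point into a (theta, rho) accumulator. Each point
    contributes one vote per theta; points lying on the same straight
    line vote into the same cell, so the peak cell's vote count is the
    number of curves intersecting there, and its theta gives the
    dominant line's orientation for deskewing."""
    acc = {}
    for x, y in points:
        for t in thetas:
            rho = round(x * math.cos(t) + y * math.sin(t), 6)
            acc[(t, rho)] = acc.get((t, rho), 0) + 1
    (theta, _), votes = max(acc.items(), key=lambda kv: kv[1])
    return theta, votes

# Points on the horizontal line y = 3: the peak lands at theta = pi/2
# (normal pointing straight up), with one vote from every point.
pts = [(x, 3) for x in range(10)]
thetas = [i * math.pi / 180 for i in range(0, 180, 5)]
theta, votes = hough_peak_angle(pts, thetas)
```

The detected angle can then be used to rotate the image back to the upright orientation.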
The image to be recognized after the preprocessing can better express the image characteristics, so that the image to be recognized can be used as a target image to further recognize texts in the image.
102. And determining the position information of the text region and the classification label of the text region in the target recognition image based on a pre-trained text region detection model.
The pre-trained text region detection model uses the YOLOv3 algorithm. Each image passing through the text region detection model has a corresponding output file, and the output file stores the text regions in the target recognition image, the position information of the text regions, and the classification labels.
The network model constructed by the YOLOv3 algorithm uses Darknet-53, logistic regression, category prediction, and multi-scale prediction, and the specific implementation process comprises the following steps: 1. features are extracted through the Darknet-53 network, which is formed by stacking residual units; it balances classification accuracy and efficiency and performs on a par with or better than networks such as ResNet-101 and ResNet-152; 2. the score of each bounding box is predicted using logistic regression: the score is 1 if the prior bounding box overlaps the ground-truth box better than any other prior box, and the prediction is ignored if the prior box is not the best but overlaps its ground-truth object by more than some threshold. The YOLOv3 algorithm allocates one bounding box to each real object; if a prior bounding box is not matched to a real object, it produces no coordinate or category prediction loss, only an objectness prediction loss; 3. category prediction: in order to realize multi-label classification, the model does not use a softmax function as the final classifier, but uses logistic classifiers with binary cross-entropy as the loss function; 4. multi-scale prediction: 3 boxes are predicted at each scale, and the anchors are designed by clustering to obtain 9 cluster centers, which are evenly distributed to the 3 scales according to their sizes.
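The prior-box scoring rule in step 2 can be sketched as follows (the corner-coordinate box format and the ignore threshold of 0.5 are illustrative assumptions, not values stated in this disclosure):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as
    (x1, y1, x2, y2). This overlap decides which prior bounding box is
    responsible for each ground-truth object."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

def assign_priors(gt_box, priors, ignore_thresh=0.5):
    """Score 1 for the best-overlapping prior; None (ignored, no loss)
    for priors that are not the best but still overlap the ground truth
    by more than the threshold; 0 otherwise."""
    overlaps = [iou(gt_box, p) for p in priors]
    best = overlaps.index(max(overlaps))
    return [1 if i == best else (None if o > ignore_thresh else 0)
            for i, o in enumerate(overlaps)]
```

For a ground-truth box that coincides with the first prior, the second prior (64% overlap) is ignored and the third (no overlap) scores 0.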
The specific process of training the text region detection model may be as follows: firstly, training data are prepared, and the text regions of interest in the images, such as hospital name, amount, item, and quantity, are manually framed and classification-labeled according to a preset labeling mode to obtain label files, which store the coordinate information of the text regions in the images and the classification labels of the text regions; after each training datum is input into the network model, the network model forward propagates and outputs the region category and the four coordinates of the region, the error is continuously back-propagated by calculating a loss function, and the training parameters of each network layer are adjusted until the training is finished.
103. And inputting the target recognition image and the position information of the text region in the target recognition image into a pre-trained text recognition model to obtain the text information in the text region under each classification label, and forming a text field with classification identification.
The text recognition model can be trained by adopting the CRNN algorithm ("An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition"); the target recognition image and the position information of the text regions in the target recognition image are input into the text recognition model, and the text recognition result corresponding to each text box is output. Since each text region corresponds to a classification label, a text field with a mapping relationship can be formed by structuring the classification label together with the text information in the text region under that classification label.
The process of specifically training the CRNN model may be as follows: firstly, the training data are stored in the form of images labeled with their text information and text regions. The CRNN structure adopts the form of CNN + RNN + CTC: the CNN extracts the spatial features of the receptive fields in the image, the RNN predicts the label distribution of each frame based on those spatial features, and the CTC integrates the per-frame label distributions to form the final label sequence. For example, the input picture is resized to W × 32, and the predicted value output by the text recognition model represents the text information corresponding to a text region in the target recognition image.
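The CTC integration step, merging per-frame label distributions into a final sequence, reduces in its simplest (greedy) form to collapsing consecutive repeats and removing the blank symbol; a minimal sketch (the blank character and labels are illustrative):

```python
def ctc_greedy_decode(frame_labels, blank="-"):
    """Collapse the per-frame label sequence predicted by the RNN into
    the final text: merge consecutive repeated labels, then drop the
    CTC blank symbol. The blank lets genuine double letters survive,
    since a blank between two identical labels prevents merging."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return "".join(out)

decoded = ctc_greedy_decode("--hh-e-ll-lo--")
```

Here the fourteen frame labels collapse to the five-character string "hello"; note that the blank between "ll" and "l" preserves the double letter.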
After recognition with the text recognition model, the text information of the text region under each classification label, namely the text field with the classification identification, can be obtained. For example, the text information in the text region corresponding to the hospital classification tag is "Jiangsu Province Third Hospital"; the text information in the text region corresponding to the examination item classification tag is "western medicine", "injection fee", "nursing fee", "emergency call fee", etc.; and the text information in the text region corresponding to the amount classification tag is a monetary amount written out in Chinese capital numerals.
104. And checking the text fields with the classification identifications by using a pre-constructed text checking library corresponding to each classification identification.
Since an outpatient invoice contains some fields with fixed, discrete values, such as medical insurance type, sex, and hospital name, a word embedding algorithm is adopted here to map the text information in a text region to the corresponding value in a word stock in order to improve the accuracy of the structured data, and the text fields with classification identifications are verified by utilizing the pre-constructed text verification library corresponding to each classification identification.
A word embedding algorithm can be understood as a mapping algorithm: a word in a text space is mapped, or embedded, into another numerical vector space by some method. CBOW predicts a word from its context, and the process is as follows:
(1) The input is C vectors of V dimensions each, where C is the size of the context window and V is the size of the original coding space. For example, with C = 2 and V = 4, the two input vectors are the 4-dimensional one-hot encodings of "He" and "is", respectively;
(2) The activation function is quite simple: between the input layer and the hidden layer, each input vector is multiplied by a V×N matrix, and each dimension of the resulting vectors is averaged to obtain the hidden layer. The hidden layer is then multiplied by an N×V matrix to obtain the output layer;
(3) The dimension N of the hidden layer is set to the desired dimension of the compressed word vector. In the example, supposing we want to compress the original 4-dimensional one-hot encoding to 2 dimensions, then N = 2;
(4) The output layer is a softmax layer that normalizes the outputs into probabilities. The loss function is the difference between this output and the target (the difference between the V-dimensional output vector and the one-hot encoding of the target word), and the purpose of training the neural network is to minimize this loss;
(5) After the optimization is finished, the N-dimensional vectors of the hidden layer can be used as the word embedding result.
The word embedding algorithm thus yields compressed, dense word vectors that carry context information.
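Tying steps (1)-(4) together, a toy forward pass with C = 2, V = 4, and N = 2 can be written in pure Python (the weight values below are arbitrary illustrative numbers, not trained parameters):

```python
import math

def cbow_forward(context_onehots, w_in, w_out):
    """One CBOW forward step: select and average the hidden projections
    of the context words (input layer times the V-by-N matrix), project
    back with the N-by-V matrix, and softmax into a probability
    distribution over the vocabulary."""
    V, N = len(w_in), len(w_in[0])
    # Hidden layer: average of the w_in rows selected by the one-hots.
    hidden = [0.0] * N
    for vec in context_onehots:
        idx = vec.index(1)
        for j in range(N):
            hidden[j] += w_in[idx][j] / len(context_onehots)
    # Output layer: hidden times the N-by-V matrix, then softmax.
    scores = [sum(hidden[j] * w_out[j][v] for j in range(N)) for v in range(V)]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# V = 4 vocabulary, N = 2 hidden units, context of C = 2 one-hot words.
w_in = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]   # V x N
w_out = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]      # N x V
probs = cbow_forward([[1, 0, 0, 0], [0, 1, 0, 0]], w_in, w_out)
```

After training, the rows of `w_in` serve as the N-dimensional word vectors from step (5).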
The process of specifically constructing the text check library corresponding to each classification identifier may include: firstly, training data are selected for each classification identifier, for example a hospital name library, in which all the individual characters of all hospital names are converted into word vectors, and an examination item library, in which all the individual characters of all examination item names are converted into word vectors; then, the training data represented by the word vectors are input into the network model, and the network model forward propagates and outputs the sequence string corresponding to the predicted character string; and the training parameters of each network layer are adjusted by calculating a softmax loss function and continuously back-propagating the error until the training is finished, forming the text check library corresponding to each classification identifier.
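One plausible reading of the verification step can then be sketched as follows: the recognized field's vector is compared against every library entry under the same classification identifier, a field whose best similarity clears the threshold is kept as the recognition result, and otherwise the closest library entry is substituted as the corrected result. The cosine metric, the threshold of 0.9, and the 2-dimensional vectors are illustrative assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two field vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def check_field(field, field_vec, library, threshold=0.9):
    """Match the field's vector against every entry of the check library
    for its classification identifier; keep the recognized field if the
    best match clears the threshold, otherwise substitute the closest
    library entry as the corrected field."""
    best_name, best_sim = max(
        ((name, cosine(field_vec, vec)) for name, vec in library.items()),
        key=lambda kv: kv[1])
    return field if best_sim > threshold else best_name

# Hypothetical 2-D vectors for two hospital-name library entries.
library = {"A hospital": [1.0, 0.0], "B hospital": [0.0, 1.0]}
kept = check_field("A hospital", [0.99, 0.1], library)   # clears threshold
fixed = check_field("B hospitel", [0.6, 0.8], library)   # corrected by library
```

In the second call, the misrecognized field "B hospitel" falls below the threshold, so the closest library entry "B hospital" is returned as the verification result.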
The image text recognition method provided by the embodiment of the invention determines the position information of a text region in a target recognition image and a classification label of the text region based on a pre-trained text region detection model, obtains the text information in the text region under each classification label by inputting the target recognition image and the position information of the text region in the target recognition image into the pre-trained text recognition model, forms a text field with classification marks, and verifies the text field with the classification marks by utilizing a pre-constructed text verification library corresponding to each classification mark, so that the field obtained by recognition corresponds to a required field, and the text information in the image can be conveniently used. Compared with the image text recognition method in the prior art, the text information recognized in the text region is verified by pre-constructing the text verification library corresponding to each classification identifier, and because the text verification library comprises all text fields under the classification identifiers, wrong fields in the text information are corrected, so that the utilization rate of the text information in the image is improved.
The embodiment of the invention provides another image text recognition method, which can verify a text recognition result and improve the utilization rate of image text information, and as shown in fig. 2, the method comprises the following steps:
201. and acquiring an image to be recognized, and preprocessing the image to be recognized to obtain a target recognition image.
For the embodiment of the present invention, the process of specifically acquiring the image to be recognized, and preprocessing the image to be recognized to obtain the target recognition image may refer to the content described in step 101, which is not described herein again.
202. And performing text region labeling and classification labeling on the collected image sample data, and inputting the image sample data into a second network model for training to obtain a text region detection model.
The image sample data may be images collected from different scenes, which can reflect the image characteristics of those scenes: for example, images in the medical field are relatively simple or change gradually, images in industrial settings have relatively complex backgrounds and relatively small text regions, and images in natural scenes have backgrounds that are strongly influenced by natural factors, with background complexity that is difficult to predict.
It can be understood that in general target detection, each target in an image of any scene has a definite closed boundary, whereas a text line or word in an image is composed of many individual characters or strokes and may not have such a definite boundary. The text regions in the image therefore need to be detected first: specifically, the text region contained in each image of the image sample data can be labeled, the labeled image sample data can be trained to construct a text region detection model, and the text regions in an image can be detected by using the text region detection model, so as to recognize the text in the image.
For the embodiment of the invention, the second network model can adopt the YOLOv3 network framework and comprises a 3-layer structure: the first layer is a convolutional layer, through which the image region features corresponding to the image sample data are extracted; the second layer is a prediction layer, which uses multi-scale candidate text boxes to predict the bounding boxes of the image region features corresponding to the image sample data and determine the text regions in the image sample data; and the third layer is a logistic regression layer, which classifies the categories of the text regions in the image sample data according to the classification labels of the text regions to obtain the position information of the text regions and the classification labels of the text regions.
It can be understood that, in order to ensure the prediction accuracy of the trained text region detection model, after the collected image sample data is subjected to text region labeling and classification labeling and input into the second network model for training to obtain the text region detection model, a loss function may be preset, and the parameters of the multilayer structure in the text region detection model may be adjusted based on the deviation between the recognition results output by the text region detection model and the labeling data obtained by performing text region labeling and classification labeling on the image sample data.
203. And determining the position information of the text region and the classification label of the text region in the target recognition image based on a pre-trained text region detection model.
It can be understood that each image processed by the text region detection model has a corresponding output file, which stores the classification labels and position information of all candidate text lines in the image, together with a label indicating whether each candidate text line belongs to a text region. The candidate text lines are equivalent to the vertical strips into which a text region is split, and the text regions in the target image and their classification labels can be determined by connecting the candidate text lines.
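The connecting step above can be sketched as a simple merge of horizontally adjacent, vertically overlapping strips into one region box. This is an illustrative approximation, not the patented method; the name `connect_text_lines` and the `max_gap` parameter are assumptions:

```python
def connect_text_lines(candidates, max_gap=8):
    """Merge vertical strip candidates (x1, y1, x2, y2) that are horizontally
    adjacent and vertically aligned into full text-region boxes."""
    candidates = sorted(candidates)  # left-to-right by x1
    regions = []
    for box in candidates:
        if regions:
            last = regions[-1]
            # same line if the horizontal gap is small and there is vertical overlap
            if box[0] - last[2] <= max_gap and min(box[3], last[3]) - max(box[1], last[1]) > 0:
                regions[-1] = (last[0], min(last[1], box[1]), box[2], max(last[3], box[3]))
                continue
        regions.append(box)
    return regions
```

Three adjacent strips merge into one text-region box, while a distant strip starts a new region.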
204. And inputting the target recognition image and the position information of the text region in the target recognition image into a pre-trained text recognition model to obtain the text information in the text region under each classification label.
It can be understood that the trained text recognition model has the capability of recognizing the text information in a text region. Since the parameters of the text recognition model are continuously adjusted through forward propagation and backward error correction using the sample images and the classification labels of the text regions in those sample images during training, the text recognition model can accurately recognize from an image the text information in a text region and the classification label of that text region.
205. And converting the collected dictionary samples into code expression, inputting the code expression into the first network model for training to obtain a text mapping model.
The first network model comprises a multilayer structure. Specifically, probability prediction may be performed on the coded text field samples through the input layer of the first network model to generate a probability distribution for each text field sample; according to the probability distribution of each text field sample, each text field sample is trained as the weight of an output field through the hidden layer of the first network model to obtain a mapping matrix of the text fields; and the weights in the mapping matrix of the text fields are adjusted through the fully connected layer of the first network model to obtain the text mapping model.
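The three-layer structure described above resembles a word-embedding network, where the hidden-layer weight matrix ends up holding the mapping vectors. The following is a minimal forward-pass sketch under that assumption; the sizes, weight initialization, and names (`W_hidden`, `W_out`, `forward`) are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 50, 8

# Hidden-layer weights: after training, each row is the mapping
# vector of one dictionary token (the "mapping matrix" of the text fields).
W_hidden = rng.normal(size=(vocab_size, embed_dim))
# Fully connected output weights, adjusted during training.
W_out = rng.normal(size=(embed_dim, vocab_size))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def forward(token_id):
    """One-hot coded token -> hidden mapping vector -> probability distribution."""
    h = W_hidden[token_id]  # one-hot vector times W_hidden is just a row lookup
    return h, softmax(h @ W_out)

h, p = forward(3)
```

Training would adjust `W_hidden` and `W_out` so the output distribution predicts co-occurring fields; the rows of `W_hidden` are then used as the text mapping.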
It can be understood that, in order to further improve the text-mapping accuracy of the first network model, after the collected dictionary samples are converted into coded representations and input into the first network model for training to obtain the text mapping model, a loss function may also be preset, and the parameters of the multilayer structure in the text mapping model may be adjusted based on the deviation between the vector values of the coded dictionary samples and the text mapping results output by the text mapping model.
206. And coding and representing the text field with the classification identification by using a pre-trained text mapping model to obtain a vector value of the text field with the classification identification.
It is understood that the text mapping model can map the text field into a set of numerical vector spaces, and vector values of the text field with the classification identifier can be obtained by inputting the text field with the classification identifier into the text mapping model.
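One simple way to picture this mapping (the specification leaves the exact pooling unspecified, so this is only an assumed sketch) is to average per-character mapping vectors into a single field vector. The `char_vectors` table and its two-dimensional values below are made up for illustration:

```python
import numpy as np

# Hypothetical character-level mapping table produced by a trained
# text mapping model; the vectors here are invented for illustration.
char_vectors = {"北": np.array([0.2, 0.9]), "京": np.array([0.4, 0.7]),
                "医": np.array([0.8, 0.1]), "院": np.array([0.6, 0.3])}

def field_to_vector(field):
    """Map a recognized text field to one vector by averaging its
    per-character mapping vectors (unknown characters map to zero)."""
    vecs = [char_vectors.get(c, np.zeros(2)) for c in field]
    return np.mean(vecs, axis=0)
```

Averaging is just one pooling choice; what matters for the next step is that every field lands in the same numerical vector space as the check library entries.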
207. And performing similarity matching on the vector value of the text field with the classification identification and the vector value corresponding to the text field under the corresponding classification identification in the text check library.
For the embodiment of the invention, text fields with different classification identifiers each have their own text check library, and the text fields under each classification identifier in the pre-constructed text check library are represented as vector values. For example, the text fields recorded in the text check library for the hospital-name classification label may appear as the vector values of various hospital names, and the text check libraries of other classification labels are handled in the same way. The recognized text field can then be verified by performing similarity matching between the vector value of the text field with a classification identifier and the vector values of the text fields under the corresponding classification identifier in the text check library.
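The similarity matching can be sketched as a nearest-neighbour lookup over the check library. Cosine similarity is one common choice, assumed here since the specification does not name a metric; `best_match` is an illustrative name:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two field vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_match(field_vec, library):
    """Return the check-library entry whose stored vector is most similar
    to the recognized field's vector, plus the similarity score."""
    return max(((name, cosine(field_vec, vec)) for name, vec in library.items()),
               key=lambda item: item[1])
```

With a library of two entries pointing along different axes, a recognized vector close to one axis matches that entry with a score near 1.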
208. And checking the text field with the classification identification according to the value obtained by the similarity matching.
For the embodiment of the invention, the value obtained by similarity matching reflects the similarity between the recognized text field and the correct text fields in the text check library. If the value obtained by similarity matching is greater than a preset threshold, the text field with the classification identifier has high accuracy, and the text field with the classification identifier is output as a text recognition result; if the value obtained by similarity matching is less than or equal to the preset threshold, the text field with the classification identifier has low accuracy, and it is output as a text verification result that needs to be audited manually.
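The threshold decision described above reduces to a small routing function. The threshold value and the status strings below are placeholders, not values from the specification:

```python
def verify_field(field, score, threshold=0.9):
    """Route a recognized field by its similarity score: above the preset
    threshold it is accepted as a recognition result, otherwise it is
    output as a verification result flagged for manual audit."""
    if score > threshold:
        return {"field": field, "status": "recognition_result"}
    return {"field": field, "status": "needs_manual_audit"}
```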
Further, as a specific implementation of the method shown in fig. 1, an embodiment of the present invention provides an image text recognition apparatus, and as shown in fig. 3, the apparatus includes: an acquisition unit 31, a determination unit 32, a processing unit 33, and a verification unit 34.
The acquiring unit 31 may be configured to acquire an image to be identified, and pre-process the image to be identified to obtain a target identification image;
a determining unit 32, configured to determine, based on a text region detection model trained in advance, position information of a text region in the target recognition image and a classification label of the text region;
the processing unit 33 may be configured to input the target recognition image and the position information of the text region in the target recognition image into a pre-trained text recognition model, obtain text information in the text region under each classification label, and form a text field with a classification identifier;
the checking unit 34 may be configured to check the text field with the classification identifier by using the pre-constructed text check library corresponding to each classification identifier. The image text recognition apparatus provided by the embodiment of the invention determines the position information of a text region in a target recognition image and the classification label of the text region based on a pre-trained text region detection model, inputs the target recognition image and the position information of the text region into a pre-trained text recognition model to obtain the text information in the text region under each classification label and form text fields with classification identifiers, and verifies the text fields with classification identifiers by using the pre-constructed text check library corresponding to each classification identifier, so that the recognized fields correspond to the required fields and the text information in the image can be conveniently used. Compared with the image text recognition method in the prior art, the text information recognized in the text region is verified against a pre-constructed text check library corresponding to each classification identifier; because the text check library contains all the text fields under a classification identifier, wrong fields in the text information are corrected, thereby improving the utilization rate of the text information in the image.
As a further description of the image text recognition apparatus shown in fig. 3, fig. 4 is a schematic structural diagram of another image text recognition apparatus according to an embodiment of the present invention, and as shown in fig. 4, the apparatus further includes:
the first training unit 35 may be configured to, before the text field with the classification identifier is verified by using the pre-established text verification library corresponding to each classification identifier, convert the collected dictionary sample into a code representation and input the code representation to the first network model for training to obtain a text mapping model;
the first training unit 35 comprises:
a generating module 351, configured to perform probability prediction on text field samples of the encoded representation through an input layer of the first network model, and generate a probability distribution of each text field sample;
a training module 352, configured to train, according to the probability distribution of each text field sample, each text field sample as a weight of an output field through a hidden layer of the first network model, to obtain a mapping matrix of a text field;
the adjusting module 353 may be configured to adjust the weight in the mapping matrix of the text field through the full connection layer of the first network model, so as to obtain a text mapping model.
Further, the representation form of the text field under each classification identifier in the text check library corresponding to each pre-constructed classification identifier is a vector value, and the apparatus further includes:
the encoding unit 36 may be configured to, before the text field with the classification identifier is verified by using the pre-constructed text verification library corresponding to each classification identifier, encode and represent the text field with the classification identifier by using a pre-trained text mapping model, and then obtain a vector value of the text field with the classification identifier.
Further, the verification unit 34 includes:
a matching module 341, configured to perform similarity matching between the vector value of the text field with the classification identifier and a vector value corresponding to a text field under a corresponding classification identifier in the text check library;
the verification module 342 may be configured to verify the text field with the classification identifier according to the value obtained by the similarity matching;
further, the verification module 342 may be specifically configured to, if the value obtained by the similarity matching is greater than a preset threshold, output the text field with the classification identifier as a text recognition result;
the verification module 342 may be further configured to, if the value obtained by the similarity matching is smaller than or equal to a preset threshold, output the text field with the classification identifier as a text verification result.
Further, the apparatus further comprises:
a second training unit 37, configured to, before determining the position information of the text region and the classification label of the text region in the target recognition image based on the pre-trained text region detection model, perform text region labeling and classification labeling on the collected image sample data, and then input the image sample data into a second network model for training to obtain a text region detection model;
further, the second training unit 37 includes:
an extracting module 371, configured to extract image region features corresponding to image sample data through the convolution layer of the second network model;
a prediction module 372, configured to determine a text region in image sample data by using a multi-scale candidate text box to predict a bounding box of an image region feature corresponding to the image sample data through a prediction layer of the second network model;
the classification module 373 may be configured to classify, by the logistic regression layer of the second network model, the category to which the text region in the image sample data belongs according to the classification label of the text region, so as to obtain the position information of the text region and the classification label of the text region.
Further, the apparatus further comprises:
the adjusting unit 38 may be configured to perform text region labeling and classification labeling on the collected image sample data, input the image sample data into the second network model for training, obtain a text region detection model, perform text region labeling and classification labeled labeling data by using the image sample data, and perform parameter adjustment on the multilayer structure in the text region detection model by using a preset loss function.
It should be noted that other corresponding descriptions of the functional units related to the image text recognition apparatus provided in this embodiment may refer to the corresponding descriptions in fig. 1 and fig. 2, and are not repeated herein.
Based on the methods shown in fig. 1 and fig. 2, correspondingly, the present embodiment further provides a storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the image text recognition method shown in fig. 1 and fig. 2.
Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the implementation scenarios of the present application.
Based on the method shown in fig. 1 and fig. 2 and the virtual device embodiment shown in fig. 3 and fig. 4, in order to achieve the above object, an embodiment of the present application further provides a computer device, which may specifically be a personal computer, a server, a network device, and the like, where the entity device includes a storage medium and a processor; a storage medium for storing a computer program; a processor for executing a computer program to implement the image text recognition method shown in fig. 1 and 2.
Optionally, the computer device may also include a user interface, a network interface, a camera, Radio Frequency (RF) circuitry, sensors, audio circuitry, a WI-FI module, and so forth. The user interface may include a Display screen (Display), an input unit such as a keypad (Keyboard), etc., and the optional user interface may also include a USB interface, a card reader interface, etc. The network interface may optionally include a standard wired interface, a wireless interface (e.g., a bluetooth interface, WI-FI interface), etc.
Those skilled in the art will appreciate that the physical device structure of the image text recognition apparatus provided in the present embodiment does not constitute a limitation on the physical device, which may include more or fewer components, combine some components, or arrange the components differently.
The storage medium may further include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the computer device described above, supporting the operation of information handling programs and other software and/or programs. The network communication module is used for realizing communication among components in the storage medium and other hardware and software in the entity device.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus a necessary general hardware platform, and can also be implemented by hardware. By applying this technical solution, compared with the prior art, the text information recognized in the text region is verified against a pre-constructed text check library corresponding to each classification identifier; because the text check library contains all the text fields under a classification identifier, wrong fields in the text information are corrected, thereby improving the utilization rate of the text information in the image.
Those skilled in the art will appreciate that the figures are merely schematic representations of one preferred implementation scenario and that the blocks or flow diagrams in the figures are not necessarily required to practice the present application. Those skilled in the art will appreciate that the modules in the devices in the implementation scenario may be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
The above application serial numbers are for description purposes only and do not represent the superiority or inferiority of the implementation scenarios. The above disclosure is only a few specific implementation scenarios of the present application, but the present application is not limited thereto, and any variations that can be made by those skilled in the art are intended to fall within the scope of the present application.