CN114581906B - Text recognition method and system for natural scene image - Google Patents
Text recognition method and system for natural scene image
- Publication number
- CN114581906B (application CN202210483188.4A)
- Authority
- CN
- China
- Prior art keywords
- features
- semantic
- image
- visual
- natural scene
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to the technical field of data recognition and discloses a text recognition method and system for natural scene images. The method comprises the following steps: acquiring a natural scene image to be recognized; and performing text recognition on the image to be recognized with a trained deep learning model to obtain the recognized text. The deep learning model first rectifies the natural scene image to be recognized and then extracts a feature vector from the rectified image; visual features and semantic features are then extracted separately from the feature vector of the image, the two kinds of features are fused, and text recognition is finally performed on the fused features. The method can recognize scene text of arbitrary shape, has a wide range of application scenarios and strong model generalization, and can be applied to various text recognition settings.
Description
Technical Field
The invention relates to the technical field of data recognition, and in particular to a text recognition method and system for natural scene images.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Text recognition is a branch of computer vision research; it belongs to pattern recognition and artificial intelligence and is an important component of computer science. Its goal is to recognize text in natural scenes by computational means and convert it into a format that a computer can display and people can understand. Text recognition can greatly increase the speed of information processing.
Because text is sequential, scene text recognition must bridge a large semantic gap between a two-dimensional text image and a one-dimensional text sequence, which gives the task its particular difficulty. Human recognition of scene text is affected by many factors, such as background, color, and even culture.
For example, when a person sees the familiar yellow logo on a white background, they subconsciously know that the writing is "McDonald's"; often, having seen such a scene once, a person can tell what the text is without actually reading what is written. For a computer, however, such complex scenes are of no help. Real scenes are extremely varied in font, style, background and so on, and these interferences frequently cause a computer to misrecognize individual characters; once a character is misrecognized, the result deviates from the meaning of the text.
Current text recognition methods can be roughly divided into two directions. One direction exploits the strong feature extraction ability of neural networks in the deep learning era to optimize the visual features and obtain stronger representations. The other direction performs semantic enhancement on the features extracted from the two-dimensional visual image by a feature extractor, or corrects the result of the visual model according to the semantics of the text.
In recent years most of the best-performing models have been semantics-based. In such methods a feature extractor first turns the two-dimensional picture into a visual feature map, a semantic model then further encodes the feature map extracted by the visual model to obtain semantic features, and finally the visual and semantic features are used jointly to obtain the recognition result. However, this approach makes the semantic model highly dependent on the visual features.
This way of coupling semantic features with visual features has two disadvantages. First, the semantic model is only used to correct the results produced by the visual model: although the whole model is trained end to end, the semantic model in effect acts as a post-processing step that is separate from the rest of the model, which makes the model larger and the lengthened gradient chain harder to train. Second, correcting the visual model with the semantic model does strengthen recognition, but natural scenes contain many kinds of erroneous text, for example in the recognition and correction of handwritten examination papers; with a semantic model, the model automatically "corrects" wrong text into correct words, which clearly deviates from the intention of the correction operation.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a text recognition method and system for natural scene images. Unlike prior models, which decouple the visual and semantic models only by truncating gradients, the proposed semantics-independent text recognition network adjusts the model structure to achieve complete structural decoupling. A new semantic module is designed to make full use of semantic information. A new visual-semantic feature fusion module is designed so that visual features and semantic features interact fully and both are fully exploited. A gate mechanism fuses the enhanced visual information and semantic information to obtain the final prediction result.
In a first aspect, the invention provides a text recognition method for natural scene images;
the text recognition method for the natural scene image comprises the following steps:
acquiring a natural scene image to be identified;
performing text recognition on the natural scene image to be recognized by adopting the trained deep learning model to obtain a recognized text;
the deep learning model first rectifies the natural scene image to be recognized and then extracts a feature vector from the rectified image; visual features and semantic features are then extracted separately from the feature vector of the image, the two kinds of features are fused, and text recognition is finally performed on the fused features.
In a second aspect, the present invention provides a text recognition system for images of natural scenes;
a system for text recognition of images of natural scenes, comprising:
an acquisition module configured to: acquiring a natural scene image to be identified;
an identification module configured to: performing text recognition on the natural scene image to be recognized by adopting the trained deep learning model to obtain a recognized text;
the deep learning model first rectifies the natural scene image to be recognized and then extracts a feature vector from the rectified image; visual features and semantic features are then extracted separately from the feature vector of the image, the two kinds of features are fused, and text recognition is finally performed on the fused features.
Compared with the prior art, the invention has the beneficial effects that:
1) The method can recognize scene text of arbitrary shape, has a wide range of application scenarios and strong model generalization, and can be applied to various text recognition settings.
2) The invention adjusts the model structure to reduce the coupling of the semantic module to the visual module and make the semantic module an independent part, so the model is easier to train end to end without a complex multi-stage or pre-training process.
3) The invention provides a new semantic module for text recognition that processes the semantic information of the text on an equal footing with the visual module and produces a preliminary semantic recognition result at its branch to guide model training.
4) The invention provides a new visual-semantic fusion scheme in which semantic features and visual features of equal standing interact, so that the information in both is fully exploited; a gate mechanism makes the final decision on the fused information, yielding a better recognition result.
Drawings
The accompanying drawings, which constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain, not limit, the invention.
FIG. 1 is a functional block diagram of the entire network;
fig. 2 is a schematic diagram of feature fusion of visual features and semantic features.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
All data in the embodiments were obtained and used lawfully, in compliance with laws and regulations and with user consent.
Example one
The embodiment provides a text recognition method of a natural scene image;
the text recognition method of the natural scene image comprises the following steps:
S101: acquiring a natural scene image to be recognized;
S102: performing text recognition on the natural scene image to be recognized by adopting the trained deep learning model to obtain a recognized text;
the deep learning model first rectifies the natural scene image to be recognized and then extracts a feature vector from the rectified image; visual features and semantic features are then extracted separately from the feature vector of the image, the two kinds of features are fused, and text recognition is finally performed on the fused features.
Further, as shown in fig. 1, the deep learning model includes a rectification module connected to the input of a backbone network; the output of the backbone network is connected to the inputs of the visual feature extraction module and of the semantic feature extraction module; the outputs of the visual feature extraction module and of the semantic feature extraction module are both connected to the input of the visual-semantic feature fusion module; the output of the visual-semantic feature fusion module is connected to the input of the prediction module; and the output of the prediction module gives the text recognition result.
Further, the rectification module is implemented with the Thin Plate Spline (TPS) interpolation algorithm and is used to rectify curved text images into a regular shape.
It should be understood that curved text is common in natural scenes, and these large variations in image background, appearance and layout pose a significant challenge to the text recognition task that conventional Optical Character Recognition (OCR) methods cannot handle effectively. Training a recognition model that copes with all such scenes is difficult, so this embodiment rectifies the image with the thin-plate spline interpolation algorithm TPS to obtain more regular text. The key problems of TPS are locating 2J control points and computing the transformation matrix between the control points, where J is a hyper-parameter. The 2J control points are obtained by regression: the input image is passed through a 3-layer convolutional neural network to extract image features, and a fully connected layer then produces outputs corresponding to the 2J control points. The transformation matrix has an analytical solution obtained from the norms between the pixel points and the control points and between the control points themselves. This step yields a preliminarily rectified text image.
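Illustratively, the control-point regression described above may be sketched as follows. This is only an illustrative sketch and not the patented implementation: the layer widths, the value J = 10 and the sigmoid normalization of the coordinates are assumptions, and the analytic thin-plate-spline transformation itself is omitted.

# Hypothetical sketch of the TPS localization step: a 3-layer CNN extracts image
# features and a fully connected layer regresses an (x, y) coordinate for each of
# the 2*J control points. Sizes and J are assumed values; the TPS warp is not shown.
import torch
import torch.nn as nn

class TPSLocalizer(nn.Module):
    def __init__(self, J=10):                       # J is a hyper-parameter (assumed value)
        super().__init__()
        self.features = nn.Sequential(              # 3-layer convolutional feature extractor
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.J = J
        # Full connection regressing coordinates for the 2*J control points.
        self.fc = nn.Linear(128, 2 * J * 2)

    def forward(self, img):                          # img: (B, 3, H, W)
        feat = self.features(img).flatten(1)         # (B, 128)
        pts = torch.sigmoid(self.fc(feat))           # normalized coordinates in [0, 1]
        return pts.view(img.size(0), 2 * self.J, 2)  # (B, 2J, 2) control points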
Furthermore, the backbone network is implemented with a ResNet (Residual Neural Network) convolutional neural network. It extracts spatial features of the natural scene image, embeds position information into the spatial features with a positional encoding, and enhances the position-aware spatial features with a first Transformer neural network to obtain the feature vector of the image.
The positional encoding is used as follows. Because scene text images have varied backgrounds, font styles and shooting noise, a deep convolutional neural network is usually adopted as the image encoder, and this embodiment selects ResNet45 as the backbone network; the residual connections in the network reduce the model degradation caused by an overly deep network and also alleviate the vanishing-gradient problem. In addition, because text is sequential, the features of the residual network are enhanced with a positional encoding and a Transformer: the positional encoding introduces the position information of the image, so that the neural network pays more attention to the positional relations between features in the two-dimensional feature map, while in the Transformer the query Q, the key K and the value V are all the feature itself, which deeply mines the relations among the internal features of the feature map. The formulation is described as follows:

F = T(R(x) + P)        (1.4)

where P is the positional encoding obtained in equation (1.1), x refers to the input image, R refers to the residual neural network, and T refers to the Transformer network. The operation of this step is to feed the image into the ResNet45 network to obtain features, add the features to the positional encoding, and send the sum into the Transformer network.
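Illustratively, the backbone described above can be sketched as below. This is a rough, non-limiting sketch: the stand-in CNN replaces the ResNet45 of the embodiment, and the channel count, head count and layer count are assumed values.

# Hypothetical sketch of the backbone: CNN spatial features + positional encoding + Transformer.
# The small CNN stands in for the patent's ResNet45; dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class Backbone(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=3, max_tokens=256):
        super().__init__()
        self.cnn = nn.Sequential(                    # stand-in for ResNet45 (residual network R)
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(256, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Learned positional encoding P added to the flattened feature map.
        self.pos = nn.Parameter(torch.zeros(1, max_tokens, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)   # Transformer T

    def forward(self, x):                  # x: (B, 3, H, W) rectified image
        f = self.cnn(x)                    # (B, C, h, w) spatial features R(x)
        f = f.flatten(2).transpose(1, 2)   # (B, h*w, C) token sequence
        f = f + self.pos[:, : f.size(1)]   # add positional encoding P
        return self.transformer(f)         # enhanced feature F of formula (1.4)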
Further, the visual feature extraction module separates the visual part and the semantic part of the image feature vector with a second Transformer neural network, and decodes the visual part with a position attention mechanism module to obtain visual features.

The position attention mechanism module replaces the query Q, the key K and the value V of the self-attention mechanism with different elements: the query Q is replaced by a positional encoding of the character order, the key K is replaced by the output of a UNet network, and the value V is replaced by an identity mapping of the visual feature.

The position attention mechanism is shown in formulas (1.6)-(1.8).
After the visual features obtained by the second Transformer network are decoded by the position attention mechanism module, the decoding result is predicted with a fully connected layer and supervised during training with the cross-entropy loss of formula (1.10).
Illustratively, in this embodiment the visual feature extraction module uses a Transformer to further extract high-dimensional visual features, so as to reduce the dependence of the subsequent decoding on the backbone network and decouple the decoding of the semantic part from the decoding of the visual part. The visual part is characterized by two-dimensional features and the semantic part by a one-dimensional result sequence, which is the fundamental difference between the two. In this step the visual feature V is obtained from the feature F as follows:

V = softmax(F F^T / sqrt(d)) F        (1.5)

where the query, key and value in formula (1.5) are all identity mappings of the feature F obtained from formula (1.4), d is the dimension of the feature, a hyper-parameter, and softmax refers to the softmax function. Through this step the visual part is given a deeper, dedicated decoding, so that the decoding of the visual part is separated from the semantic part.
Next, the visual features are converted into a character sequence with the position attention mechanism module, again using the self-attention formula; but unlike formula (1.5), where the query, key and value are all identity mappings of the feature, formulas (1.6)-(1.8) use different encodings for each of them.

In contrast to the query of formula (1.5), which is intended to encode the positional relations of the two-dimensional visual features and therefore actually uses the feature F itself, the query here encodes the order relation between the characters within the word, so word embedding of the position sequence is adopted. The formulation is described as follows:

Q_p = E(O)        (1.6)

where E is a word-embedding function, O is the order of the characters, namely [0, 1, ..., N], and Q_p is the corresponding positional encoding.
A UNet network is then used. Here the UNet does not use character-level segmentation to guide training but serves as a feature-enhancement step, through which a new feature of the same size as the original feature map is obtained. The formulation is described as follows:

K = U(V)        (1.7)

where V is the feature obtained by formula (1.5), U is the UNet network, and K is the resulting key. The value is the identity mapping, i.e. the feature V itself is used.
Similarly, the final feature map of the visual part is obtained with the self-attention formula, and the result of the visual part is obtained after decoding:

F_v = softmax(Q_p K^T / sqrt(d)) V        (1.8)
Y_v = FC(F_v),  Y_v ∈ R^(N×C)        (1.9)

where Q_p in formula (1.8) corresponds to formula (1.6), K corresponds to formula (1.7), V is the identity mapping of the feature of formula (1.5), and the resulting F_v is the final feature of the visual part. In formula (1.9), FC is a fully connected layer, F_v is the output obtained in formula (1.8), R refers to the real-valued space, N is the maximum length of a word, here a predefined value, C is the number of categories of characters, again a predefined value, and Y_v is the recognition result of the visual part.
This embodiment introduces a cross-entropy loss to guide the training of the visual part, as follows:

L_v = Σ_{t=1}^{N} ce(Y_v^(t), y_t)        (1.10)

where Y_v^(t) is the t-th time slice of the prediction in formula (1.9), y_t is the corresponding ground-truth label, N corresponds to the length of the word, and ce denotes the cross-entropy computed between each predicted character and its true value.
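Illustratively, the position attention decoding of formulas (1.6)-(1.9) can be sketched as follows. The sketch is indicative only: the feature dimension, maximum word length and class count are assumed, and a small fully connected network stands in for the UNet of formula (1.7).

# Hypothetical sketch of the position attention decoder:
# query = embedding of character order, key = UNet-enhanced feature, value = feature itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionAttentionDecoder(nn.Module):
    def __init__(self, d_model=512, max_len=25, num_classes=37):   # assumed sizes
        super().__init__()
        self.order_embed = nn.Embedding(max_len, d_model)  # E(O) of formula (1.6)
        # Stand-in for the UNet of formula (1.7): a small feature-enhancement network.
        self.key_net = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
        self.cls = nn.Linear(d_model, num_classes)          # FC of formula (1.9)
        self.max_len = max_len

    def forward(self, vis_feat):                # vis_feat V: (B, h*w, d) from formula (1.5)
        b, _, d = vis_feat.shape
        order = torch.arange(self.max_len, device=vis_feat.device)
        q = self.order_embed(order).unsqueeze(0).expand(b, -1, -1)   # (B, N, d) queries
        k = self.key_net(vis_feat)                                   # (B, h*w, d) keys
        attn = F.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=-1)   # weights of formula (1.8)
        f_v = attn @ vis_feat                                        # (B, N, d) visual feature F_v
        return f_v, self.cls(f_v)                                    # F_v and logits Y_v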
Furthermore, the semantic feature extraction module aligns feature vectors of the images by using an attention mechanism module, and then decodes the aligned data to obtain semantic features.
The alignment process refers to aligning two-dimensional features into one-dimensional features.
The decoding of the aligned data is implemented with a Long Short-Term Memory network (LSTM).
The output of the long short-term memory network is passed through a fully connected layer to predict the semantic result, and a second cross-entropy loss function is used for supervised training.
Illustratively, when the semantic feature extraction module is coupled to the visual feature extraction module, its training depends heavily on the visual feature extraction module; in this embodiment the semantic feature extraction module is therefore made independent and encodes the features of the backbone network separately, which reduces the coupling degree of the model.
The semantic feature extraction module aligns the two-dimensional features produced by the backbone network into well-aligned one-dimensional features with an attention mechanism. In addition, this embodiment introduces position information into the attention mechanism, so that it focuses not only on discriminative regions but also on the positional relations of the text within the image. The formulation is described as follows:

e_{t,i} = w^T tanh(W_1 F_i + W_2 E(O_t))        (2.1)
α_{t,i} = exp(e_{t,i}) / Σ_j exp(e_{t,j})        (2.2)
s_t = Σ_i α_{t,i} F_i        (2.3)

where in formula (2.1) w, W_1 and W_2 are all trainable parameters, O is the order of the characters, namely [0, 1, ..., N], with N the preset maximum length of a word (the same hyper-parameter throughout), E denotes the word-embedding operation, and F_i is the i-th element of the feature vector obtained in formula (1.4); in formula (2.2) exp refers to the exponential function with base e, the values inside it are the scores obtained by formula (2.1), and α_{t,i} is the weight obtained for position i at time step t; multiplying the weights of formula (2.2) back onto the feature sequence of formula (1.4), as in formula (2.3), yields the aligned one-dimensional sequence features s_t.
After the aligned one-dimensional sequence features are obtained, they are decoded with a long short-term memory network (LSTM); because of the time factor, decoding is performed in one pass on the aligned features rather than in an autoregressive manner. The specific formulation is described as follows:

i_t = σ(W_i [h_{t-1}, s_t] + b_i)        (2.4)
f_t = σ(W_f [h_{t-1}, s_t] + b_f)        (2.5)
o_t = σ(W_o [h_{t-1}, s_t] + b_o)        (2.6)
g_t = tanh(W_g [h_{t-1}, s_t] + b_g)        (2.7)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t        (2.8)
h_t = o_t ⊙ tanh(c_t)        (2.9)
F_s = [h_1, h_2, ..., h_N]        (2.10)

where i_t, f_t and o_t are the three gates of the LSTM, σ corresponds to the sigmoid function, all W and b are learnable parameters, h_{t-1} corresponds to the hidden-layer state of the previous time slice, s_t is the result obtained by formula (2.3) and corresponds to the input at time t, tanh is the hyperbolic tangent function, ⊙ corresponds to element-wise (point-wise) multiplication, and F_s is the output of the LSTM.
A full connection is then applied to obtain the final recognition result of the semantic part:

Y_s = FC(F_s),  Y_s ∈ R^(N×C)

where F_s is the output of the LSTM in formula (2.10) and the decoded semantic feature, FC is a fully connected layer, Y_s is the final prediction result of the semantic model, R refers to the real-valued space, N is the maximum length of a word, here a predefined value, and C is the number of categories of characters, again a predefined value.
Through the LSTM decoding, the decoded semantic features are obtained, and the recognition result of the semantic part is obtained after alignment and full connection. Similarly, for the semantic model, cross entropy is adopted to supervise the recognition result of this part. The formulation is described as follows:

L_s = Σ_{t=1}^{N} ce(Y_s^(t), y_t)        (2.11)

where Y_s^(t) is the predicted value at time step t obtained through formula (2.10) and the full connection, y_t corresponds to the true value at time t, N is the length of the characters, and L_s is the loss sought between each predicted character and the real character.
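Illustratively, the semantic branch of formulas (2.1)-(2.11) can be sketched as follows. The additive form of the attention score, the dimensions and the one-layer LSTM are assumptions made for the sketch only.

# Hypothetical sketch of the semantic branch: align the 2-D backbone features into a
# 1-D sequence with position-aware attention, then decode in one pass with an LSTM
# and classify; the cross-entropy supervision of formula (2.11) is not shown.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticBranch(nn.Module):
    def __init__(self, d_model=512, max_len=25, num_classes=37):   # assumed sizes
        super().__init__()
        self.order_embed = nn.Embedding(max_len, d_model)   # E(O_t) in formula (2.1)
        self.w_feat = nn.Linear(d_model, d_model)            # W_1
        self.w_order = nn.Linear(d_model, d_model)            # W_2
        self.score = nn.Linear(d_model, 1)                     # w^T tanh(...)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)  # one-pass decoding
        self.cls = nn.Linear(d_model, num_classes)
        self.max_len = max_len

    def forward(self, feat):                       # feat F: (B, h*w, d) from the backbone
        b = feat.size(0)
        order = torch.arange(self.max_len, device=feat.device)
        q = self.order_embed(order).unsqueeze(0).expand(b, -1, -1)          # (B, N, d)
        # Additive attention scores e_{t,i} and weights alpha_{t,i}.
        e = self.score(torch.tanh(self.w_feat(feat).unsqueeze(1) + self.w_order(q).unsqueeze(2)))
        alpha = F.softmax(e.squeeze(-1), dim=-1)                             # (B, N, h*w)
        s = alpha @ feat                                                     # aligned 1-D features s_t
        f_s, _ = self.lstm(s)                                                # semantic features F_s
        return f_s, self.cls(f_s)                                            # F_s and logits Y_s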
Further, the visual semantic feature fusion module is used for making the visual features and the semantic features interact to obtain enhanced visual features and enhanced semantic features, and a gate mechanism then decides how to fuse the enhanced visual features and the enhanced semantic features.
Illustratively, the visual features F_v and the semantic features F_s are obtained by the decoding of the visual feature extraction module and the semantic feature extraction module, respectively. In order to fully utilize both, this embodiment adopts a new visual-semantic feature interaction: the self-attention paradigm above is still followed, except that during interaction one feature supplies the query while the other feature supplies the key and the value. The graphical representation is as shown in fig. 2, and the specific formulation is described as follows:

F'_v = softmax(F_v F_s^T / sqrt(d)) F_s        (3.1)
F'_s = softmax(F_s F_v^T / sqrt(d)) F_v        (3.2)

where F_s corresponds to the semantic features obtained by formula (2.10), F_v corresponds to the visual features obtained by formula (1.8), d is the dimension of the feature, a hyper-parameter, softmax refers to the softmax function, and F'_v and F'_s are the visually dominated fused vector and the semantically dominated fused vector, respectively.
Through this step, interacted visual features and semantic features are obtained, so that the visual information and the semantic information are fully utilized. A gate mechanism is then used to fuse the two interacted features. The specific formulation is as follows:

g = σ(W_g [F'_v ; F'_s])        (3.3)
F_f = g ⊙ F'_v + (1 - g) ⊙ F'_s        (3.4)

where in formula (3.3) W_g is a trainable parameter, F'_v and F'_s are the results of formulas (3.1) and (3.2), [F'_v ; F'_s] denotes their concatenation, and σ is the sigmoid function; in formula (3.4) F_f is the vector obtained by the final fusion, g is the result obtained by formula (3.3), and F'_v and F'_s are again the results obtained by formulas (3.1) and (3.2). Finally, a full connection is applied to F_f to obtain the final result:

Y_f = FC(F_f)        (3.5)

where FC is a fully connected layer, F_f is the output of formula (3.4), and Y_f is the final prediction result.
After the final result is obtained, a cross-entropy loss on this result is adopted to supervise the training of the model. The formulation is described as follows:

L_f = Σ_{t=1}^{N} ce(Y_f^(t), y_t)        (3.6)

where Y_f refers to the result obtained in formula (3.5), Y_f^(t) corresponds to the prediction of the t-th time slice, i.e. the prediction of the t-th character, y_t is the true value of the t-th character, and, as in formulas (1.10) and (2.11), L_f is the cross-entropy loss with which the fusion module supervises the training of the model.
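Illustratively, the fusion of formulas (3.1)-(3.5) can be sketched as follows; the dimensions and the single-matrix gate are assumptions of the sketch only.

# Hypothetical sketch of the visual-semantic fusion module: bidirectional cross-attention
# followed by a gate that mixes the two interacted vectors and a classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualSemanticFusion(nn.Module):
    def __init__(self, d_model=512, num_classes=37):    # assumed sizes
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)      # W_g of formula (3.3)
        self.cls = nn.Linear(d_model, num_classes)       # FC of formula (3.5)
        self.scale = d_model ** 0.5

    def forward(self, f_v, f_s):                         # both: (B, N, d)
        # Visually dominated fusion: visual queries attend over semantic keys/values, formula (3.1).
        fv2 = F.softmax(f_v @ f_s.transpose(1, 2) / self.scale, dim=-1) @ f_s
        # Semantically dominated fusion: semantic queries attend over visual keys/values, formula (3.2).
        fs2 = F.softmax(f_s @ f_v.transpose(1, 2) / self.scale, dim=-1) @ f_v
        g = torch.sigmoid(self.gate(torch.cat([fv2, fs2], dim=-1)))   # gate of formula (3.3)
        fused = g * fv2 + (1 - g) * fs2                                # formula (3.4)
        return self.cls(fused)                                         # logits Y_f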
Further, the prediction module identifies the enhanced visual features and the enhanced semantic features to obtain a final text identification result;
and for the enhanced visual features and the enhanced semantic features, obtaining an overall prediction result by adopting a full connection layer, and performing supervised training by using a third cross entropy loss.
Further, the training process of the trained deep learning model comprises:
constructing a training set; the training set is a natural scene image of a known text recognition result;
inputting the training set into a deep learning model, training the model, and stopping training when the total loss function value is minimum or reaches a set iteration number to obtain a trained deep learning model;
wherein the total loss function is a summation result of the first, second and third cross-entropy loss functions.
Wherein, the total loss function is:

L = λ_v L_v + λ_s L_s + λ_f L_f

The objective function is composed of three parts, where λ_v, λ_s and λ_f are hyper-parameters used for balancing, L_v is the loss of the visual part in formula (1.10), L_s is the loss of the semantic part in formula (2.11), and L_f is the fusion loss in formula (3.6). All the text recognition loss functions are cross-entropy loss functions.
Further, the constructing a training set includes:
acquiring a natural scene image;
carrying out augmentation processing on the natural scene image;
and carrying out size normalization processing on the natural scene image after the augmentation processing.
For example, the augmented natural scene images need size normalization because text images in natural scenes have varied shapes and aspect ratios; to obtain a model with better generalization performance, the width and the height of each text image are set to a preset width value and a preset height value, respectively.
Illustratively, the natural scene images need augmentation because rotation, perspective distortion, blurring and similar problems are widespread in natural scenes; this embodiment applies, with a preset probability, image augmentations such as randomly adding Gaussian noise, rotation and perspective distortion to the original image.
For the ground-truth text labels of the images, a character dictionary containing all English characters, the digits 0-9 and an end symbol is used; since the task is sequence classification, this embodiment uses a preset length value to specify the maximum length of the text.
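Illustratively, such a preprocessing pipeline might look as follows. The 32x128 target size, the probability 0.5, the rotation angle and the exact character dictionary are assumed stand-ins for the preset values mentioned above, and the Gaussian-noise augmentation is omitted for brevity.

# Hypothetical preprocessing sketch: random augmentation with a preset probability,
# size normalization to a preset width/height, and label encoding with a character
# dictionary (English letters, digits 0-9 and an end symbol). Values are assumptions.
import string
from PIL import Image
import torchvision.transforms as T

CHARS = string.ascii_lowercase + string.digits         # character dictionary
EOS = len(CHARS)                                        # index of the end symbol
MAX_LEN = 25                                            # preset maximum text length

augment = T.RandomApply([T.RandomRotation(10), T.RandomPerspective()], p=0.5)
normalize = T.Compose([T.Resize((32, 128)), T.ToTensor()])   # preset height x width

def encode_label(text):
    ids = [CHARS.index(c) for c in text.lower() if c in CHARS][:MAX_LEN - 1]
    ids.append(EOS)                                      # terminate with the end symbol
    return ids + [EOS] * (MAX_LEN - len(ids))            # pad to the fixed length

def load_sample(path, text):
    img = augment(Image.open(path).convert("RGB"))
    return normalize(img), encode_label(text)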
Example two
The embodiment provides a text recognition system for natural scene images;
a system for text recognition of images of natural scenes, comprising:
an acquisition module configured to: acquiring a natural scene image to be identified;
an identification module configured to: performing text recognition on the natural scene image to be recognized by adopting the trained deep learning model to obtain a recognized text;
the deep learning model first rectifies the natural scene image to be recognized and then extracts a feature vector from the rectified image; visual features and semantic features are then extracted separately from the feature vector of the image, the two kinds of features are fused, and text recognition is finally performed on the fused features.
It should be noted here that the above-mentioned acquiring module and the identifying module correspond to steps S101 to S102 in the first embodiment, and the above-mentioned modules are the same as the examples and application scenarios implemented by the corresponding steps, but are not limited to the disclosure of the first embodiment. It should be noted that the modules described above as part of a system may be implemented in a computer system such as a set of computer executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The proposed system can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the above-described modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules may be combined or integrated into another system, or some features may be omitted, or not executed.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (3)
1. The text recognition method of the natural scene image is characterized by comprising the following steps:
acquiring a natural scene image to be identified;
performing text recognition on the natural scene image to be recognized by adopting the trained deep learning model to obtain a recognized text;
the deep learning model first rectifies the natural scene image to be recognized and then extracts a feature vector from the rectified image; visual features and semantic features are then extracted separately from the feature vector of the image, the two kinds of features are fused, and text recognition is finally performed on the fused features;
the deep learning model comprises: the correction module is connected with the input end of a backbone network, and the output end of the backbone network is respectively connected with the input end of the visual feature extraction module and the input end of the semantic feature extraction module; the output end of the visual characteristic extraction module and the output end of the semantic characteristic extraction module are both connected with the input end of the visual semantic characteristic fusion module, the output end of the visual semantic characteristic fusion module is connected with the input end of the prediction module, and the output end of the prediction module outputs a text recognition result;
the backbone network is realized by adopting a ResNet convolution neural network and is used for extracting spatial features of a natural scene image, position information embedding is carried out on the spatial features by adopting position coding, and feature enhancement is carried out on the spatial features after the position information embedding by adopting a first Transformer neural network to obtain feature vectors of the image;
the visual feature extraction module is used for separating a visual part and a semantic part of the image feature vector by adopting a second transform neural network, and decoding the visual part by adopting a position attention mechanism module to obtain visual features; the position attention mechanism module is used for inquiring self-attention mechanism self-attentionKey, keySum valueReplacing with different elements; wherein the queryIs replaced by position code, keyIs replaced by the output value, value of the UNet networkIs replaced byIdentity mapping of (2);
wherein the feature vector of the image is obtained as F = T(R(x) + P), where P is the positional encoding, x refers to the input image, R refers to the residual neural network, and T refers to the Transformer network;
wherein the second Transformer neural network obtains the visual feature as V = softmax(F F^T / sqrt(d)) F in formula (1.5), where the query, key and value are all identity mappings of the feature F obtained from formula (1.4), d is the dimension of the feature, a hyper-parameter, and softmax refers to the softmax function;
wherein the key is obtained as K = U(V), where V is the feature obtained by formula (1.5), U is the UNet network, and K is the resulting key;
the semantic feature extraction module adopts an attention mechanism module to align the feature vectors of the images, and then decodes the aligned data to obtain semantic features; wherein, the alignment processing refers to aligning the two-dimensional features into one-dimensional features; the aligned data is decoded and realized by adopting a long-term and short-term memory network.
2. The method for recognizing texts in images of natural scenes according to claim 1, wherein the visual semantic feature fusion module is configured to interact with the visual features and the semantic features to obtain enhanced visual features and enhanced semantic features; and adopting a door mechanism decision to fuse the enhanced visual features and the enhanced semantic features.
3. The text recognition system for a natural scene image using the text recognition method for a natural scene image according to claim 1, comprising:
an acquisition module configured to: acquiring a natural scene image to be identified;
an identification module configured to: performing text recognition on the natural scene image to be recognized by adopting the trained deep learning model to obtain a recognized text;
the deep learning model first rectifies the natural scene image to be recognized and then extracts a feature vector from the rectified image; visual features and semantic features are then extracted separately from the feature vector of the image, the two kinds of features are fused, and text recognition is finally performed on the fused features.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210483188.4A CN114581906B (en) | 2022-05-06 | 2022-05-06 | Text recognition method and system for natural scene image |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210483188.4A CN114581906B (en) | 2022-05-06 | 2022-05-06 | Text recognition method and system for natural scene image |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114581906A CN114581906A (en) | 2022-06-03 |
CN114581906B true CN114581906B (en) | 2022-08-05 |
Family
ID=81784282
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210483188.4A Active CN114581906B (en) | 2022-05-06 | 2022-05-06 | Text recognition method and system for natural scene image |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114581906B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117912005B (en) * | 2024-03-19 | 2024-07-05 | 中国科学技术大学 | Text recognition method, system, device and medium using single mark decoding |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110033008A (en) * | 2019-04-29 | 2019-07-19 | 同济大学 | A kind of iamge description generation method concluded based on modal transformation and text |
CN114219990A (en) * | 2021-11-30 | 2022-03-22 | 南京信息工程大学 | Natural scene text recognition method based on representation batch normalization |
CN114299510A (en) * | 2022-03-08 | 2022-04-08 | 山东山大鸥玛软件股份有限公司 | Handwritten English line recognition system |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111767379B (en) * | 2020-06-29 | 2023-06-27 | 北京百度网讯科技有限公司 | Image question-answering method, device, equipment and storage medium |
CN112733768B (en) * | 2021-01-15 | 2022-09-09 | 中国科学技术大学 | Natural scene text recognition method and device based on bidirectional characteristic language model |
CN113343707B (en) * | 2021-06-04 | 2022-04-08 | 北京邮电大学 | Scene text recognition method based on robustness characterization learning |
CN113657399B (en) * | 2021-08-18 | 2022-09-27 | 北京百度网讯科技有限公司 | Training method of character recognition model, character recognition method and device |
-
2022
- 2022-05-06 CN CN202210483188.4A patent/CN114581906B/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110033008A (en) * | 2019-04-29 | 2019-07-19 | 同济大学 | A kind of iamge description generation method concluded based on modal transformation and text |
CN114219990A (en) * | 2021-11-30 | 2022-03-22 | 南京信息工程大学 | Natural scene text recognition method based on representation batch normalization |
CN114299510A (en) * | 2022-03-08 | 2022-04-08 | 山东山大鸥玛软件股份有限公司 | Handwritten English line recognition system |
Also Published As
Publication number | Publication date |
---|---|
CN114581906A (en) | 2022-06-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Liu et al. | Synthetically supervised feature learning for scene text recognition | |
CN113591546B (en) | Semantic enhancement type scene text recognition method and device | |
CN113343707B (en) | Scene text recognition method based on robustness characterization learning | |
CN111027562B (en) | Optical character recognition method based on multiscale CNN and RNN combined with attention mechanism | |
EP3772036A1 (en) | Detection of near-duplicate image | |
CN111428718A (en) | Natural scene text recognition method based on image enhancement | |
CN115471851A (en) | Burma language image text recognition method and device fused with double attention mechanism | |
CN112686219B (en) | Handwritten text recognition method and computer storage medium | |
CN111444367A (en) | Image title generation method based on global and local attention mechanism | |
CN113408535B (en) | OCR error correction method based on Chinese character level features and language model | |
CN116304984A (en) | Multi-modal intention recognition method and system based on contrast learning | |
CN114092930B (en) | Character recognition method and system | |
Nikitha et al. | Handwritten text recognition using deep learning | |
CN114581906B (en) | Text recognition method and system for natural scene image | |
CN111985525A (en) | Text recognition method based on multi-mode information fusion processing | |
CN114463688A (en) | Cross-modal context coding dialogue emotion recognition method and system | |
CN112149644A (en) | Two-dimensional attention mechanism text recognition method based on global feature guidance | |
CN117710986B (en) | Method and system for identifying interactive enhanced image text based on mask | |
CN116152824A (en) | Invoice information extraction method and system | |
Tayyab et al. | Recognition of Visual Arabic Scripting News Ticker From Broadcast Stream | |
CN114581920A (en) | Molecular image identification method for double-branch multi-level characteristic decoding | |
CN113158828B (en) | Facial emotion calibration method and system based on deep learning | |
CN112084788A (en) | Automatic marking method and system for implicit emotional tendency of image captions | |
CN113837231B (en) | Image description method based on data enhancement of mixed sample and label | |
CN115984877A (en) | Handwriting recognition method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |