CN114581906B - Text recognition method and system for natural scene image - Google Patents

Text recognition method and system for natural scene image

Info

Publication number
CN114581906B
CN114581906B (application CN202210483188.4A)
Authority
CN
China
Prior art keywords
features
semantic
image
visual
natural scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210483188.4A
Other languages
Chinese (zh)
Other versions
CN114581906A (en)
Inventor
许信顺
王彬
罗昕
陈振铎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202210483188.4A priority Critical patent/CN114581906B/en
Publication of CN114581906A publication Critical patent/CN114581906A/en
Application granted granted Critical
Publication of CN114581906B publication Critical patent/CN114581906B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of data recognition and discloses a text recognition method and system for natural scene images. The method comprises the following steps: acquiring a natural scene image to be recognized; and performing text recognition on the image with a trained deep learning model to obtain the recognized text. The deep learning model first rectifies the natural scene image to be recognized and then extracts a feature vector from the rectified image; visual features and semantic features are then extracted separately from the feature vector of the image, the two kinds of features are fused, and text recognition is finally performed on the fused features. The method can recognize scene text of arbitrary shape, covers a wide range of application scenarios, generalizes well, and can be applied to many text recognition settings.

Description

Text recognition method and system for natural scene image
Technical Field
The invention relates to the technical field of data recognition, and in particular to a text recognition method and system for natural scene images.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Text recognition is a branch of computer vision research; it belongs to pattern recognition and artificial intelligence and is an important component of computer science. The task is to use computer technology to recognize text in natural scenes and convert it into a format that a computer can display and people can understand. Text recognition can greatly accelerate information processing.
Because text is sequential, scene text recognition must bridge a large semantic gap between a two-dimensional text image and a one-dimensional text sequence, which gives the task its particular difficulty. When people read text in a scene they are helped by many factors, such as background, color and even cultural context.
For example, on seeing the yellow symbol on a white background of the McDonald's sign, a person subconsciously knows that the writing is "McDonald's"; often, after a first glance at a scene, a person can tell what the text is without consciously reading what is written. For a computer, however, such complex scenes bring no such benefit. Real scenes are extremely varied in font, style, background and so on, and these interferences cause a computer to frequently misrecognize certain characters; once a character is misrecognized, the result usually deviates from the meaning of the text.
Current text recognition methods fall roughly into two directions. One direction exploits the strong feature extraction capability of neural networks in the deep learning era to optimize the visual features and obtain stronger representations. The other direction performs semantic enhancement on the features extracted from the two-dimensional visual image by a feature extractor, or corrects the result produced by the visual model according to the semantics of the text.
In recent years, most of the best-performing models have been semantics-based. In such methods, a feature extractor is generally used to turn the two-dimensional picture into a visual feature map, a semantic model further encodes that feature map to obtain semantic features, and the visual and semantic features are then used together to produce the final recognition result. This approach, however, makes the semantic model highly dependent on the visual features.
Coupling the semantic features to the visual features in this way has two disadvantages. First, the semantic model is used only to correct the results produced by the visual model; although the whole network is trained end to end, the semantic model effectively acts as an independent post-processing stage, which enlarges the model and lengthens the gradient chain, making training difficult. Second, correcting the visual model with a semantic model does strengthen the recognition ability of the model, but natural scenes contain all kinds of erroneous text, for example in the recognition and correction of handwritten examination papers; after a semantic model is applied, erroneous text is automatically "corrected" into valid words, which clearly defeats the purpose of such correction.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a text recognition method and system for natural scene images. Unlike prior models, which decouple the visual and semantic models only by truncating gradients, the semantics-independent text recognition network adjusts the model structure so that the two are completely decoupled structurally. A new semantic module is designed to make full use of semantic information. A new visual-semantic feature fusion module is designed so that the visual features and the semantic features interact fully and both are fully exploited. A gate mechanism then fuses the enhanced visual information and semantic information to obtain the final prediction result.
In a first aspect, the invention provides a text recognition method for natural scene images;
the text recognition method for the natural scene image comprises the following steps:
acquiring a natural scene image to be recognized;
performing text recognition on the natural scene image to be recognized with a trained deep learning model to obtain the recognized text;
the deep learning model first rectifies the natural scene image to be recognized and then extracts a feature vector from the rectified image; visual features and semantic features are then extracted separately from the feature vector of the image, the two kinds of features are fused, and text recognition is finally performed on the fused features.
In a second aspect, the present invention provides a text recognition system for images of natural scenes;
a system for text recognition of images of natural scenes, comprising:
an acquisition module configured to: acquire a natural scene image to be recognized;
a recognition module configured to: perform text recognition on the natural scene image to be recognized with a trained deep learning model to obtain the recognized text;
the deep learning model first rectifies the natural scene image to be recognized and then extracts a feature vector from the rectified image; visual features and semantic features are then extracted separately from the feature vector of the image, the two kinds of features are fused, and text recognition is finally performed on the fused features.
Compared with the prior art, the invention has the beneficial effects that:
1) The method can recognize scene text of arbitrary shape, covers a wide range of application scenarios, generalizes well, and can be applied to many text recognition settings.
2) The invention adjusts the model structure to reduce the coupling of the semantic module to the visual module and to make the visual module an independent part, so that the model is easier to train end to end without a complex multi-stage or pre-training process.
3) The invention provides a new semantic module for text recognition that processes the semantic information of the text on an equal footing with the visual module and produces a preliminary semantic recognition result at this branch to guide model training.
4) The invention provides a new visual-semantic fusion scheme that lets the semantic features and the visual features interact as equals, makes full use of the information in both, and uses a gate mechanism to make the final decision on the fused information, thereby obtaining a better recognition result.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention without limiting it.
FIG. 1 is a functional block diagram of the entire network;
FIG. 2 is a schematic diagram of the feature fusion of visual features and semantic features.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
All data used in the embodiments are obtained and used lawfully, in compliance with laws and regulations and with the users' consent.
Example one
The embodiment provides a text recognition method of a natural scene image;
the text recognition method of the natural scene image comprises the following steps:
S101: acquiring a natural scene image to be recognized;
S102: performing text recognition on the natural scene image to be recognized with a trained deep learning model to obtain the recognized text;
the deep learning model first rectifies the natural scene image to be recognized and then extracts a feature vector from the rectified image; visual features and semantic features are then extracted separately from the feature vector of the image, the two kinds of features are fused, and text recognition is finally performed on the fused features.
Further, as shown in FIG. 1, the deep learning model includes: a rectification module connected to the input of a backbone network, whose output is connected to the inputs of both the visual feature extraction module and the semantic feature extraction module; the outputs of the visual feature extraction module and the semantic feature extraction module are both connected to the input of the visual-semantic feature fusion module, the output of the visual-semantic feature fusion module is connected to the input of the prediction module, and the prediction module outputs the text recognition result.
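For illustration only, the following PyTorch-style sketch shows how the five modules described above could be wired together. The class and argument names (SceneTextRecognizer, rectifier and so on) are placeholders chosen for readability and are not taken from the patent; each sub-module is assumed to be supplied separately.

```python
import torch
import torch.nn as nn

class SceneTextRecognizer(nn.Module):
    """Sketch of the module wiring described above (names are illustrative)."""
    def __init__(self, rectifier, backbone, visual_branch, semantic_branch, fusion_head):
        super().__init__()
        self.rectifier = rectifier              # TPS-based rectification module
        self.backbone = backbone                # ResNet + positional encoding + Transformer
        self.visual_branch = visual_branch      # visual feature extraction module
        self.semantic_branch = semantic_branch  # semantic feature extraction module
        self.fusion_head = fusion_head          # visual-semantic fusion + prediction module

    def forward(self, image):
        rectified = self.rectifier(image)                   # straighten curved text
        feats = self.backbone(rectified)                    # shared feature vectors F
        vis_feats, vis_logits = self.visual_branch(feats)   # visual features + auxiliary prediction
        sem_feats, sem_logits = self.semantic_branch(feats) # semantic features + auxiliary prediction
        fused_logits = self.fusion_head(vis_feats, sem_feats)
        return fused_logits, vis_logits, sem_logits         # all three are supervised during training
```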
Further, the rectification module is implemented with the thin-plate spline (TPS) interpolation algorithm and is used to rectify a curved image into a regular shape.
It should be understood that curved text is common in natural scenes, and these large variations in image background, appearance and layout pose a significant challenge to text recognition that conventional Optical Character Recognition (OCR) methods cannot handle effectively. Training a single recognition model that copes with all kinds of scenes is difficult, so this embodiment rectifies the image with the thin-plate spline interpolation algorithm TPS to obtain more regular text. The key problem of TPS is to locate 2J control points and the transformation matrix between the control points, where J is a hyper-parameter. The 2J control points are obtained by regression: the input image is passed through a 3-layer convolutional neural network to extract image features, and a fully connected layer then produces 2J outputs corresponding to the control points. The transformation matrix has an analytical solution obtained from the norms between the pixel points and the control points and between the control points themselves. This step yields a preliminarily rectified text image.
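As a hedged sketch of the control-point regression described above, the following code uses a 3-layer convolutional network followed by a fully connected layer that regresses (x, y) coordinates for the control points. The layer widths, the default number of control points and the tanh normalization are illustrative assumptions; the TPS grid generation and image sampling that use these points are not shown.

```python
import torch
import torch.nn as nn

class TPSLocalizationNet(nn.Module):
    """Regresses control points for TPS rectification (illustrative sizes)."""
    def __init__(self, num_control_points=20, in_channels=3):
        super().__init__()
        self.num_control_points = num_control_points
        # 3-layer convolutional feature extractor, as described in the embodiment
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # fully connected regression of the control-point coordinates
        self.fc = nn.Linear(128, num_control_points * 2)

    def forward(self, image):
        feats = self.conv(image).flatten(1)
        # tanh keeps the predicted (x, y) coordinates in a normalized range
        return torch.tanh(self.fc(feats)).view(-1, self.num_control_points, 2)
```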
Furthermore, the backbone network is implemented with a ResNet (Residual Neural Network) convolutional neural network. It extracts spatial features of the natural scene image, embeds position information into the spatial features with a position code, and enhances the position-embedded spatial features with a first Transformer neural network to obtain the feature vectors of the image.
Wherein the position code is defined as:

PE(pos, 2i) = sin(pos / 10000^(2i/d)) ;(1.1)

PE(pos, 2i+1) = cos(pos / 10000^(2i/d)) ;(1.2)

where PE(pos, ·) is the position code corresponding to position pos, a one-dimensional vector, PE(pos, i) is the value of the i-th dimension of the code at position pos, and d is the dimension of the code. Equation (1.3) appears in the original record only as an image and is not reproduced here. ;(1.3)
It should be understood that, because scene text images vary in background style, font style and shooting noise, a deep convolutional neural network is usually adopted as the image encoder. This embodiment selects ResNet45 as the backbone network; its residual connections reduce the model degradation caused by very deep networks and alleviate vanishing gradients. In addition, because text is sequential, position coding and a Transformer are applied after the residual network to obtain enhanced features. The position code introduces the position information of the image so that the network pays more attention to the positional relationships between features in the two-dimensional feature map. In this Transformer the query Q, the key K and the value V are all the feature itself, which lets the network mine the relationships among the internal features of the feature map. The formulation is:

F = Transformer(ResNet45(X) + PE) ;(1.4)

where PE is the position code obtained in equations (1.1) and (1.2), X is the input image, ResNet45(·) is the residual neural network, and Transformer(·) is the Transformer network. In other words, the image is fed into the ResNet45 network to obtain features, the features are added to the position code, and the sum is sent to the Transformer network.
Further, the visual feature extraction module separates the visual part and the semantic part of the image feature vector with a second Transformer neural network, and decodes the visual part with a position attention mechanism module to obtain the visual features.
The position attention mechanism module replaces the query Q, key K and value V of the self-attention mechanism with different elements: the query Q is replaced by a position code, the key K is replaced by the output of a UNet network, and the value V is replaced by an identity mapping of the feature F_v.
The position attention mechanism is given by equations (1.6)-(1.8).
After the visual features produced by the second Transformer network are decoded by the position attention mechanism module, the decoding result is predicted with a fully connected layer, and the cross-entropy loss of equation (1.10) is used for supervised training.
Illustratively, in this embodiment the visual feature extraction module uses a Transformer to further extract high-dimensional visual features, which reduces the dependence of the subsequent decoding on the backbone network and decouples the decoding of the semantic part from the decoding of the visual part. The visual part is characterized by two-dimensional features, while the semantic part is characterized by a one-dimensional result sequence; this is the fundamental difference between the two. In this step the feature F is used to obtain F_v as follows:

F_v = softmax(F F^T / sqrt(d)) F ;(1.5)

where the query, key and value in equation (1.5) are all identity mappings of the feature F obtained from equation (1.4), d is the feature dimension, a hyper-parameter, and softmax(·) is the softmax function. Through this step the visual part is decoded more deeply, so that its decoding is separated from the semantic part.
Next, the visual features are converted to a character sequence with the position attention mechanism module. It also uses the self-attention formula, but unlike equation (1.5), where Q, K and V are all identity mappings of F, equation (1.8) uses a different encoding for each of Q, K and V.
In equation (1.5) the query encodes the positional relationships of the two-dimensional visual features, so the feature F itself is used; here, the query Q in equation (1.6) encodes the order relationship between the characters within the word, so a word embedding of the position sequence is adopted:

Q = E(O) ;(1.6)

where E(·) is a word-embedding function, O is the character order [0, 1, ..., N], and Q is the resulting position code.
The key K is produced by a UNet network. The UNet is not used for character-level segmentation to guide training, but as a feature enhancement step that yields new features of the same size as the original feature map. The formulation is:

K = UNet(F_v) ;(1.7)

where F_v is the feature obtained by equation (1.5), UNet(·) is a UNet network, and K is the resulting key.
The value V is an identity mapping of the same feature F_v. The final feature map of the visual part is then obtained with the self-attention formula, and the result of the visual part is obtained after decoding:

F_p = softmax(Q K^T / sqrt(d)) V ;(1.8)

Y_v = FC(F_p) ;(1.9)
where Q in equation (1.8) is the Q of equation (1.6), K is the K of equation (1.7), V is an identity mapping of F_v, and the resulting F_p is the final feature of the visual part; in equation (1.9), FC(·) is a fully connected layer, F_p is the output obtained in equation (1.8), and Y_v ∈ R^(N×C) is the recognition result of the visual part, where R denotes the set of real numbers, N is the maximum length of a word (a predefined value) and C is the number of character categories (also a predefined value).
This embodiment introduces a cross-entropy loss to guide the training of the visual part:

L_v = (1/T) * Σ_{i=1}^{T} CE(y_v^i, g_i) ;(1.10)

where y_v^i is the i-th time step of Y_v in equation (1.9), g_i is the ground-truth label, T is the length of the word, and L_v is the cross-entropy between each predicted character and its true value.
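The following sketch mirrors equations (1.6)-(1.10): the query is an embedding of the character order, the key comes from a UNet-style encoder passed in as an arbitrary module, and the value is the visual feature itself. The dimensions, maximum word length and number of character classes are illustrative defaults, not values taken from the patent.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionAttentionDecoder(nn.Module):
    """Sketch of equations (1.6)-(1.9) with illustrative dimensions."""
    def __init__(self, key_encoder: nn.Module, d_model: int = 512,
                 max_len: int = 25, num_classes: int = 37):
        super().__init__()
        self.key_encoder = key_encoder                          # stands in for the UNet of eq. (1.7)
        self.order_embedding = nn.Embedding(max_len, d_model)   # eq. (1.6)
        self.classifier = nn.Linear(d_model, num_classes)       # eq. (1.9)
        self.max_len = max_len
        self.d_model = d_model

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (B, L, d_model), the features F_v of equation (1.5)
        b = visual_feats.size(0)
        order = torch.arange(self.max_len, device=visual_feats.device)
        q = self.order_embedding(order).unsqueeze(0).expand(b, -1, -1)   # (B, N, d)
        k = self.key_encoder(visual_feats)                               # eq. (1.7)
        v = visual_feats                                                 # identity mapping
        attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(self.d_model), dim=-1)
        glimpse = attn @ v                                               # eq. (1.8)
        return self.classifier(glimpse)                                  # (B, N, C), eq. (1.9)

def visual_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Eq. (1.10): per-character cross-entropy against the ground-truth labels."""
    return F.cross_entropy(logits.flatten(0, 1), targets.flatten())
```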
Furthermore, the semantic feature extraction module aligns feature vectors of the images by using an attention mechanism module, and then decodes the aligned data to obtain semantic features.
The alignment process refers to aligning two-dimensional features into one-dimensional features.
The aligned data is decoded with a Long Short-Term Memory network (LSTM).
The LSTM output is passed through a fully connected layer to predict the semantic result, and a second cross-entropy loss function is used for supervised training.
Illustratively, when the semantic feature extraction module is coupled to the visual feature extraction module, its training depends heavily on the visual module; this embodiment therefore makes the semantic feature extraction module independent and encodes the backbone features separately, which reduces the coupling of the model.
The semantic feature extraction module aligns the two-dimensional features produced by the backbone network with an attention mechanism so that they become well-aligned one-dimensional features. In addition, this embodiment introduces the position code PE into the attention mechanism so that attention focuses not only on discriminative regions but also on the positional relationships of the text within the image. The formulation is:

e_{t,i} = W_3 tanh(W_1 E(o_t) + W_2 (f_i + PE)) ;(2.1)

α_{t,i} = exp(e_{t,i}) / Σ_{i'} exp(e_{t,i'}) ;(2.2)

where, in equation (2.1), W_1, W_2 and W_3 are trainable parameters, o_t is the character order taken from [0, 1, ..., N], E(·) is a word-embedding operation, and f_i is the feature vector obtained in equation (1.4) at position i; in equation (2.2), the exponential with base e is taken of the scores obtained by equation (2.1), N is the preset maximum length of a word (the same hyper-parameter wherever it appears), and α_{t,i} is the weight of position i at time step t. Multiplying the weights back onto the feature sequence F yields the aligned one-dimensional sequence features:

s_t = Σ_i α_{t,i} f_i ;(2.3)

where f_i is the feature vector obtained by equation (1.4), α_{t,i} is the weight obtained by equation (2.2), and s_t is the resulting aligned one-dimensional sequence.
After the aligned one-dimensional sequence features are obtained, they are decoded with a Long Short-Term Memory (LSTM) network. Because of the time factor, decoding is performed in a single pass once the aligned result is available rather than autoregressively. The formulation is:

i_t = σ(W_i [h_{t-1}, s_t] + b_i) ;(2.4)

f_t = σ(W_f [h_{t-1}, s_t] + b_f) ;(2.5)

o_t = σ(W_o [h_{t-1}, s_t] + b_o) ;(2.6)

c̃_t = tanh(W_c [h_{t-1}, s_t] + b_c) ;(2.7)

c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t ;(2.8)

h_t = o_t ⊙ tanh(c_t) ;(2.9)

H = [h_1, h_2, ..., h_N] ;(2.10)

where i_t, f_t and o_t are the three gates of the LSTM, σ is the sigmoid function, all W and b are learnable parameters, h_{t-1} is the hidden state of the previous time step, s_t is the result obtained by equation (2.3) and serves as the input at time t, tanh is the tanh function, ⊙ denotes the element-wise product, and h_t (collected into H) is the output of the LSTM.
A fully connected layer then produces the final recognition result of the semantic branch:

Y_s = FC(H) ;(2.11)

where H is the output of the LSTM in equation (2.10), FC(·) is a fully connected layer, and Y_s ∈ R^(N×C) is the final prediction of the semantic model computed from the decoded semantic features, where N is the maximum length of a word (a predefined value) and C is the number of character categories (also a predefined value).
Through the LSTM decoding the decoded semantic features are obtained, and the recognition result of the semantic part is obtained after alignment and full connection. As with the visual model, cross entropy is adopted to supervise this part:

L_s = (1/T) * Σ_{i=1}^{T} CE(y_s^i, g_i) ;(2.12)

where y_s^i is the predicted value at time step i obtained from equation (2.11), g_i is the corresponding true value at time step i, T is the length of the characters, and L_s is the loss between each predicted character and the real character.
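A minimal sketch of the semantic branch of equations (2.4)-(2.12), assuming the standard LSTM cell provided by PyTorch and placeholder dimensions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticDecoder(nn.Module):
    """Decodes the aligned sequence in one pass (not autoregressively) and classifies each step."""
    def __init__(self, d_model: int = 512, num_classes: int = 37):
        super().__init__()
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)  # gates of eqs. (2.4)-(2.9)
        self.classifier = nn.Linear(d_model, num_classes)        # eq. (2.11)

    def forward(self, aligned_seq: torch.Tensor):
        # aligned_seq: (B, N, d) from eq. (2.3)
        hidden, _ = self.lstm(aligned_seq)   # semantic features H, eq. (2.10)
        logits = self.classifier(hidden)     # (B, N, C)
        return hidden, logits

def semantic_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Eq. (2.12): per-character cross-entropy for the semantic branch."""
    return F.cross_entropy(logits.flatten(0, 1), targets.flatten())
```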
Further, the visual-semantic feature fusion module lets the visual features and the semantic features interact to obtain enhanced visual features and enhanced semantic features, and uses a gate mechanism to decide how to fuse the enhanced visual features with the enhanced semantic features.
Illustratively, the visual feature extraction module and the semantic feature extraction module yield, after decoding, the visual features F_p and the semantic features H respectively. To make full use of both, this embodiment adopts a new visual-semantic interaction scheme. It again follows the self-attention paradigm, except that here it fuses the visual and semantic features: during the interaction one feature supplies the query Q, while the other supplies the key K and the value V. A graphical representation is shown in FIG. 2, and the formulation is as follows:
F'_v = softmax(F_p H^T / sqrt(d)) H ;(3.1)

F'_s = softmax(H F_p^T / sqrt(d)) F_p ;(3.2)

where H is the semantic feature obtained by equation (2.10), F_p is the visual feature obtained by equation (1.8), d is the feature dimension, a hyper-parameter, softmax(·) is the softmax function, and F'_v and F'_s are the visually dominant fused vector and the semantically dominant fused vector.
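A small sketch of the interaction of equations (3.1)/(3.2), in which each modality attends to the other; the tensor shapes are assumptions made for the example:

```python
import math
import torch

def cross_modal_interaction(vis: torch.Tensor, sem: torch.Tensor, d: int):
    """Eqs. (3.1)/(3.2): vis (B, N, d) from eq. (1.8), sem (B, N, d) from eq. (2.10)."""
    scale = math.sqrt(d)
    vis_fused = torch.softmax(vis @ sem.transpose(1, 2) / scale, dim=-1) @ sem  # eq. (3.1)
    sem_fused = torch.softmax(sem @ vis.transpose(1, 2) / scale, dim=-1) @ vis  # eq. (3.2)
    return vis_fused, sem_fused
```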
Through this step the interacted visual and semantic features are obtained, so that both the visual information and the semantic information are fully used. A gate mechanism then fuses the two interacted features:

g = σ(W_g [F'_v ; F'_s]) ;(3.3)

F_f = g ⊙ F'_v + (1 − g) ⊙ F'_s ;(3.4)

where, in equation (3.3), W_g is a trainable parameter, F'_v and F'_s are the results of equations (3.1) and (3.2), and [F'_v ; F'_s] denotes their concatenation. In equation (3.4), F_f is the finally fused vector, g is the result of equation (3.3), and F'_v and F'_s are again the results of equations (3.1) and (3.2). Finally F_f is passed through a fully connected layer to obtain the final result Y_f:

Y_f = FC(F_f) ;(3.5)

where FC(·) is a fully connected layer, F_f is the output of equation (3.4), and Y_f is the final prediction result.
After the final result is obtained, a cross-entropy loss on this result supervises the training of the model:

L_f = (1/T) * Σ_{i=1}^{T} CE(y_f^i, g_i) ;(3.6)

where y_f^i is the prediction for the i-th time step, i.e. the i-th character, of the result obtained in equation (3.5), g_i is the true value of the i-th character and, as in equations (1.10) and (2.12), L_f is the cross-entropy loss with which the fusion module supervises the training of the model.
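The gate fusion and final prediction of equations (3.3)-(3.6) could be sketched as follows; the layer sizes are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusionHead(nn.Module):
    """A learned gate decides, per dimension, how much of each fused vector to keep."""
    def __init__(self, d_model: int = 512, num_classes: int = 37):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)         # trainable W_g of eq. (3.3)
        self.classifier = nn.Linear(d_model, num_classes)   # eq. (3.5)

    def forward(self, vis_fused: torch.Tensor, sem_fused: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([vis_fused, sem_fused], dim=-1)))  # eq. (3.3)
        fused = g * vis_fused + (1.0 - g) * sem_fused                            # eq. (3.4)
        return self.classifier(fused)                                            # eq. (3.5)

def fusion_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Eq. (3.6): cross-entropy on the fused prediction."""
    return F.cross_entropy(logits.flatten(0, 1), targets.flatten())
```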
Further, the prediction module recognizes text from the enhanced visual features and the enhanced semantic features to obtain the final text recognition result.
For the enhanced visual features and the enhanced semantic features, a fully connected layer produces the overall prediction result, and a third cross-entropy loss is used for supervised training.
Further, the training process of the trained deep learning model comprises:
constructing a training set; the training set is a natural scene image of a known text recognition result;
inputting the training set into the deep learning model and training the model, stopping when the total loss function reaches its minimum or the set number of iterations is reached, to obtain the trained deep learning model;
wherein the total loss function is a summation result of the first, second and third cross-entropy loss functions.
Wherein the total loss function is:

L = λ_1 L_v + λ_2 L_s + λ_3 L_f ;(3.7)

The objective function consists of three parts, where λ_1, λ_2 and λ_3 are balancing hyper-parameters, L_v is the loss of the visual part, L_s is the loss of the semantic part, and L_f is the fusion loss. All of the text recognition loss functions are cross-entropy losses.
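A trivial sketch of equation (3.7); the default weights are placeholders, since the patent does not state their values:

```python
def total_loss(loss_visual, loss_semantic, loss_fusion,
               lambda_v: float = 1.0, lambda_s: float = 1.0, lambda_f: float = 1.0):
    """Eq. (3.7): weighted sum of the three cross-entropy losses (weights are placeholders)."""
    return lambda_v * loss_visual + lambda_s * loss_semantic + lambda_f * loss_fusion
```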
Further, the constructing a training set includes:
acquiring a natural scene image;
carrying out augmentation processing on the natural scene image;
and carrying out size normalization processing on the natural scene image after the augmentation processing.
For example, the augmented natural scene images are size-normalized because text images in natural scenes come in many shapes and aspect ratios; to obtain a model with better generalization, the width and height of each text image are set to a preset width and a preset height.
Illustratively, the natural scene images are augmented because rotation, perspective distortion, blurring and similar problems are widespread in natural scenes; this embodiment applies, with a preset probability, augmentations such as randomly adding Gaussian noise, rotation and perspective distortion to the original image.
For the real text label of each image, a character dictionary containing all English characters, the digits 0-9 and an end token is used, and since the task is sequence classification, this embodiment uses a preset length to specify the maximum length of the text.
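For illustration, a possible preprocessing pipeline using torchvision is sketched below. The target size, probabilities and augmentation magnitudes are placeholder values chosen for the example, not figures taken from the patent.

```python
import torch
from torchvision import transforms

class AddGaussianNoise:
    """Adds Gaussian noise to a tensor image (simple custom transform)."""
    def __init__(self, std: float = 0.05):
        self.std = std

    def __call__(self, img: torch.Tensor) -> torch.Tensor:
        return (img + torch.randn_like(img) * self.std).clamp(0.0, 1.0)

# Illustrative preprocessing: random rotation, perspective distortion, Gaussian noise,
# then size normalization to a fixed height and width.
train_transform = transforms.Compose([
    transforms.RandomApply([transforms.RandomRotation(degrees=10)], p=0.5),
    transforms.RandomApply([transforms.RandomPerspective(distortion_scale=0.3, p=1.0)], p=0.5),
    transforms.ToTensor(),
    transforms.RandomApply([AddGaussianNoise(0.05)], p=0.5),
    transforms.Resize((32, 128)),
])
```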
Example two
The embodiment provides a text recognition system for natural scene images;
a system for text recognition of images of natural scenes, comprising:
an acquisition module configured to: acquire a natural scene image to be recognized;
a recognition module configured to: perform text recognition on the natural scene image to be recognized with a trained deep learning model to obtain the recognized text;
the deep learning model first rectifies the natural scene image to be recognized and then extracts a feature vector from the rectified image; visual features and semantic features are then extracted separately from the feature vector of the image, the two kinds of features are fused, and text recognition is finally performed on the fused features.
It should be noted here that the acquisition module and the recognition module correspond to steps S101 to S102 of the first embodiment; the modules cover the same examples and application scenarios as the corresponding steps, but are not limited to what is disclosed in the first embodiment. It should also be noted that the modules described above, as parts of a system, may be implemented in a computer system such as a set of computer-executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The proposed system can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the above-described modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules may be combined or integrated into another system, or some features may be omitted, or not executed.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (3)

1. The text recognition method of the natural scene image is characterized by comprising the following steps:
acquiring a natural scene image to be recognized;
performing text recognition on the natural scene image to be recognized with a trained deep learning model to obtain the recognized text;
the deep learning model first rectifies the natural scene image to be recognized and then extracts a feature vector from the rectified image; visual features and semantic features are then extracted separately from the feature vector of the image, the two kinds of features are fused, and text recognition is finally performed on the fused features;
the deep learning model comprises: the correction module is connected with the input end of a backbone network, and the output end of the backbone network is respectively connected with the input end of the visual feature extraction module and the input end of the semantic feature extraction module; the output end of the visual characteristic extraction module and the output end of the semantic characteristic extraction module are both connected with the input end of the visual semantic characteristic fusion module, the output end of the visual semantic characteristic fusion module is connected with the input end of the prediction module, and the output end of the prediction module outputs a text recognition result;
the backbone network is realized by adopting a ResNet convolution neural network and is used for extracting spatial features of a natural scene image, position information embedding is carried out on the spatial features by adopting position coding, and feature enhancement is carried out on the spatial features after the position information embedding by adopting a first Transformer neural network to obtain feature vectors of the image;
the visual feature extraction module is used for separating a visual part and a semantic part of the image feature vector by adopting a second transform neural network, and decoding the visual part by adopting a position attention mechanism module to obtain visual features; the position attention mechanism module is used for inquiring self-attention mechanism self-attention
Figure 804742DEST_PATH_IMAGE001
Key, key
Figure 268084DEST_PATH_IMAGE002
Sum value
Figure 81319DEST_PATH_IMAGE003
Replacing with different elements; wherein the query
Figure 48138DEST_PATH_IMAGE001
Is replaced by position code, key
Figure 288627DEST_PATH_IMAGE002
Is replaced by the output value, value of the UNet network
Figure 973686DEST_PATH_IMAGE004
Is replaced by
Figure 590612DEST_PATH_IMAGE005
Identity mapping of (2);
Figure 677517DEST_PATH_IMAGE006
;(1.4)
wherein the content of the first and second substances,
Figure 88906DEST_PATH_IMAGE007
is a position code, and the position code is,
Figure 261262DEST_PATH_IMAGE008
it is referred to as an input image,
Figure 416300DEST_PATH_IMAGE009
refer to a residual neural network that is,
Figure 354781DEST_PATH_IMAGE010
refers to a Transformer network;
Figure 937072DEST_PATH_IMAGE011
;(1.5)
wherein, in the formula (1.5)
Figure 596724DEST_PATH_IMAGE012
Are all obtained from the formula (1.4)
Figure 289873DEST_PATH_IMAGE012
The identity of the image to be scanned is mapped,
Figure 85791DEST_PATH_IMAGE013
is the dimension of the feature that is,
Figure 104562DEST_PATH_IMAGE013
is a hyper-parameter which is the parameter,
Figure 985931DEST_PATH_IMAGE014
refer to
Figure 748350DEST_PATH_IMAGE014
A function;
Figure 398775DEST_PATH_IMAGE015
(1.7)
wherein the content of the first and second substances,
Figure 322868DEST_PATH_IMAGE016
is a characteristic obtained by the formula (1.5),
Figure 957112DEST_PATH_IMAGE017
is a network of UNet's, and,
Figure 257643DEST_PATH_IMAGE018
is the resulting bond;
the semantic feature extraction module adopts an attention mechanism module to align the feature vectors of the images, and then decodes the aligned data to obtain semantic features; wherein, the alignment processing refers to aligning the two-dimensional features into one-dimensional features; the aligned data is decoded and realized by adopting a long-term and short-term memory network.
2. The method for recognizing text in a natural scene image according to claim 1, wherein the visual-semantic feature fusion module is configured to let the visual features and the semantic features interact to obtain enhanced visual features and enhanced semantic features, and to fuse the enhanced visual features and the enhanced semantic features with a gate mechanism decision.
3. The text recognition system for a natural scene image using the text recognition method for a natural scene image according to claim 1, comprising:
an acquisition module configured to: acquire a natural scene image to be recognized;
a recognition module configured to: perform text recognition on the natural scene image to be recognized with a trained deep learning model to obtain the recognized text;
the deep learning model first rectifies the natural scene image to be recognized and then extracts a feature vector from the rectified image; visual features and semantic features are then extracted separately from the feature vector of the image, the two kinds of features are fused, and text recognition is finally performed on the fused features.
CN202210483188.4A 2022-05-06 2022-05-06 Text recognition method and system for natural scene image Active CN114581906B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210483188.4A CN114581906B (en) 2022-05-06 2022-05-06 Text recognition method and system for natural scene image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210483188.4A CN114581906B (en) 2022-05-06 2022-05-06 Text recognition method and system for natural scene image

Publications (2)

Publication Number Publication Date
CN114581906A CN114581906A (en) 2022-06-03
CN114581906B true CN114581906B (en) 2022-08-05

Family

ID=81784282

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210483188.4A Active CN114581906B (en) 2022-05-06 2022-05-06 Text recognition method and system for natural scene image

Country Status (1)

Country Link
CN (1) CN114581906B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117912005A (en) * 2024-03-19 2024-04-19 中国科学技术大学 Text recognition method, system, device and medium using single mark decoding

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110033008A (en) * 2019-04-29 2019-07-19 同济大学 A kind of iamge description generation method concluded based on modal transformation and text
CN114219990A (en) * 2021-11-30 2022-03-22 南京信息工程大学 Natural scene text recognition method based on representation batch normalization
CN114299510A (en) * 2022-03-08 2022-04-08 山东山大鸥玛软件股份有限公司 Handwritten English line recognition system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111767379B (en) * 2020-06-29 2023-06-27 北京百度网讯科技有限公司 Image question-answering method, device, equipment and storage medium
CN112733768B (en) * 2021-01-15 2022-09-09 中国科学技术大学 Natural scene text recognition method and device based on bidirectional characteristic language model
CN113343707B (en) * 2021-06-04 2022-04-08 北京邮电大学 Scene text recognition method based on robustness characterization learning
CN113657399B (en) * 2021-08-18 2022-09-27 北京百度网讯科技有限公司 Training method of character recognition model, character recognition method and device
CN114255456A (en) * 2021-11-23 2022-03-29 金陵科技学院 Natural scene text detection method and system based on attention mechanism feature fusion and enhancement

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110033008A (en) * 2019-04-29 2019-07-19 同济大学 A kind of iamge description generation method concluded based on modal transformation and text
CN114219990A (en) * 2021-11-30 2022-03-22 南京信息工程大学 Natural scene text recognition method based on representation batch normalization
CN114299510A (en) * 2022-03-08 2022-04-08 山东山大鸥玛软件股份有限公司 Handwritten English line recognition system

Also Published As

Publication number Publication date
CN114581906A (en) 2022-06-03

Similar Documents

Publication Publication Date Title
Liu et al. Synthetically supervised feature learning for scene text recognition
CN113343707B (en) Scene text recognition method based on robustness characterization learning
CN111027562B (en) Optical character recognition method based on multiscale CNN and RNN combined with attention mechanism
EP3772036A1 (en) Detection of near-duplicate image
CN111428718A (en) Natural scene text recognition method based on image enhancement
CN113591546A (en) Semantic enhanced scene text recognition method and device
CN111738169B (en) Handwriting formula recognition method based on end-to-end network model
CN115471851A (en) Burma language image text recognition method and device fused with double attention mechanism
CN113408535B (en) OCR error correction method based on Chinese character level features and language model
CN111444367A (en) Image title generation method based on global and local attention mechanism
Elagouni et al. Text recognition in multimedia documents: a study of two neural-based ocrs using and avoiding character segmentation
CN114092930B (en) Character recognition method and system
CN116304984A (en) Multi-modal intention recognition method and system based on contrast learning
US20220164533A1 (en) Optical character recognition using a combination of neural network models
CN114581906B (en) Text recognition method and system for natural scene image
CN115761757A (en) Multi-mode text page classification method based on decoupling feature guidance
CN112149644A (en) Two-dimensional attention mechanism text recognition method based on global feature guidance
Nikitha et al. Handwritten text recognition using deep learning
Tayyab et al. Recognition of visual arabic scripting news ticker from broadcast stream
CN112686219B (en) Handwritten text recognition method and computer storage medium
CN114581920A (en) Molecular image identification method for double-branch multi-level characteristic decoding
CN112084788A (en) Automatic marking method and system for implicit emotional tendency of image captions
CN113837231B (en) Image description method based on data enhancement of mixed sample and label
CN116152824A (en) Invoice information extraction method and system
CN114495076A (en) Character and image recognition method with multiple reading directions

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant