CN116052154B - Scene text recognition method based on semantic enhancement and graph reasoning - Google Patents

Scene text recognition method based on semantic enhancement and graph reasoning Download PDF

Info

Publication number
CN116052154B
CN116052154B CN202310341392.7A CN202310341392A CN116052154B CN 116052154 B CN116052154 B CN 116052154B CN 202310341392 A CN202310341392 A CN 202310341392A CN 116052154 B CN116052154 B CN 116052154B
Authority
CN
China
Prior art keywords
text
recognition
semantic
visual
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310341392.7A
Other languages
Chinese (zh)
Other versions
CN116052154A (en
Inventor
郑金志
张立波
武延军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Nanjing Software Technology Research Institute
Institute of Software of CAS
Original Assignee
Zhongke Nanjing Software Technology Research Institute
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongke Nanjing Software Technology Research Institute, Institute of Software of CAS filed Critical Zhongke Nanjing Software Technology Research Institute
Priority to CN202310341392.7A priority Critical patent/CN116052154B/en
Publication of CN116052154A publication Critical patent/CN116052154A/en
Application granted granted Critical
Publication of CN116052154B publication Critical patent/CN116052154B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/63Scene text, e.g. street names
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/18Extraction of features or characteristics of the image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1918Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a scene text recognition method based on semantic enhancement and graph reasoning, which relates to the technical field of machine vision and natural language, and comprises a vision recognition branch and an iteration correction branch; the visual recognition branch extracts visual characteristics of the scene text by a convolution network, and performs parallel coding and decoding on the visual characteristics; the iterative correction branch corrects the current recognition result by setting a semantic enhancement module, a fusion door and an reasoning module, wherein the semantic enhancement module enhances text semantic features by utilizing a character context in the text, and the recognition precision is improved; the fusion door comprehensively considers visual and semantic information by fusing text semantic features and text visual features of the recognition module; the graph reasoning module establishes an association relation between text characters, performs reasoning according to the association relation, corrects text characters with unobvious characteristics, and further improves the recognition accuracy of scene texts. The method and the device improve the recognition accuracy of the network to the scene text.

Description

Scene text recognition method based on semantic enhancement and graph reasoning
Technical Field
The invention relates to the technical field of machine vision and natural language processing, in particular to a scene text recognition method based on semantic enhancement and graph reasoning.
Background
The characters are one of the biggest inventions of human beings and are vectors of human civilization. More than 70% of the information obtained by humans from the outside comes from the visual system. Vision and text are the most common ways for humans to record and express cognition to the world. How scene text is identified from a visual image is critical to properly understanding the image. Scene text recognition is one of research hotspots for visual understanding, and has wide application prospects in computer vision, such as automatic driving, blind person assistance, intelligent transportation, visual text questions and answers and the like. Therefore, scene text recognition has received a great deal of attention.
The approach of scene text recognition has made significant progress over the past decades, but still faces significant challenges. The text in the scene text image has the characteristics of text diversity and background complexity. For example, the scene text is complex and changeable in terms of font size, color, shape, aspect ratio and the like, and the image of the scene text has the problems of shielding, distortion, blurring, uneven illumination, low text resolution, strong text background interference and the like. In order to solve the problems and further improve the recognition accuracy of the scene text, a large number of scholars continuously propose new scene text recognition algorithms.
Before the neural network is widely used, the traditional text recognition method requires a designer to put efforts on the design of text features, the advantages and disadvantages of the text features depend on the experience of the designer, and the advantages and disadvantages of the text features are important to the recognition effect of the text, so that the text recognition accuracy of the method is lower, and the robustness is poor. The wide application of neural networks enables designers to put more effort on model design, greatly improving the accuracy of text recognition. The scene text recognition method based on the neural network regards the recognition task as a sequence generation task, generally adopts a convolution network to extract visual characteristics, and then decodes the visual characteristics by using a language model to generate text recognition content. Two classes can be distinguished, based on recurrent neural networks (Recurrent Neural Network, RNN) and on transformer, according to the type of decoder. The recurrent neural network-based method decodes text from visual features in a recursive manner, which cannot be decoded in parallel in both training and forward reasoning stages, and thus the decoding speed is slow. In addition, RNN was originally proposed for processing one-dimensional feature information, and it is not suitable to directly process two-dimensional information of text in a scene text image, and this method has low recognition accuracy for the scene text. The method based on the transformer can train in parallel, so that training efficiency of the model is improved, but the forward reasoning stage cannot decode in parallel. The current neural network-based method has less exploration of semantic information in scene texts, and when the scene text images have the conditions of low resolution, poor illumination quality, random and changeable scene text shapes and the like, the problems of low scene text recognition precision, low recognition speed and the like are caused.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a scene text recognition method based on semantic enhancement and graph reasoning, which is used for improving the accuracy of the original scene text recognition method on scene text recognition.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a scene text recognition method based on semantic enhancement and graph reasoning comprises the following steps:
s1: acquiring a scene text image dataset for identifying network training;
s2: preprocessing the acquired images in the scene text image dataset;
s3: inputting the preprocessed image into a visual recognition branch of a recognition network, extracting text visual features in the image by the visual recognition branch through an internally arranged convolution network, and then completing visual recognition of the image based on the text visual features;
s4: inputting the output result of the visual recognition branch into an iterative correction branch of the recognition network, performing iterative correction on the visual recognition result by using ideas based on semantic enhancement and graph reasoning, and outputting a final recognition result to complete the training of the recognition network;
s5: the trained recognition network is used for recognizing the scene text.
In order to optimize the technical scheme, the specific measures adopted further comprise:
further, in step S2, the specific content of the preprocessing includes:
the image is randomly rotated, randomly sheared, randomly adjusted for brightness, randomly adjusted for contrast, randomly adjusted for size, randomly adjusted for saturation, randomly adjusted for gray value.
Further, in step S3, the visual recognition branch extracts text visual features in the image through the internally-arranged convolution network, and then completes visual recognition of the image based on the text visual features as follows:
s31: inputting an image to be identified into a visual identification branch, and extracting text visual characteristics in the image by a convolution network; wherein the convolution network adoptsResnetConvolutional network, usingResnetConvolutional network extraction of text visual features in images
Figure SMS_1
The expression calculation formula is:
Figure SMS_2
in the method, in the process of the invention,
Figure SMS_3
representing an image; />
Figure SMS_4
Representing text visual feature extractionResnetA convolutional network; />
Figure SMS_5
,/>
Figure SMS_6
Respectively represent image +.>
Figure SMS_7
Is the height and width of (2); />
Figure SMS_8
A dimension representing a visual feature of the text;
s32: inputting characters in a text to be recognized in an image to a position embedding layer for position coding;
s33: from step S3, based on the position information encoded by the position embedding layer in step S321 decoding the visual recognition result of the text in the visual characteristics of the text extracted by the convolution network, specifically firstly adopting the following steps oftransformerThe location information is encoded with visual characteristics of the text,transformercoding to obtain the result
Figure SMS_9
The process expression calculation formula of (2) is as follows:
Figure SMS_10
in the method, in the process of the invention,
Figure SMS_11
is the position code of the character in the sequence, and +.>
Figure SMS_12
;/>
Figure SMS_13
To identify the length of a character sequence in a text; />
Figure SMS_14
,/>
Figure SMS_15
Represents oneU-NetA network; then>
Figure SMS_16
Performing linear transformationsoftmaxAnd activating function processing to finish the recognition of the visual recognition branch and outputting a recognition result.
Further, in step S4, the specific content of the iterative correction of the visual recognition result based on the concept of semantic enhancement and graph reasoning is:
s41: when the iterative correction is carried out for the first time, the visual recognition result in the step S33 is used as the current recognition result of the recognition network;
s42: inputting the current recognition result into an iterative correction branch embedding layer to obtain text embedding characteristics;
s43: the text embedded features in the step S42 are selected and respectively input to two semantic enhancement modules, and text semantic features are extracted by combining text semantic context information;
s44: encoding and decoding the text semantic features in the step S43 and the position encoding features in the step S32, and outputting the recognition result of the semantic enhancement module;
s45: inputting the recognition result of the semantic enhancement module in the step S44 and the visual recognition result in the step S33 into a fusion door for fusion so as to extract fusion text features;
s46, based on the fused text characteristics output in the step S45, building a correlation diagram among different characters belonging to the same text, completing identification by using the idea of diagram reasoning, and taking the identification result as a correction result of a current iteration correction branch;
s47, determining the number of times that the current correction result belongs to iterative processing in an iterative correction branch, and outputting the result of graph reasoning identification in the step S46 as a final identification result when the number of iterations reaches a preset iterative correction threshold number; when the iteration times do not reach the preset iteration correction threshold times, outputting a result of graph reasoning identification in the step S46 as a current identification result of the identification network, turning to the step S42, and entering the next iteration correction.
Further, the specific content of step S43 is:
s431: the text embedded features in the selection step S42 are respectively input into two semantic enhancement modules, wherein the data input by any semantic enhancement module is made to be
Figure SMS_17
The semantic enhancement module enhances the semantic expression process as follows:
Figure SMS_18
in the method, in the process of the invention,
Figure SMS_19
representing a convolution in the length direction of the text character sequence; />
Figure SMS_20
Representing an average pooling operation; />
Figure SMS_21
Representation ofReluActivating a function; />
Figure SMS_22
Representing a convolution pooling feature; />
Figure SMS_23
An activation expression representing a convolution feature;
s432: a corresponding semantic enhancement module obtains text semantic features after processing
Figure SMS_24
The calculation formula is as follows:
Figure SMS_25
in the method, in the process of the invention,
Figure SMS_26
expressed as an activation functionSigmod
Further, in step S44, the text semantic features in step S43 and the position coding features in step S32 are encoded and decoded, and the specific content of the recognition result of the output semantic enhancement module is:
s441: two corresponding text semantic features obtained after processing the two semantic enhancement modules
Figure SMS_27
Key of multi-head attention module as self-mask respectively +.>
Figure SMS_28
Sum->
Figure SMS_29
Simultaneously combining the position coding features output by the position embedding layer in the step S3
Figure SMS_30
Then input to the self-masking multi-head attention module for processing, and output to obtain semantic enhancement feature +.>
Figure SMS_31
The processing calculation formula is as follows:
Figure SMS_32
in the method, in the process of the invention,
Figure SMS_33
for a self-masking attention matrix, the elements on the diagonal of the matrix are minus infinity, while the elements on the off-diagonal are 0;
s442: semantic-based enhancement features
Figure SMS_34
Performing linear transformation andsoftmaxand activating function processing, and outputting a recognition result of the semantic enhancement module.
Further, in step S45, the extraction of the fused text feature
Figure SMS_35
The specific contents of (3) are as follows:
Figure SMS_36
in the method, in the process of the invention,
Figure SMS_37
is a super ginseng for training, and is a->
Figure SMS_38
Is->
Figure SMS_39
And->
Figure SMS_40
Is spliced by (a)>
Figure SMS_41
Further, in step S46, the specific contents are as follows:
s461: constructing undirected full-join graphs between different characters belonging to the same text
Figure SMS_42
Wherein->
Figure SMS_43
Nodes representing undirected full-connectivity graph, +.>
Figure SMS_44
Representing the edges of the undirected fully connected graph and treating the different characters in the text as nodes in the graph, i.e. node +.>
Figure SMS_45
Is taken from the value +.>
Figure SMS_46
While the relatedness between nodes is considered an edge; constructing a relation diagram of undirected full connection diagram>
Figure SMS_47
Expressed mathematically as:
Figure SMS_48
in the method, in the process of the invention,
Figure SMS_50
indicate->
Figure SMS_53
A plurality of nodes; />
Figure SMS_54
Indicate->
Figure SMS_51
A plurality of nodes; />
Figure SMS_52
,/>
Figure SMS_55
,/>
Figure SMS_56
And->
Figure SMS_49
Is a parameter for training;
s462: establishing residual links, performing graph reasoning by using a graph convolution network method, wherein the calculation process is as follows:
Figure SMS_57
in the method, in the process of the invention,
Figure SMS_59
representing character characteristic expression after graph reasoning; />
Figure SMS_63
And->
Figure SMS_65
The residual weight matrix and the picture convolution weight matrix are respectively used for training; />
Figure SMS_60
I.e. representing->
Figure SMS_61
Which belongs to a membership matrix; since the graph winding network is multi-layered, in the first layer +.>
Figure SMS_64
The value is equivalent to the fused text feature +.>
Figure SMS_66
In the subsequent layers +.>
Figure SMS_58
Is equivalent to the value of +.>
Figure SMS_62
A value;
s463: linear transformation of character features after graph reasoningsoftmaxAnd activating function processing, and outputting text recognition based on graph reasoning.
Further, the loss function of the entire identification network is:
Figure SMS_67
in the method, in the process of the invention,
Figure SMS_70
is a feature in the visual recognition branch +.>
Figure SMS_71
Cross entropy loss of (2); />
Figure SMS_82
、/>
Figure SMS_69
、/>
Figure SMS_78
Are respectively->
Figure SMS_72
Feature in sub-iterative correction branch>
Figure SMS_81
、/>
Figure SMS_74
、/>
Figure SMS_84
Cross entropy loss of (2); />
Figure SMS_68
、/>
Figure SMS_79
、/>
Figure SMS_73
、/>
Figure SMS_85
The feature ++in the visually identified branch or the iteratively revised branch, respectively>
Figure SMS_76
、/>
Figure SMS_83
、/>
Figure SMS_75
、/>
Figure SMS_80
Balance factor of->
Figure SMS_77
Representing the character length in the text.
The beneficial effects of the invention are as follows:
1. using position codingtransformerThe structure can be used for parallel encoding and decoding in training and forward reasoning stages, and has higher time efficiency.
2. The semantic enhancement module is designed to extract semantic features of the text more effectively, and accuracy based on visual feature recognition is improved.
3. By establishing a relation diagram among different characters in the same text and adopting the explicit modeling mode, the expression of text characteristics is enhanced, and the recognition accuracy of a recognition network model is improved.
4. The current recognition is subjected to semantic enhancement and graph reasoning correction in an iterative mode, so that the recognition accuracy of the network to the scene text is further improved.
Drawings
Fig. 1 is a flow chart of the overall technical scheme of the invention.
Fig. 2 is a schematic diagram of a visual recognition branching flow scheme in a recognition network according to the present invention.
FIG. 3 is a schematic diagram of an iterative correction branch flow scheme in an identification network of the present invention.
Fig. 4 is a schematic diagram of a system framework of the identification network of the present invention.
Fig. 5 is a schematic diagram of the connection structure of each part of the system for identifying a network according to the present invention.
FIG. 6 is a schematic diagram of the internal structure of the semantic enhancement module of the present invention.
FIG. 7 is a schematic diagram of the result case of the network identification experiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings.
The whole technical scheme of the application is as follows:
referring to fig. 1, the application proposes a scene text recognition method based on semantic enhancement and graph reasoning, which includes:
s1, acquiring a scene text image dataset for model (recognition network) training in a preset mode;
s2, preprocessing a scene text image (image for short) needing to be input with a model;
s3, inputting the preprocessed scene text image into a visual recognition branch of a recognition network, extracting text visual characteristics of the image by a convolution network, and completing visual recognition of the scene text image based on the text visual characteristics;
s4, inputting the result of the visual identification branch into an iterative correction branch, and carrying out iterative correction on the visual identification result by using ideas based on semantic enhancement and graph reasoning.
Further, the scene text image preprocessing in step S2 mainly includes:
image random rotation, image random clipping, random brightness adjustment, random contrast adjustment, random image size adjustment, random saturation adjustment, random image gray scale adjustment, and the like.
Further, referring to fig. 2, the specific operation of step S3 is:
s31, inputting a scene text image to be recognized into a visual recognition branch, and extracting text visual characteristics by a convolution network;
s32, carrying out position coding on a character input position embedding layer in the text to be identified;
s33, decoding a visual recognition result of the text from the visual characteristics of the text extracted from the convolution network in the step S31 according to the position characteristics of the position embedding layer codes in the step S32.
The specific operation of step S33 is as follows:
s331, extracting text visual characteristics of an image by a CNN visual module;
s332, the text visual characteristics belong to U-Net network coding keys Key;
s333, taking the output characteristics of the position coding layer as the weight of text visual characteristics generated by coding the query and the Key Key;
s334, coding the weight of the text visual feature and the visual feature Value to generate the text visual feature;
s335, performing linear transformation and activation processing of an activation function softmax on the text visual characteristics to finish recognition of the visual module and output a recognition result.
Further, referring to fig. 3, the specific operation of step S4 is:
s41, in the first iterative correction, taking the visual recognition result in the step S33 as the current recognition result of the model;
s42, inputting the current recognition result of the model into an iterative correction branch embedding layer to obtain text embedding characteristics;
s43, a text embedding feature input semantic enhancement module in the step S42 combines text semantic context information to extract text semantic features;
s44, encoding and decoding the text semantic features in the step S43 and the position encoding features in the step S32, and outputting the recognition result of the semantic enhancement module;
s45, inputting the visual recognition result in the step S33 and the semantic enhancement recognition result in the step S44 into a fusion door for fusion, and extracting fusion text features;
s46, based on the fused text characteristics output in the step S45, establishing a correlation diagram among different characters belonging to the same text, and completing identification by using the idea of diagram reasoning;
s47, outputting the result of the graph reasoning identification in the step S46 as final identification when the iteration times reach a preset iteration correction threshold;
and S48, when the iteration times do not reach the preset iteration correction threshold, outputting a result of graph reasoning identification in the step S46 as a current identification result input of the model, turning to the step S42, and entering the next iteration correction.
The specific operation of step S43 is as follows:
s431, carrying out iterative correction on the encoding of the embedded layer on the currently identified text to obtain text embedded features;
s432, enhancing text semantic features from the current text context according to the text embedded features by the SE semantic enhancement module shown in FIG. 6.
The specific operation of step S44 is as follows:
s441, decoding the output characteristics of the position coding layer and the text semantic characteristics enhanced by the semantic enhancement module by Self-masking multi-head attention by Self-masking of Self-masked Multihead Attention to obtain text decoding characteristics based on multi-head attention;
s442, performing linear transformation and activation processing of an activation function softmax on the text decoding characteristics based on the multi-head attention, and completing recognition of the semantic enhancement module (Semantic enhancement Module).
The specific operation of step S46 is as follows:
s461, regarding different features of the same text as nodes in the graph, and calculating feature similarity among the nodes, namely, the edges of the link nodes;
s462, carrying out relation reasoning of text features according to the node relation graph constructed in the S461 to obtain feature expression after the graph reasoning;
s463, performing linear transformation and activation processing of an activation function softmax on the text features after the graph enhancement, and outputting text recognition based on graph reasoning.
Further, an iteration correction branch is set in step S42, and a semantic enhancement module, a fusion gate and an inference module are set in the correction branch.
According to the method, the model is trained in an end-to-end mode by using a supervised learning mode, scene text recognition in a visual recognition branch, semantic enhancement recognition in an iterative correction branch, fusion gate recognition and graph reasoning recognition are combined in the training process, and a formed multi-objective loss function can be expressed as follows:
Figure SMS_86
in the method, in the process of the invention,
Figure SMS_90
is a feature in the visual recognition branch +.>
Figure SMS_93
Cross entropy loss of (2); />
Figure SMS_97
、/>
Figure SMS_88
、/>
Figure SMS_98
Are respectively->
Figure SMS_92
Feature in sub-iterative correction branch>
Figure SMS_99
、/>
Figure SMS_89
、/>
Figure SMS_102
Cross entropy loss of (2); />
Figure SMS_95
、/>
Figure SMS_101
、/>
Figure SMS_94
、/>
Figure SMS_104
The feature ++in the visually identified branch or the iteratively revised branch, respectively>
Figure SMS_91
、/>
Figure SMS_103
、/>
Figure SMS_87
、/>
Figure SMS_100
Balance factor of->
Figure SMS_96
Representing the character length in the text.
Meanwhile, referring to fig. 4, the application further provides a scene text recognition system based on semantic enhancement and graph reasoning, which comprises: a visual recognition branch and an iterative correction branch; the iterative correction branch comprises a semantic enhancement module, a fusion gate and a graph reasoning module.
Further, visually identifying branches, convolutionally extracting text visual features of the scene, and identifying scene text based on the text visual features; and iterating and correcting branches, correcting the recognition result of the current scene text in a semantic enhancement and graph reasoning mode, and finally outputting the recognition result of the model on the scene text.
Further, visually identifying the branch includes:
the convolution network extracts text visual characteristics of a scene;
a position embedding layer for encoding character position information in the scene text;
and then decoding a visual recognition result based on the visual characteristics and the position embedded coding information of the scene text.
Further, the iterative correction branch includes:
the semantic enhancement module is used for encoding the current recognition result of the model by the iterative correction branch embedding layer, inputting the encoded characteristics into the semantic enhancement module, enhancing the semantic expression of the text characteristics, and decoding the semantic expression of the text characteristics and the recognition result of the semantic enhancement module by the input self-masking multi-head attention module of the position embedding layer;
the fusion door fuses the recognition features of the semantic enhancement module and the recognition features of the visual recognition branches so as to comprehensively consider the visual features and the semantic features of the text;
the graph reasoning module is used for establishing a correlation graph between different characters in the same text based on the output of the fusion gate, and recognizing the scene text as a correction result of the current iteration correction branch based on the correlation graph reasoning;
when the iteration times reach a preset iteration correction threshold value, taking a graph reasoning recognition result as a final recognition; when the iteration times do not reach the preset iteration correction threshold, the result of graph reasoning recognition is input as the current recognition result of the model, and the next iteration correction is carried out. In the first iterative correction, the current recognition result of the model comes from the visual recognition branch.
The following is described in further detail:
referring to fig. 5, the network structure of the present invention is composed of two parts, a visual recognition branch and an iterative correction branch.
The operation of visually identifying the branch includes:
and (one), the convolution network extracts the text visual characteristics of the scene. The scene text image is set as
Figure SMS_105
Using convolutional networksResnetAs a network of text visual feature extraction modules, the text visual features of an extracted scene can be expressed as:
Figure SMS_106
in the method, in the process of the invention,
Figure SMS_107
representing an image; />
Figure SMS_108
Representing text visual feature extractionResnetA convolutional network; />
Figure SMS_109
,/>
Figure SMS_110
Respectively represent image +.>
Figure SMS_111
Is the height and width of (2); />
Figure SMS_112
Representing the dimensions of the visual features of the text.
And (II) the position embedding layer encodes character position information in the scene text.
And thirdly, decoding a visual recognition result based on the text visual characteristics and the position embedded coding information. UsingtransformerThe location information is encoded with visual characteristics of the text,transformerthe coding process of a cell can be formulated as:
Figure SMS_113
in the method, in the process of the invention,
Figure SMS_114
is the position code of the character in the sequence, and +.>
Figure SMS_115
;/>
Figure SMS_116
To identify the length of a character sequence in a text; />
Figure SMS_117
,/>
Figure SMS_118
Represents oneU-NetA network. Character recognition results of visual branches: />
Figure SMS_119
In the method, in the process of the invention,
Figure SMS_120
,/>
Figure SMS_121
is a trainable super-ginseng.
The operation of iteratively correcting the branch includes:
the first step, a semantic enhancement module, referring to fig. 6, encodes the current recognition result of the model by the iterative correction branch embedding layer, the encoded features are input into the semantic enhancement module, the semantic expression of the text features is enhanced, the semantic expression of the text features and the position encoding information of the position embedding layer are input into the self-masking multi-head attention module, and the recognition result of the semantic enhancement module is decoded. Assume that the input is characterized by
Figure SMS_122
The formula can be derived:
Figure SMS_123
in the method, in the process of the invention,
Figure SMS_124
representing a convolution in the length direction of the text character sequence; />
Figure SMS_125
Representing an average pooling operation; />
Figure SMS_126
Representation ofReluActivating a function; />
Figure SMS_127
Representing a convolution pooling feature; />
Figure SMS_128
An activation expression representing a convolution feature;
where semantically enhanced features (text semantic features) can be formulated as:
Figure SMS_129
in the method, in the process of the invention,
Figure SMS_130
expressed as an activation functionSigmod. Referring to FIG. 5, since there are two semantic enhancement modules, two corresponding +.>
Figure SMS_131
At the same time as self-masking attentiontransformerKey->
Figure SMS_132
Value->
Figure SMS_133
And inquire->
Figure SMS_134
The value is the code from the position in the character sequence. The text features after semantic enhancement and the input of the position embedding layer are masked from the multi-head attention, and the formula is expressed as follows:
Figure SMS_135
in the method, in the process of the invention,
Figure SMS_136
for a self-masking attention matrix, the elements on the diagonal of the matrix are minus infinity, while the elements on the off-diagonal are 0; the function is to prevent leakage of own information during bi-directional encoding. The recognition result based on the semantic enhancement features can be expressed as:
Figure SMS_137
in the method, in the process of the invention,
Figure SMS_138
is activatedRecognition result(s)>
Figure SMS_139
,/>
Figure SMS_140
Is a trainable super-ginseng.
Secondly, fusing the recognition features of the semantic enhancement module and the recognition features of the visual recognition branches to comprehensively consider the visual features and the text semantic features of the text; the process of fusing features can be formulated as:
Figure SMS_141
in the method, in the process of the invention,
Figure SMS_142
is a super ginseng for training, and is a->
Figure SMS_143
Is->
Figure SMS_144
And->
Figure SMS_145
Is spliced by (a)>
Figure SMS_146
Thirdly, a graph reasoning module establishes a correlation graph between different characters in the same text based on the output of the fusion gate, and based on the correlation graph, the scene text is recognized in a reasoning mode as a correction result of the current iteration correction branch;
building undirected full join graphs between text characters
Figure SMS_147
. In (1) the->
Figure SMS_148
Nodes representing undirected full-connectivity graph, +.>
Figure SMS_149
Representing the edges of the undirected fully connected graph and treating the different characters in the text as nodes in the graph, i.e. node +.>
Figure SMS_150
Is taken from the value +.>
Figure SMS_151
While the relatedness between nodes is considered an edge; constructing a relation diagram of undirected full connection diagram>
Figure SMS_152
Expressed mathematically as:
Figure SMS_153
in the method, in the process of the invention,
Figure SMS_155
indicate->
Figure SMS_157
A plurality of nodes; />
Figure SMS_159
Indicate->
Figure SMS_156
A plurality of nodes; />
Figure SMS_158
,/>
Figure SMS_160
,/>
Figure SMS_161
And->
Figure SMS_154
Is a parameter for training;
establishing residual links, performing graph reasoning by using a graph convolution network method, wherein the calculation process is as follows:
Figure SMS_162
in the method, in the process of the invention,
Figure SMS_163
representing character characteristic expression after graph reasoning; />
Figure SMS_164
And->
Figure SMS_165
The residual weight matrix and the picture convolution weight matrix are respectively used for training; then, the character characteristics after graph reasoning are subjected to linear transformationsoftmaxActivating function processing, and outputting text recognition based on graph reasoning:
Figure SMS_166
when the iteration times reach a preset iteration correction threshold value, taking a graph reasoning recognition result as a final recognition; when the iteration times do not reach the preset iteration correction threshold, the result of graph reasoning recognition is input as the current recognition result of the model, and the next iteration correction is carried out. The first iterative correction is that the current recognition result of the model comes from the visual recognition branch.
As shown in fig. 7, the text under each picture example, the first is a real text label, the second is the recognition result of the ABINet model, and the third is the recognition result of the implementation example herein; it can be seen from the figure that the invention has better text recognition performance than the ABINet model.
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to the invention without departing from the principles thereof are intended to be within the scope of the invention as set forth in the following claims.

Claims (7)

1. A scene text recognition method based on semantic enhancement and graph reasoning is characterized by comprising the following steps:
s1: acquiring a scene text image dataset for identifying network training;
s2: preprocessing the acquired images in the scene text image dataset;
s3: inputting the preprocessed image into a visual recognition branch of a recognition network, extracting text visual features in the image by the visual recognition branch through an internally arranged convolution network, and then completing visual recognition of the image based on the text visual features; the specific contents are as follows:
s31: inputting an image to be identified into a visual identification branch, and extracting text visual characteristics in the image by a convolution network; wherein, the convolution network adopts a Resnet convolution network, and the Resnet convolution network is used for extracting text visual characteristics R in the image v The expression formula is:
Figure FDA0004230805090000011
wherein I represents an image; r is R net () Representing a Resnet convolutional network for text visual feature extraction; h, W represent the height and width of the image I, respectively; c represents the dimension of the visual feature of the text;
s32: inputting characters in a text to be recognized in an image to a position embedding layer for position coding;
s33: decoding a visual recognition result of a text from the text visual features extracted from the convolutional network in the step S31 according to the position information encoded by the position embedding layer in the step S32, specifically, encoding the position information and the text visual features by using a transducer, and obtaining a result g by using the transducer to encode v The' process expression calculation formula is:
Figure FDA0004230805090000012
wherein Q is the position code of the character in the sequence, and Q epsilon T x C; t is the length of a character sequence in the identification text; k' =u (R v '), U () represents a U-Net network; and then data result g v ' performing linear transformation and softmax activation function processing to complete the recognition of the visual recognition branch and output a recognition result;
s4: inputting the output result of the visual recognition branch into an iterative correction branch of the recognition network, performing iterative correction on the visual recognition result by using ideas based on semantic enhancement and graph reasoning, and outputting a final recognition result to complete the training of the recognition network; the method comprises the following steps of:
s41: when the iterative correction is carried out for the first time, the visual recognition result in the step S33 is used as the current recognition result of the recognition network;
s42: inputting the current recognition result into an iterative correction branch embedding layer to obtain text embedding characteristics;
s43: the text embedded features in the step S42 are selected and respectively input to two semantic enhancement modules, and text semantic features are extracted by combining text semantic context information;
s44: encoding and decoding the text semantic features in the step S43 and the position encoding features in the step S32, and outputting the recognition result of the semantic enhancement module;
s45: inputting the recognition result of the semantic enhancement module in the step S44 and the visual recognition result in the step S33 into a fusion door for fusion so as to extract fusion text features;
s46, based on the fused text characteristics output in the step S45, building a correlation diagram among different characters belonging to the same text, completing identification by using the idea of diagram reasoning, and taking the identification result as a correction result of a current iteration correction branch;
s47, determining the number of times that the current correction result belongs to iterative processing in an iterative correction branch, and outputting the result of graph reasoning identification in the step S46 as a final identification result when the number of iterations reaches a preset iterative correction threshold number; when the iteration times do not reach the preset iteration correction threshold times, outputting a result of the graph reasoning identification in the step S46 as a current identification result of the identification network, turning to the step S42, and entering the next iteration correction;
s5: the trained recognition network is used for recognizing the scene text.
2. The scene text recognition method based on semantic enhancement and graph reasoning according to claim 1, wherein in step S2, the specific content of the preprocessing includes:
the image is randomly rotated, randomly sheared, randomly adjusted for brightness, randomly adjusted for contrast, randomly adjusted for size, randomly adjusted for saturation, randomly adjusted for gray value.
3. The scene text recognition method based on semantic enhancement and graph reasoning according to claim 1, wherein the specific content of step S43 is:
s431: the text embedded features in the selection step S42 are respectively input into two semantic enhancement modules, wherein the data input by any semantic enhancement module is made to be F '' r The semantic enhancement module enhances the semantic expression process as follows:
Figure FDA0004230805090000021
where CNN () represents a convolution in the length direction of a text character sequence; pool () represents an average pooling operation; delta () represents the Relu activation function; f'. c Representing a convolution pooling feature; e (E) δ An activation expression representing a convolution feature;
s432: a corresponding semantic enhancement module is processed to obtain text semantic features F' c The calculation formula is as follows:
F′ e =F r ′*σ(CNN 2 (E δ ))+F r
in the formula, σ () is expressed as an activation function Sigmod.
4. The scene text recognition method based on semantic enhancement and graph reasoning according to claim 3, wherein in step S44, the text semantic features in step S43 and the position coding features in step S32 are encoded and decoded, and the specific content of the recognition result of the output semantic enhancement module is:
s441: two corresponding text semantic features F 'obtained after processing the two semantic enhancement modules' e The key K and the value V respectively used as the self-masking multi-head attention module are combined with the position coding feature Q output by the position embedding layer in the step S3, and then are input into the self-masking multi-head attention module together for processing, and the semantic enhancement feature g 'is obtained through output' L The processing calculation formula is as follows:
Figure FDA0004230805090000031
where M is a self-masking attention matrix with elements on diagonal of minus infinity and elements on non-diagonal of 0;
s442: based on semantic enhancement feature g' L And performing linear transformation and softmax activation function processing on the data, and outputting a recognition result of the semantic enhancement module.
5. The scene text recognition method based on semantic enhancement and graph reasoning according to claim 4, wherein in step S45, the specific content of the extracted fusion text feature F is:
Figure FDA0004230805090000032
in which W is f ∈R 2C×C Is a super-ginseng for training, [ g ]' v ,g′ L ]Is g' v And g' L W is equal to g ′∈R T×C
6. The scene text recognition method based on semantic enhancement and graph reasoning according to claim 5, wherein in step S46, the specific contents are:
s461: constructing an undirected full-connection graph G= (V ', E) among different characters belonging to the same text, wherein V ' represents a node of the undirected full-connection graph, E represents an edge of the undirected full-connection graph, and different characters in the text are regarded as nodes in the graph, namely the node V ' is taken from a value F, and the relativity between the nodes is regarded as an edge; construction of relation diagram E (V) of undirected full connection diagram i ′,V j ') expressed mathematically as:
Figure FDA0004230805090000033
wherein V is i ' represents the ith node; v'. j Represents a j-th node;
Figure FDA0004230805090000034
φ(V′ j )=W′ φ V′ j ,/>
Figure FDA0004230805090000035
and->
Figure FDA0004230805090000036
Is a parameter for training;
s462: establishing residual links, performing graph reasoning by using a graph convolution network method, wherein the calculation process is as follows:
V g =W r (EV″W g )+V″
wherein V is g Representing character characteristic expression after graph reasoning; w (W) r And W is g The residual weight matrix and the picture convolution weight matrix are respectively used for training; e represents E (V) i ′,V j '), which belongs to a membership matrix; since the graph-rolling network is multi-layered, the value of V' at the first layer is equivalent to the fused text feature F, in the next fewThe value of V' in a layer is equal to V of the preceding layer g A value;
s463: and performing linear transformation and softmax activation function processing on character features after graph reasoning, and outputting text recognition based on the graph reasoning.
7. The scene text recognition method based on semantic enhancement and graph reasoning according to claim 6, wherein the loss function of the whole recognition network is:
Figure FDA0004230805090000041
wherein L' v Is the feature g 'in the visually identified branch' v Cross entropy loss of (2);
Figure FDA0004230805090000042
the ith iteration identifies feature g 'in the branch' L 、F、V g Cross entropy loss of (2); r's' v 、r′ L 、r′ f 、r′ g Corresponding to the feature g 'in the visual recognition branch or the iterative recognition branch, respectively' v 、g′ L 、F、V g N represents the character length in the text.
CN202310341392.7A 2023-04-03 2023-04-03 Scene text recognition method based on semantic enhancement and graph reasoning Active CN116052154B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310341392.7A CN116052154B (en) 2023-04-03 2023-04-03 Scene text recognition method based on semantic enhancement and graph reasoning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310341392.7A CN116052154B (en) 2023-04-03 2023-04-03 Scene text recognition method based on semantic enhancement and graph reasoning

Publications (2)

Publication Number Publication Date
CN116052154A CN116052154A (en) 2023-05-02
CN116052154B true CN116052154B (en) 2023-06-16

Family

ID=86118635

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310341392.7A Active CN116052154B (en) 2023-04-03 2023-04-03 Scene text recognition method based on semantic enhancement and graph reasoning

Country Status (1)

Country Link
CN (1) CN116052154B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117710986B (en) * 2024-02-01 2024-04-30 长威信息科技发展股份有限公司 Method and system for identifying interactive enhanced image text based on mask

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113033249A (en) * 2019-12-09 2021-06-25 中兴通讯股份有限公司 Character recognition method, device, terminal and computer storage medium thereof
CN112733768B (en) * 2021-01-15 2022-09-09 中国科学技术大学 Natural scene text recognition method and device based on bidirectional characteristic language model
CN114092930B (en) * 2022-01-07 2022-05-03 中科视语(北京)科技有限公司 Character recognition method and system

Also Published As

Publication number Publication date
CN116052154A (en) 2023-05-02

Similar Documents

Publication Publication Date Title
CN112232149B (en) Document multimode information and relation extraction method and system
CN108388900A (en) The video presentation method being combined based on multiple features fusion and space-time attention mechanism
CN112733768B (en) Natural scene text recognition method and device based on bidirectional characteristic language model
CN109919174A (en) A kind of character recognition method based on gate cascade attention mechanism
CN111783705A (en) Character recognition method and system based on attention mechanism
CN113284100B (en) Image quality evaluation method based on recovery image to mixed domain attention mechanism
CN116052154B (en) Scene text recognition method based on semantic enhancement and graph reasoning
CN115964467A (en) Visual situation fused rich semantic dialogue generation method
CN110033008A (en) A kind of iamge description generation method concluded based on modal transformation and text
CN113064968B (en) Social media emotion analysis method and system based on tensor fusion network
CN112651940B (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
CN107463928A (en) Word sequence error correction algorithm, system and its equipment based on OCR and two-way LSTM
CN110175248A (en) A kind of Research on face image retrieval and device encoded based on deep learning and Hash
CN112949608A (en) Pedestrian re-identification method based on twin semantic self-encoder and branch fusion
CN112528643A (en) Text information extraction method and device based on neural network
CN114462420A (en) False news detection method based on feature fusion model
CN109766918A (en) Conspicuousness object detecting method based on the fusion of multi-level contextual information
CN115205640A (en) Rumor detection-oriented multi-level image-text fusion method and system
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN116030537B (en) Three-dimensional human body posture estimation method based on multi-branch attention-seeking convolution
CN116258147A (en) Multimode comment emotion analysis method and system based on heterogram convolution
CN116311493A (en) Two-stage human-object interaction detection method based on coding and decoding architecture
Liu et al. Capsule embedded resnet for image classification
CN114821569A (en) Scene text recognition method and system based on attention mechanism
CN113722536A (en) Video description method based on bilinear adaptive feature interaction and target perception

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant