CN116052154B - Scene text recognition method based on semantic enhancement and graph reasoning - Google Patents
Scene text recognition method based on semantic enhancement and graph reasoning
- Publication number
- CN116052154B (application CN202310341392.7A)
- Authority
- CN
- China
- Prior art keywords
- text
- recognition
- semantic
- visual
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/62—Text, e.g. of license plates, overlay texts or captions on TV images
- G06V20/63—Scene text, e.g. street names
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/18—Extraction of features or characteristics of the image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/1918—Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Image Analysis (AREA)
Abstract
The application discloses a scene text recognition method based on semantic enhancement and graph reasoning, which relates to the technical fields of machine vision and natural language processing and comprises a visual recognition branch and an iterative correction branch. The visual recognition branch extracts visual features of the scene text with a convolution network and encodes and decodes the visual features in parallel. The iterative correction branch corrects the current recognition result through a semantic enhancement module, a fusion gate and a graph reasoning module: the semantic enhancement module enhances the text semantic features by exploiting the character context within the text, improving recognition precision; the fusion gate comprehensively considers visual and semantic information by fusing the text semantic features with the text visual features of the recognition module; the graph reasoning module establishes association relations between text characters, reasons over these relations, and corrects text characters whose features are not distinctive, further improving the recognition accuracy for scene text. The method and the device thereby improve the accuracy with which the network recognizes scene text.
Description
Technical Field
The invention relates to the technical field of machine vision and natural language processing, in particular to a scene text recognition method based on semantic enhancement and graph reasoning.
Background
Characters are one of the greatest inventions of human beings and the carriers of human civilization. More than 70% of the information humans obtain from the outside world comes from the visual system, and vision and text are the most common ways for humans to record and express their cognition of the world. Recognizing scene text from a visual image is therefore critical to properly understanding the image. Scene text recognition is one of the research hotspots of visual understanding and has wide application prospects in computer vision, such as automatic driving, assistance for the visually impaired, intelligent transportation, and visual question answering. Consequently, scene text recognition has received a great deal of attention.
Scene text recognition methods have made significant progress over the past decades, but still face major challenges. Text in scene images is diverse and its background complex: scene text varies greatly in font size, color, shape, aspect ratio and so on, and scene text images suffer from occlusion, distortion, blurring, uneven illumination, low text resolution, strong background interference and similar problems. To address these problems and further improve recognition accuracy, many researchers continue to propose new scene text recognition algorithms.
Before neural networks were widely adopted, traditional text recognition methods required designers to put their effort into handcrafting text features. The quality of these features depended on the designer's experience and was decisive for recognition performance, so such methods had low text recognition accuracy and poor robustness. The wide application of neural networks lets designers put more effort into model design, greatly improving the accuracy of text recognition. Neural-network-based scene text recognition treats the recognition task as a sequence generation task: a convolution network typically extracts visual features, which a language model then decodes to generate the recognized text. According to the decoder type, two classes can be distinguished: methods based on recurrent neural networks (Recurrent Neural Network, RNN) and methods based on the Transformer. RNN-based methods decode text from visual features recursively and cannot decode in parallel in either the training or the forward inference stage, so decoding is slow. In addition, the RNN was originally proposed for one-dimensional feature sequences and is ill-suited to directly processing the two-dimensional information of text in a scene image, so its recognition accuracy for scene text is low. Transformer-based methods can train in parallel, which improves training efficiency, but they still cannot decode in parallel in the forward inference stage.
Current neural-network-based methods have explored the semantic information in scene text only to a limited extent; when scene text images exhibit low resolution, poor illumination quality, or randomly varying text shapes, both the recognition precision and the recognition speed for scene text suffer.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a scene text recognition method based on semantic enhancement and graph reasoning, which improves the accuracy of existing scene text recognition methods.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a scene text recognition method based on semantic enhancement and graph reasoning comprises the following steps:
s1: acquiring a scene text image dataset for identifying network training;
s2: preprocessing the acquired images in the scene text image dataset;
s3: inputting the preprocessed image into a visual recognition branch of a recognition network, extracting text visual features in the image by the visual recognition branch through an internally arranged convolution network, and then completing visual recognition of the image based on the text visual features;
s4: inputting the output result of the visual recognition branch into an iterative correction branch of the recognition network, performing iterative correction on the visual recognition result by using ideas based on semantic enhancement and graph reasoning, and outputting a final recognition result to complete the training of the recognition network;
s5: the trained recognition network is used for recognizing the scene text.
In order to optimize the technical scheme, the specific measures adopted further comprise:
further, in step S2, the specific content of the preprocessing includes:
the image is randomly rotated, randomly sheared, randomly adjusted for brightness, randomly adjusted for contrast, randomly adjusted for size, randomly adjusted for saturation, randomly adjusted for gray value.
Further, in step S3, the visual recognition branch extracts the text visual features in the image through the internally arranged convolution network, and then completes the visual recognition of the image based on the text visual features, as follows:

s31: inputting the image to be recognized into the visual recognition branch, where the convolution network extracts the text visual features; the convolution network adopts a ResNet, and the extraction of the text visual features V from the image is calculated as:

V = F(X)

where X represents the image; F represents the ResNet convolutional network for text visual feature extraction; H and W represent the height and width of the image X, respectively; and C represents the dimension of the text visual features;

s32: inputting the characters of the text to be recognized in the image into the position embedding layer for position coding;

s33: based on the position information encoded by the position embedding layer in step S32, decoding the visual recognition result of the text from the text visual features extracted by the convolution network in step S31; specifically, a transformer is first adopted to encode the position information together with the text visual features, and the transformer encoding result V_out is calculated as:

V_out = softmax(Q K^T / √C) V

where Q is the position code of the characters in the sequence, Q ∈ R^(L×C); L is the length of the character sequence in the text; K = U(V), where U represents a U-Net network; and the values are the text visual features V. Then V_out undergoes a linear transformation and softmax activation-function processing to complete the recognition of the visual recognition branch and output the recognition result.
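The positional attention decoding of the visual branch can be illustrated with a minimal pure-Python sketch, where position codes act as queries over key/value visual features; the toy vectors and the plain scaled dot-product form are assumptions for illustration, not the patented network itself:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(queries, keys, values):
    # each position code (query) gathers visual features (values)
    # weighted by similarity to the keys -- all positions in parallel
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

# toy example: 2 character positions attending over 3 visual-feature vectors
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
decoded = attend(Q, K, V)
```

Each decoded row is a convex combination of the value vectors, which is what lets every character position be decoded independently and in parallel.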
Further, in step S4, the specific content of the iterative correction of the visual recognition result based on the ideas of semantic enhancement and graph reasoning is:

s41: during the first iterative correction, taking the visual recognition result of step S33 as the current recognition result of the recognition network;

s42: inputting the current recognition result into the embedding layer of the iterative correction branch to obtain the text embedding features;

s43: inputting the text embedding features of step S42 into the two semantic enhancement modules respectively, and extracting the text semantic features by combining the semantic context information of the text;

s44: encoding and decoding the text semantic features of step S43 together with the position coding features of step S32, and outputting the recognition result of the semantic enhancement module;

s45: inputting the recognition result of the semantic enhancement module in step S44 and the visual recognition result of step S33 into the fusion gate for fusion, so as to extract the fused text features;

s46, based on the fused text features output in step S45, establishing a correlation graph among the different characters belonging to the same text, completing recognition with the idea of graph reasoning, and taking the recognition result as the correction result of the current iterative correction;

s47, determining which iteration of the iterative correction branch the current correction result belongs to; when the number of iterations reaches the preset iterative correction threshold, outputting the graph reasoning recognition result of step S46 as the final recognition result; when it does not, outputting the graph reasoning recognition result of step S46 as the current recognition result of the recognition network, returning to step S42, and entering the next round of iterative correction.
Further, the specific content of step S43 is:

s431: inputting the text embedding features of step S42 into the two semantic enhancement modules respectively; let the data input to either semantic enhancement module be S, then the semantic enhancement module enhances the semantic expression as follows:

P = AvgPool(Conv(S))
A = ReLU(P)

where Conv represents a convolution along the length direction of the text character sequence; AvgPool represents an average pooling operation; ReLU represents the ReLU activation function; P represents the convolution pooling feature; and A represents the activation expression of the convolution feature;

s432: after this processing, the corresponding semantic enhancement module obtains the text semantic features S_out, calculated as:

S_out = S ⊗ Sigmoid(A)

where Sigmoid is the sigmoid activation function and ⊗ denotes element-wise multiplication.
Further, in step S44, the text semantic features of step S43 and the position coding features of step S32 are encoded and decoded, and the recognition result of the semantic enhancement module is output, as follows:

s441: the two text semantic features obtained after the processing of the two semantic enhancement modules serve respectively as the key K and the value V of the self-masked multi-head attention module; combined with the position coding features Q output by the position embedding layer in step S3, they are input to the self-masked multi-head attention module, which outputs the semantic enhancement features F_sem, calculated as:

F_sem = softmax(Q K^T / √C + M) V

where M is the self-masking attention matrix, whose elements on the diagonal are minus infinity while the off-diagonal elements are 0;

s442: the semantic enhancement features F_sem undergo a linear transformation and softmax activation-function processing, and the recognition result of the semantic enhancement module is output.
Further, in step S45, the fused text features F_fuse are extracted as follows:

g = Sigmoid(G · [F_sem ; V_out])
F_fuse = g ⊗ F_sem + (1 − g) ⊗ V_out

where G is a trainable parameter; [F_sem ; V_out] is the concatenation of the semantic enhancement features F_sem and the visual recognition features V_out; and g is the fusion gate weight.
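The fusion gate can be sketched per element: a sigmoid of a linear map of the concatenated semantic and visual features yields a gate that convexly combines the two. The scalar features and the weight vector `w` are illustrative stand-ins for the trainable parameter:

```python
import math

def fusion_gate(sem, vis, w):
    # gate g = sigmoid(w . [sem; vis]) per element, then a convex
    # combination g*sem + (1-g)*vis of semantic and visual features
    out = []
    for s, v in zip(sem, vis):
        g = 1.0 / (1.0 + math.exp(-(w[0] * s + w[1] * v)))
        out.append(g * s + (1.0 - g) * v)
    return out

fused = fusion_gate([1.0, -1.0], [0.0, 2.0], w=[0.5, 0.5])
```

Because the gate is a sigmoid, each fused value always lies between the semantic and the visual value, so neither information source can be discarded entirely.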
Further, in step S46, the specific contents are as follows:

s461: constructing an undirected fully connected graph G = (N, E) between the different characters belonging to the same text, where N represents the nodes of the undirected fully connected graph and E represents its edges; the different characters of the text are treated as nodes of the graph, i.e., node n_i takes its value from the fused text features F_fuse, while the correlation between nodes is treated as an edge; the relation graph A of the undirected fully connected graph is expressed mathematically as:

A_ij = softmax((W_1 n_i + b_1)^T (W_2 n_j + b_2))

where n_i represents the i-th node; n_j represents the j-th node; and W_1, b_1, W_2 and b_2 are trainable parameters;

s462: establishing a residual link and performing graph reasoning with the graph convolution network method; the calculation process is:

H = X W_r + A X W_g

where H represents the character feature expression after graph reasoning; W_r and W_g are respectively the residual weight matrix and the graph convolution weight matrix used for training; and A is the relation matrix of the graph. Since the graph convolution network has multiple layers, in the first layer the value of X equals the fused text features F_fuse, while in subsequent layers the value of X equals the output H of the previous layer;
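The graph construction and residual graph convolution can be illustrated with scalar node features: pairwise similarities are row-normalized into a relation graph, then one residual step mixes neighbour features back into each node. The scalar similarity and the fixed weights are simplifying assumptions:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def graph_reason(feats, w_res=1.0, w_graph=0.5):
    # relation graph: row-normalized pairwise similarity (scalar features),
    # then one residual graph-convolution step H = X*Wr + A*X*Wg
    n = len(feats)
    A = [softmax([feats[i] * feats[j] for j in range(n)]) for i in range(n)]
    return [w_res * feats[i] +
            w_graph * sum(A[i][j] * feats[j] for j in range(n))
            for i in range(n)]

refined = graph_reason([1.0, 0.9, 0.1])
```

A node with a weak feature (the 0.1 here, a character whose appearance is indistinct) is pulled toward its stronger neighbours, which is the intended correction effect.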
s463: the character features after graph reasoning undergo a linear transformation and softmax activation-function processing, and the text recognition result based on graph reasoning is output.
Further, the loss function of the entire recognition network is:

L = λ_v L_v + Σ_{i=1..T} (λ_s L_s^(i) + λ_f L_f^(i) + λ_g L_g^(i))

where L_v is the cross entropy loss of the features V_out in the visual recognition branch; L_s^(i), L_f^(i) and L_g^(i) are respectively the cross entropy losses of the features F_sem, F_fuse and H in the i-th iterative correction; λ_v, λ_s, λ_f and λ_g are the balance factors of the corresponding features in the visual recognition branch or the iterative correction branch; T is the number of correction iterations; and each cross entropy loss is computed over the character length of the text.
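The multi-objective loss can be sketched as a weighted sum of per-branch character-level cross entropies; the toy distributions and balance factors below are illustrative, not values specified by the method:

```python
import math

def cross_entropy(probs, target):
    # mean character-level cross entropy for one predicted text
    return -sum(math.log(p[t]) for p, t in zip(probs, target)) / len(target)

def total_loss(branch_losses, weights):
    # weighted sum: visual-branch loss plus the semantic, fusion and
    # graph losses of each correction iteration (flattened into one list)
    return sum(w * l for w, l in zip(weights, branch_losses))

# toy: 2-character text, 3-class distribution per character
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
ce = cross_entropy(probs, target=[0, 1])
loss = total_loss([ce, ce, ce, ce], weights=[1.0, 0.5, 0.5, 0.5])
```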
The beneficial effects of the invention are as follows:
1. Using the position-coded Transformer structure, encoding and decoding can be performed in parallel in both the training and forward inference stages, giving higher time efficiency.
2. The semantic enhancement module is designed to extract semantic features of the text more effectively, and accuracy based on visual feature recognition is improved.
3. By establishing a relation graph among the different characters of the same text, this explicit modeling enhances the expression of text features and improves the recognition accuracy of the recognition network model.
4. The current recognition is subjected to semantic enhancement and graph reasoning correction in an iterative mode, so that the recognition accuracy of the network to the scene text is further improved.
Drawings
Fig. 1 is a flow chart of the overall technical scheme of the invention.
Fig. 2 is a schematic diagram of a visual recognition branching flow scheme in a recognition network according to the present invention.
FIG. 3 is a schematic diagram of an iterative correction branch flow scheme in an identification network of the present invention.
Fig. 4 is a schematic diagram of a system framework of the identification network of the present invention.
Fig. 5 is a schematic diagram of the connection structure of each part of the system for identifying a network according to the present invention.
FIG. 6 is a schematic diagram of the internal structure of the semantic enhancement module of the present invention.
FIG. 7 is a schematic diagram of the result case of the network identification experiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings.
The whole technical scheme of the application is as follows:
referring to fig. 1, the application proposes a scene text recognition method based on semantic enhancement and graph reasoning, which includes:
s1, acquiring a scene text image dataset for model (recognition network) training in a preset mode;
s2, preprocessing a scene text image (image for short) needing to be input with a model;
s3, inputting the preprocessed scene text image into a visual recognition branch of a recognition network, extracting text visual characteristics of the image by a convolution network, and completing visual recognition of the scene text image based on the text visual characteristics;
s4, inputting the result of the visual identification branch into an iterative correction branch, and carrying out iterative correction on the visual identification result by using ideas based on semantic enhancement and graph reasoning.
Further, the scene text image preprocessing in step S2 mainly includes:
image random rotation, image random clipping, random brightness adjustment, random contrast adjustment, random image size adjustment, random saturation adjustment, random image gray scale adjustment, and the like.
Further, referring to fig. 2, the specific operation of step S3 is:

s31, inputting the scene text image to be recognized into the visual recognition branch, and extracting the text visual features with the convolution network;

s32, inputting the characters of the text to be recognized into the position embedding layer for position coding;

s33, decoding the visual recognition result of the text from the text visual features extracted by the convolution network in step S31, according to the position features encoded by the position embedding layer in step S32.
The specific operation of step S33 is as follows:

s331, the CNN visual module extracts the text visual features of the image;

s332, a U-Net network encodes the text visual features into the Key;

s333, the output features of the position coding layer serve as the Query, which together with the Key generates the weights over the text visual features;

s334, the weights are combined with the visual feature Value to generate the decoded text visual features;

s335, the decoded text visual features undergo a linear transformation and softmax activation-function processing to complete the recognition of the visual module and output the recognition result.
Further, referring to fig. 3, the specific operation of step S4 is:
s41, in the first iterative correction, taking the visual recognition result in the step S33 as the current recognition result of the model;
s42, inputting the current recognition result of the model into an iterative correction branch embedding layer to obtain text embedding characteristics;
s43, a text embedding feature input semantic enhancement module in the step S42 combines text semantic context information to extract text semantic features;
s44, encoding and decoding the text semantic features in the step S43 and the position encoding features in the step S32, and outputting the recognition result of the semantic enhancement module;
s45, inputting the visual recognition result in the step S33 and the semantic enhancement recognition result in the step S44 into a fusion door for fusion, and extracting fusion text features;
s46, based on the fused text characteristics output in the step S45, establishing a correlation diagram among different characters belonging to the same text, and completing identification by using the idea of diagram reasoning;
s47, outputting the result of the graph reasoning identification in the step S46 as final identification when the iteration times reach a preset iteration correction threshold;
and S48, when the iteration times do not reach the preset iteration correction threshold, outputting a result of graph reasoning identification in the step S46 as a current identification result input of the model, turning to the step S42, and entering the next iteration correction.
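The iteration control of steps S41-S48 can be sketched as a simple loop in which one round of semantic-enhancement, fusion and graph-reasoning correction is abstracted into a single callable; the toy `fix` function and the string representation of a recognition result are illustrative assumptions:

```python
def iterative_correction(visual_result, correct_once, max_iters=3):
    # start from the visual recognition result and repeatedly apply one
    # round of correction until the iteration threshold is reached
    current = visual_result
    for _ in range(max_iters):
        current = correct_once(current)
    return current

# toy stand-in for one correction round: repair one uncertain character
fix = lambda text: text.replace("?", "e", 1)
final = iterative_correction("t?xt r?ad", fix, max_iters=2)
```

Feeding each round's output back in as the next round's input is the feedback structure the branch relies on: characters fixed early provide cleaner context for the remaining ones.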
The specific operation of step S43 is as follows:

s431, the iterative correction branch embedding layer encodes the currently recognized text to obtain the text embedding features;

s432, the SE semantic enhancement module shown in FIG. 6 enhances the text semantic features from the current text context according to the text embedding features.
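A scalar sketch of this squeeze-and-excitation-style enhancement: a convolution along the character sequence, average pooling, ReLU and a sigmoid produce a gate that rescales the input features. The one-dimensional features and the kernel values are simplifying assumptions:

```python
import math

def semantic_enhance(seq, kernel):
    # conv along the character sequence -> average pool -> ReLU ->
    # sigmoid gate, then rescale the input (scalar features for brevity)
    k = len(kernel)
    conv = [sum(kernel[j] * seq[i + j] for j in range(k))
            for i in range(len(seq) - k + 1)]
    pooled = sum(conv) / len(conv)                    # average pooling
    gate = 1.0 / (1.0 + math.exp(-max(0.0, pooled)))  # ReLU then sigmoid
    return [gate * x for x in seq]                    # re-weight the sequence

enhanced = semantic_enhance([1.0, 2.0, 3.0, 4.0], kernel=[0.5, 0.5])
```

The gate is computed from the whole sequence, which is how context beyond a single character influences each character's feature.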
The specific operation of step S44 is as follows:

s441, the output features of the position coding layer and the text semantic features enhanced by the semantic enhancement module are decoded by the self-masked multi-head attention (Self-masked Multihead Attention) module to obtain text decoding features based on multi-head attention;

s442, the text decoding features based on multi-head attention undergo a linear transformation and softmax activation-function processing, completing the recognition of the semantic enhancement module (Semantic enhancement Module).
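The self-masking can be illustrated directly: a score matrix receives minus infinity on its diagonal (the mask adds 0 elsewhere) before the row-wise softmax, so no character attends to itself and each must be explained by its context. The score values below are arbitrary:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_masked_weights(scores):
    # diagonal -> minus infinity, off-diagonal unchanged, then row-wise
    # softmax: every character's weight on itself becomes exactly zero
    n = len(scores)
    masked = [[-math.inf if i == j else scores[i][j] for j in range(n)]
              for i in range(n)]
    return [softmax(row) for row in masked]

W = self_masked_weights([[5.0, 1.0, 2.0],
                         [1.0, 5.0, 2.0],
                         [1.0, 2.0, 5.0]])
```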
The specific operation of step S46 is as follows:

s461, regarding the different character features of the same text as nodes in the graph, and computing the feature similarity between nodes, which serves as the edges linking the nodes;

s462, carrying out relational reasoning over the text features according to the node relation graph constructed in S461, obtaining the feature expression after graph reasoning;

s463, the graph-enhanced text features undergo a linear transformation and softmax activation-function processing, and the text recognition result based on graph reasoning is output.
Further, an iterative correction branch is set in step S42, and a semantic enhancement module, a fusion gate and a graph reasoning module are arranged in the correction branch.
According to the method, the model is trained end-to-end in a supervised manner; during training, scene text recognition in the visual recognition branch is combined with the semantic enhancement recognition, fusion gate recognition and graph reasoning recognition of the iterative correction branch, and the resulting multi-objective loss function can be expressed as:
L = λ_v L_v + Σ_{i=1..T} (λ_s L_s^(i) + λ_f L_f^(i) + λ_g L_g^(i))

where L_v is the cross entropy loss of the features V_out in the visual recognition branch; L_s^(i), L_f^(i) and L_g^(i) are respectively the cross entropy losses of the features F_sem, F_fuse and H in the i-th iterative correction; λ_v, λ_s, λ_f and λ_g are the balance factors of the corresponding features in the visual recognition branch or the iterative correction branch; T is the number of correction iterations; and each cross entropy loss is computed over the character length of the text.
Meanwhile, referring to fig. 4, the application further provides a scene text recognition system based on semantic enhancement and graph reasoning, which comprises: a visual recognition branch and an iterative correction branch; the iterative correction branch comprises a semantic enhancement module, a fusion gate and a graph reasoning module.
Further, the visual recognition branch extracts the text visual features of the scene with a convolution network and recognizes the scene text based on those features; the iterative correction branch corrects the current scene text recognition result through semantic enhancement and graph reasoning, and finally outputs the model's recognition result for the scene text.
Further, the visual recognition branch includes:

the convolution network, which extracts the text visual features of the scene;

the position embedding layer, which encodes the character position information in the scene text;

and the decoding of the visual recognition result based on the text visual features of the scene text and the position embedding coding information.
Further, the iterative correction branch includes:

the semantic enhancement module: the current recognition result of the model is encoded by the embedding layer of the iterative correction branch; the encoded features are input into the semantic enhancement module to enhance the semantic expression of the text features; the enhanced semantic expression and the output of the position embedding layer are input to the self-masked multi-head attention module, and the recognition result of the semantic enhancement module is decoded;

the fusion gate, which fuses the recognition features of the semantic enhancement module with the recognition features of the visual recognition branch, so as to comprehensively consider the visual and semantic features of the text;
the graph reasoning module is used for establishing a correlation graph between different characters in the same text based on the output of the fusion gate, and recognizing the scene text as a correction result of the current iteration correction branch based on the correlation graph reasoning;
when the iteration times reach a preset iteration correction threshold value, taking a graph reasoning recognition result as a final recognition; when the iteration times do not reach the preset iteration correction threshold, the result of graph reasoning recognition is input as the current recognition result of the model, and the next iteration correction is carried out. In the first iterative correction, the current recognition result of the model comes from the visual recognition branch.
The following is described in further detail:
referring to fig. 5, the network structure of the present invention is composed of two parts, a visual recognition branch and an iterative correction branch.
The operation of the visual recognition branch includes:

(one), the convolution network extracts the text visual features of the scene. Let the scene text image be X; using the ResNet convolutional network as the text visual feature extraction module, the extraction of the text visual features V of the scene can be expressed as:

V = F(X)

where X represents the image; F represents the ResNet convolutional network for text visual feature extraction; H and W represent the height and width of the image X, respectively; and C represents the dimension of the text visual features.
And (II) the position embedding layer encodes character position information in the scene text.
(3) The visual recognition result is decoded based on the text visual features and the position embedding information. A transformer is used to encode the position information together with the text visual features; the encoding process of a transformer unit can be formulated as:

g′v = softmax(Q K′ᵀ / √C) R′v

where Q is the position code of the characters in the sequence, and Q ∈ R^(T×C); T is the length of the character sequence in the recognized text; K′ = U(R′v), where U() denotes a U-Net network. The character recognition result of the visual branch is then obtained by applying a linear transformation and a softmax activation to g′v.
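The positional attention decoding above can be sketched as follows (a minimal numpy sketch under illustrative assumptions: the feature map is already flattened to a sequence, and an identity function stands in for the U-Net that produces K′):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def positional_attention_decode(R_v, Q, U):
    """Decode character features from visual features R_v using learned
    position codes Q as queries. Hypothetical shapes: R_v is (H*W/16, C)
    flattened; Q is (T, C). U stands in for the U-Net producing K'."""
    K = U(R_v)                              # K' = U(R_v)
    C = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(C))    # (T, num_positions)
    return attn @ R_v                       # g'_v: (T, C)

# toy run: an 8x32 feature map with C = 64, sequence length T = 26
rng = np.random.default_rng(0)
R_v = rng.standard_normal((8 * 32, 64))
Q = rng.standard_normal((26, 64))
g_v = positional_attention_decode(R_v, Q, U=lambda x: x)  # identity "U-Net"
print(g_v.shape)  # (26, 64)
```

A linear classifier over `g_v` followed by softmax would then yield the per-character recognition result of the visual branch.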
The operation of the iterative correction branch includes:
In the first step, the semantic enhancement module (see fig. 6): the current recognition result of the model is encoded by the iterative-correction-branch embedding layer, the encoded features are input into the semantic enhancement module to enhance the semantic expression of the text features, and the enhanced text features, together with the position encoding information of the position embedding layer, are input into the self-masking multi-head attention module, which decodes the recognition result of the semantic enhancement module. Assume the input feature is F′r; then:

F′c = Pool(CNN1(F′r)), Eδ = δ(F′c)

where CNN1() represents a convolution along the length direction of the text character sequence; Pool() represents an average pooling operation; δ() represents the Relu activation function; F′c represents the convolution pooling feature; Eδ represents the activation expression of the convolution feature;
The semantically enhanced features (text semantic features) can then be formulated as:

F′e = F′r · σ(CNN2(Eδ)) + F′r

where σ() denotes the Sigmoid activation function. Referring to fig. 5, since there are two semantic enhancement modules, two corresponding text semantic features F′e are obtained; in the self-masking attention transformer they serve as the key K and the value V, while the query Q takes its value from the position codes of the character sequence. The semantically enhanced text features and the output of the position embedding layer pass through the self-masking multi-head attention, formulated as:

g′L = softmax(Q Kᵀ / √C + M) V

where M is the self-masking attention matrix, whose diagonal elements are minus infinity and whose off-diagonal elements are 0; its function is to prevent each character from leaking its own information during bidirectional encoding. The recognition result based on the semantic enhancement features is then obtained by applying a trainable linear transformation and a softmax activation to g′L.
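The semantic enhancement gating and the self-masking attention described above can be sketched as follows (a minimal numpy sketch; the convolutions along the sequence are modelled as plain matrix multiplications, and all shapes and weights are illustrative assumptions):

```python
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def relu(x): return np.maximum(x, 0.0)
def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def semantic_enhance(F_r, W1, W2):
    """F'_e = F'_r * sigmoid(CNN2(relu(Pool(CNN1(F'_r))))) + F'_r.
    CNN1/CNN2 are modelled as matrix multiplications; Pool is average
    pooling over the sequence dimension."""
    F_c = (F_r @ W1).mean(axis=0, keepdims=True)  # CNN1 + average pool
    E = relu(F_c)                                  # E_delta
    gate = sigmoid(E @ W2)                         # CNN2 + Sigmoid
    return F_r * gate + F_r                        # gated residual

def self_masking_attention(Q, K, V):
    """softmax(QK^T/sqrt(C) + M)V with M = -inf on the diagonal,
    so position i cannot attend to its own character."""
    T, C = Q.shape
    M = np.where(np.eye(T, dtype=bool), -np.inf, 0.0)
    return softmax(Q @ K.T / np.sqrt(C) + M) @ V

rng = np.random.default_rng(1)
T, C = 10, 32
F_r = rng.standard_normal((T, C))
F_e = semantic_enhance(F_r, rng.standard_normal((C, C)), rng.standard_normal((C, C)))
Q = rng.standard_normal((T, C))
g_L = self_masking_attention(Q, K=F_e, V=F_e)      # semantic features as K, V
```

Note how the mask zeroes the diagonal of the attention map, so each output position is reconstructed only from the other characters' semantics.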
In the second step, the fusion gate fuses the recognition features of the semantic enhancement module with those of the visual recognition branch, so that the visual features and the semantic features of the text are considered jointly; the fusion process can be formulated as:

G = σ([g′v, g′L] Wf), F = G ⊙ g′v + (1 − G) ⊙ g′L

where Wf ∈ R^(2C×C) is a trainable parameter, [g′v, g′L] is the concatenation of g′v and g′L, and G ∈ R^(T×C).
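The fusion gate can be sketched as follows (a minimal numpy sketch assuming an ABINet-style gate G = σ([g′v, g′L]Wf) with F = G ⊙ g′v + (1 − G) ⊙ g′L; shapes and weights are illustrative):

```python
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

def fusion_gate(g_v, g_L, W_f):
    """Gated fusion of the visual and semantic recognition features.
    g_v, g_L: (T, C) branch features; W_f: (2C, C) trainable weight.
    The gate G decides, per element, how much of each branch to keep."""
    G = sigmoid(np.concatenate([g_v, g_L], axis=-1) @ W_f)  # (T, C)
    return G * g_v + (1.0 - G) * g_L

rng = np.random.default_rng(2)
T, C = 10, 32
g_v = rng.standard_normal((T, C))
g_L = rng.standard_normal((T, C))
F = fusion_gate(g_v, g_L, rng.standard_normal((2 * C, C)))
```

Because the gate is a sigmoid, each fused element is an elementwise convex combination of the visual and semantic features.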
In the third step, the graph reasoning module establishes a correlation graph among the different characters of the same text based on the output of the fusion gate, and recognizes the scene text by reasoning over this correlation graph as the correction result of the current iteration of the correction branch;
An undirected fully connected graph G = (V′, E) is built among the text characters, where V′ denotes the nodes of the undirected fully connected graph and E denotes its edges. The different characters of the text are treated as nodes of the graph, i.e. node V′ takes its value from the fused feature F, while the correlation between nodes is treated as an edge. The relation graph E(V′i, V′j) of the undirected fully connected graph is expressed mathematically as:

E(V′i, V′j) = θ(V′i)ᵀ φ(V′j)

where V′i denotes the i-th node; V′j denotes the j-th node; θ(V′i) = W′θ V′i, φ(V′j) = W′φ V′j, and W′θ and W′φ are trainable parameters;
Residual links are established and graph reasoning is carried out with a graph convolutional network; the calculation process is:

Vg = Wr(E V″ Wg) + V″

where Vg represents the character feature expression after graph reasoning; Wr and Wg are the trainable residual weight matrix and graph convolution weight matrix, respectively. The character features after graph reasoning are then processed by a linear transformation and a softmax activation function, and the text recognition result based on graph reasoning is output.
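The graph construction and residual graph convolution above can be sketched as follows (a minimal numpy sketch; the row-wise softmax normalization of the affinity matrix is an assumption, as are all shapes and weights):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_reasoning(V, W_theta, W_phi, W_g, W_r):
    """One graph-convolution step with a residual link over the
    undirected fully connected character graph:
      E[i, j] = theta(V_i)^T phi(V_j)   (softmax-normalized per row)
      V_g = W_r(E V W_g) + V
    V: (T, C) fused character features, one row per character node."""
    theta, phi = V @ W_theta, V @ W_phi
    E = softmax(theta @ phi.T, axis=-1)   # (T, T) character affinity matrix
    return (E @ V @ W_g) @ W_r + V        # residual graph convolution

rng = np.random.default_rng(3)
T, C = 10, 32
V = rng.standard_normal((T, C))           # first layer: V'' = fused feature F
Ws = [rng.standard_normal((C, C)) * 0.1 for _ in range(4)]
V_g = graph_reasoning(V, *Ws)
```

Stacking further layers would feed each layer's `V_g` back in as the next layer's node features, matching the multi-layer description in the text.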
When the number of iterations reaches the preset iterative-correction threshold, the graph reasoning recognition result is taken as the final recognition result; when the number of iterations has not reached the preset threshold, the graph reasoning recognition result is fed back as the current recognition result of the model and the next iterative correction is carried out. In the first iterative correction, the current recognition result of the model comes from the visual recognition branch.
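The iterative correction loop can be sketched as follows (schematic only; `correct_once` is a hypothetical stand-in for one pass through embedding, semantic enhancement, fusion gate and graph reasoning):

```python
def iterative_correction(visual_result, correct_once, n_iters=3):
    """Skeleton of the iterative correction loop: the visual branch
    seeds the first iteration; each subsequent iteration refines the
    previous graph-reasoning output. n_iters plays the role of the
    preset iterative-correction threshold."""
    result = visual_result          # first iteration: visual branch output
    for _ in range(n_iters):
        result = correct_once(result)
    return result                   # final recognition result

# toy check with a stand-in corrector that fixes a mis-read character
fixed = iterative_correction("hel1o", lambda s: s.replace("1", "l"), n_iters=2)
print(fixed)  # "hello"
```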
As shown in fig. 7, under each example picture three lines of text are given: the first is the ground-truth text label, the second is the recognition result of the ABINet model, and the third is the recognition result of the embodiment described herein; the figure shows that the invention achieves better text recognition performance than the ABINet model.
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to the invention without departing from the principles thereof are intended to be within the scope of the invention as set forth in the following claims.
Claims (7)
1. A scene text recognition method based on semantic enhancement and graph reasoning is characterized by comprising the following steps:
s1: acquiring a scene text image dataset for identifying network training;
s2: preprocessing the acquired images in the scene text image dataset;
s3: inputting the preprocessed image into a visual recognition branch of a recognition network, extracting text visual features in the image by the visual recognition branch through an internally arranged convolution network, and then completing visual recognition of the image based on the text visual features; the specific contents are as follows:
s31: inputting an image to be recognized into the visual recognition branch, and extracting the text visual features in the image with a convolutional network; wherein the convolutional network adopts a Resnet convolutional network, and the Resnet convolutional network is used to extract the text visual features Rv in the image; the expression formula is:

Rv = Rnet(I)

wherein I represents the image; Rnet() represents the Resnet convolutional network for text visual feature extraction; H and W represent the height and width of the image I, respectively; C represents the dimension of the text visual features;
s32: inputting characters in a text to be recognized in an image to a position embedding layer for position coding;
s33: decoding the visual recognition result of the text from the text visual features extracted by the convolutional network in step S31 according to the position information encoded by the position embedding layer in step S32; specifically, a transformer is used to encode the position information and the text visual features, and the transformer encoding result g′v is computed as:

g′v = softmax(Q K′ᵀ / √C) R′v

wherein Q is the position code of the characters in the sequence, and Q ∈ R^(T×C); T is the length of the character sequence in the recognized text; K′ = U(R′v), where U() represents a U-Net network; the result g′v is then processed by a linear transformation and a softmax activation function to complete the recognition of the visual recognition branch and output the recognition result;
s4: inputting the output result of the visual recognition branch into an iterative correction branch of the recognition network, performing iterative correction on the visual recognition result by using ideas based on semantic enhancement and graph reasoning, and outputting a final recognition result to complete the training of the recognition network; the method comprises the following steps of:
s41: when the iterative correction is carried out for the first time, the visual recognition result in the step S33 is used as the current recognition result of the recognition network;
s42: inputting the current recognition result into an iterative correction branch embedding layer to obtain text embedding characteristics;
s43: the text embedded features in the step S42 are selected and respectively input to two semantic enhancement modules, and text semantic features are extracted by combining text semantic context information;
s44: encoding and decoding the text semantic features in the step S43 and the position encoding features in the step S32, and outputting the recognition result of the semantic enhancement module;
s45: inputting the recognition result of the semantic enhancement module in step S44 and the visual recognition result in step S33 into the fusion gate for fusion, so as to extract the fused text features;
s46, based on the fused text characteristics output in the step S45, building a correlation diagram among different characters belonging to the same text, completing identification by using the idea of diagram reasoning, and taking the identification result as a correction result of a current iteration correction branch;
s47, determining the number of times that the current correction result belongs to iterative processing in an iterative correction branch, and outputting the result of graph reasoning identification in the step S46 as a final identification result when the number of iterations reaches a preset iterative correction threshold number; when the iteration times do not reach the preset iteration correction threshold times, outputting a result of the graph reasoning identification in the step S46 as a current identification result of the identification network, turning to the step S42, and entering the next iteration correction;
s5: the trained recognition network is used for recognizing the scene text.
2. The scene text recognition method based on semantic enhancement and graph reasoning according to claim 1, wherein in step S2, the specific content of the preprocessing includes:
the image is randomly rotated, randomly sheared, randomly adjusted for brightness, randomly adjusted for contrast, randomly adjusted for size, randomly adjusted for saturation, randomly adjusted for gray value.
3. The scene text recognition method based on semantic enhancement and graph reasoning according to claim 1, wherein the specific content of step S43 is:
s431: the text embedded features in step S42 are respectively input into the two semantic enhancement modules; let the data input to either semantic enhancement module be F′r; the semantic enhancement module enhances the semantic expression as follows:

F′c = Pool(CNN1(F′r)), Eδ = δ(F′c)

wherein CNN1() represents a convolution along the length direction of the text character sequence; Pool() represents an average pooling operation; δ() represents the Relu activation function; F′c represents the convolution pooling feature; Eδ represents the activation expression of the convolution feature;
s432: the corresponding semantic enhancement module is processed to obtain the text semantic features F′e; the calculation formula is:

F′e = F′r · σ(CNN2(Eδ)) + F′r

wherein σ() denotes the Sigmoid activation function.
4. The scene text recognition method based on semantic enhancement and graph reasoning according to claim 3, wherein in step S44, the text semantic features in step S43 and the position coding features in step S32 are encoded and decoded, and the specific content of the recognition result of the output semantic enhancement module is:
s441: the two corresponding text semantic features F′e obtained after the processing of the two semantic enhancement modules serve respectively as the key K and the value V of the self-masking multi-head attention module, are combined with the position coding feature Q output by the position embedding layer in step S32, and are then input together into the self-masking multi-head attention module for processing; the semantic enhancement feature g′L is output, and the processing formula is:

g′L = softmax(Q Kᵀ / √C + M) V

wherein M is the self-masking attention matrix, whose diagonal elements are minus infinity and whose off-diagonal elements are 0;

s442: a linear transformation and a softmax activation function are applied to the semantic enhancement feature g′L, and the recognition result of the semantic enhancement module is output.
5. The scene text recognition method based on semantic enhancement and graph reasoning according to claim 4, wherein in step S45, the specific content of extracting the fused text feature F is:

G = σ([g′v, g′L] Wf), F = G ⊙ g′v + (1 − G) ⊙ g′L

wherein Wf ∈ R^(2C×C) is a trainable parameter, [g′v, g′L] is the concatenation of g′v and g′L, and G ∈ R^(T×C).
6. The scene text recognition method based on semantic enhancement and graph reasoning according to claim 5, wherein in step S46, the specific contents are:
s461: constructing an undirected fully connected graph G = (V′, E) among the different characters belonging to the same text, wherein V′ represents the nodes of the undirected fully connected graph and E represents its edges; the different characters of the text are regarded as nodes of the graph, i.e. node V′ takes its value from the fused feature F, while the correlation between nodes is regarded as an edge; the relation graph E(V′i, V′j) of the undirected fully connected graph is expressed mathematically as:

E(V′i, V′j) = θ(V′i)ᵀ φ(V′j)

wherein V′i represents the i-th node; V′j represents the j-th node; θ(V′i) = W′θ V′i, φ(V′j) = W′φ V′j; W′θ and W′φ are trainable parameters;
s462: establishing residual links, performing graph reasoning by using a graph convolution network method, wherein the calculation process is as follows:
Vg = Wr(E V″ Wg) + V″

wherein Vg represents the character feature expression after graph reasoning; Wr and Wg are the trainable residual weight matrix and graph convolution weight matrix, respectively; E represents E(V′i, V′j) and serves as the adjacency matrix; since the graph convolutional network is multi-layered, the value of V″ at the first layer is the fused text feature F, while in each subsequent layer the value of V″ equals the Vg of the preceding layer;
s463: and performing linear transformation and softmax activation function processing on character features after graph reasoning, and outputting text recognition based on the graph reasoning.
7. The scene text recognition method based on semantic enhancement and graph reasoning according to claim 6, wherein the loss function of the whole recognition network is:
wherein L′v is the cross entropy loss of the feature g′v in the visual recognition branch; L′ⁱL, L′ⁱf and L′ⁱg are the cross entropy losses of the features g′L, F and Vg in the i-th iteration of the correction branch; r′v, r′L, r′f and r′g are the weights corresponding to the features g′v, g′L, F and Vg in the visual recognition branch or the iterative correction branch, respectively; N represents the character length in the text.
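The multi-branch cross entropy objective described in claim 7 can be sketched as follows (a minimal numpy sketch; the exact weighting and averaging in the claim's formula are assumptions, and all names and shapes are illustrative):

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross entropy over a character sequence.
    logits: (T, V) unnormalized scores; labels: (T,) class indices."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def total_loss(visual_logits, iter_logits, labels, weights):
    """Hypothetical combination: weighted CE of the visual branch plus
    weighted CE of the g'_L, F and V_g logits from each correction pass."""
    r_v, r_L, r_f, r_g = weights
    loss = r_v * cross_entropy(visual_logits, labels)
    for gL_logits, F_logits, Vg_logits in iter_logits:
        loss += (r_L * cross_entropy(gL_logits, labels)
                 + r_f * cross_entropy(F_logits, labels)
                 + r_g * cross_entropy(Vg_logits, labels))
    return loss

rng = np.random.default_rng(4)
T, Vocab = 6, 37  # e.g. 26 letters + 10 digits + padding
labels = rng.integers(0, Vocab, size=T)
mk = lambda: rng.standard_normal((T, Vocab))
loss = total_loss(mk(), [(mk(), mk(), mk())], labels, (1.0, 1.0, 1.0, 1.0))
```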
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310341392.7A CN116052154B (en) | 2023-04-03 | 2023-04-03 | Scene text recognition method based on semantic enhancement and graph reasoning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116052154A CN116052154A (en) | 2023-05-02 |
CN116052154B true CN116052154B (en) | 2023-06-16 |
Family
ID=86118635
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310341392.7A Active CN116052154B (en) | 2023-04-03 | 2023-04-03 | Scene text recognition method based on semantic enhancement and graph reasoning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116052154B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117710986B (en) * | 2024-02-01 | 2024-04-30 | 长威信息科技发展股份有限公司 | Method and system for identifying interactive enhanced image text based on mask |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113033249A (en) * | 2019-12-09 | 2021-06-25 | 中兴通讯股份有限公司 | Character recognition method, device, terminal and computer storage medium thereof |
CN112733768B (en) * | 2021-01-15 | 2022-09-09 | 中国科学技术大学 | Natural scene text recognition method and device based on bidirectional characteristic language model |
CN114092930B (en) * | 2022-01-07 | 2022-05-03 | 中科视语(北京)科技有限公司 | Character recognition method and system |
2023-04-03 — CN application CN202310341392.7A, patent CN116052154B, status: Active
Also Published As
Publication number | Publication date |
---|---|
CN116052154A (en) | 2023-05-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112232149B (en) | Document multimode information and relation extraction method and system | |
CN108388900A (en) | The video presentation method being combined based on multiple features fusion and space-time attention mechanism | |
CN112733768B (en) | Natural scene text recognition method and device based on bidirectional characteristic language model | |
CN109919174A (en) | A kind of character recognition method based on gate cascade attention mechanism | |
CN111783705A (en) | Character recognition method and system based on attention mechanism | |
CN113284100B (en) | Image quality evaluation method based on recovery image to mixed domain attention mechanism | |
CN116052154B (en) | Scene text recognition method based on semantic enhancement and graph reasoning | |
CN115964467A (en) | Visual situation fused rich semantic dialogue generation method | |
CN110033008A (en) | A kind of iamge description generation method concluded based on modal transformation and text | |
CN113064968B (en) | Social media emotion analysis method and system based on tensor fusion network | |
CN112651940B (en) | Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network | |
CN107463928A (en) | Word sequence error correction algorithm, system and its equipment based on OCR and two-way LSTM | |
CN110175248A (en) | A kind of Research on face image retrieval and device encoded based on deep learning and Hash | |
CN112949608A (en) | Pedestrian re-identification method based on twin semantic self-encoder and branch fusion | |
CN112528643A (en) | Text information extraction method and device based on neural network | |
CN114462420A (en) | False news detection method based on feature fusion model | |
CN109766918A (en) | Conspicuousness object detecting method based on the fusion of multi-level contextual information | |
CN115205640A (en) | Rumor detection-oriented multi-level image-text fusion method and system | |
CN117033609B (en) | Text visual question-answering method, device, computer equipment and storage medium | |
CN116030537B (en) | Three-dimensional human body posture estimation method based on multi-branch attention-seeking convolution | |
CN116258147A (en) | Multimode comment emotion analysis method and system based on heterogram convolution | |
CN116311493A (en) | Two-stage human-object interaction detection method based on coding and decoding architecture | |
Liu et al. | Capsule embedded resnet for image classification | |
CN114821569A (en) | Scene text recognition method and system based on attention mechanism | |
CN113722536A (en) | Video description method based on bilinear adaptive feature interaction and target perception |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||