CN111507328A - Text recognition and model training method, system, equipment and readable storage medium - Google Patents

Text recognition and model training method, system, equipment and readable storage medium

Info

Publication number
CN111507328A
Authority
CN
China
Prior art keywords
dimensional
image
image features
feature
tensor
Prior art date
Legal status
Pending
Application number
CN202010270210.8A
Other languages
Chinese (zh)
Inventor
邬国锐
卿山
王庆庆
Current Assignee
Beijing Aikaka Information Technology Co ltd
Original Assignee
Beijing Aikaka Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Aikaka Information Technology Co ltd
Priority to CN202010270210.8A
Publication of CN111507328A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/22: Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06K: GRAPHICAL DATA READING; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
    • G06K 7/00: Methods or arrangements for sensing record carriers, e.g. for reading patterns
    • G06K 7/10: Methods or arrangements for sensing record carriers by electromagnetic radiation, e.g. optical sensing; by corpuscular radiation
    • G06K 7/14: Methods or arrangements for sensing record carriers using light without selection of wavelength, e.g. sensing reflected white light
    • G06K 7/1404: Methods for optical code recognition
    • G06K 7/1439: Methods for optical code recognition including a method step for retrieval of the optical code
    • G06K 7/1443: Methods for optical code recognition including locating of the code in an image
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10: Character recognition


Abstract

The invention discloses a text recognition and model training method, system, equipment and readable storage medium. In the encoding stage of text recognition, the image features of the picture to be recognized are extracted through a dense convolutional neural network, so that the extracted features are more abstract and carry richer semantic information. Two-dimensional position coding information is added to the image features to generate image features containing position information; when the image features are decoded, the added two-dimensional position code locates the positions of characters in the image more accurately, so the corresponding text characters are recognized more accurately and the accuracy of curved-text recognition is improved. In the decoding stage, the image features containing the position information are decoded by a transformer decoding layer containing a two-dimensional attention mechanism, which makes full use of the two-dimensional spatial information of the image; training is carried out in a weakly supervised manner, which further improves the accuracy of curved-text recognition.

Description

Text recognition and model training method, system, equipment and readable storage medium
Technical Field
The invention relates to the technical field of image processing, and in particular to a text recognition and model training method, system, equipment and readable storage medium.
Background
In daily work and life, computer technology is often used to recognize text on paper documents, such as the characters on various bills or the identity information on identity documents, and image-based character recognition has become an important research topic in computer vision.
At present, optical character recognition (OCR) technology is mainly used to recognize text information printed on paper. It uses optical and computer technology to read characters printed or written on paper and convert them into a format that people can understand. The OCR processing steps mainly comprise: image preprocessing, layout analysis, text positioning (or image cutting), character segmentation and recognition, and the like.
Text fonts and shapes in natural scenes are diverse, and occlusion, uneven illumination, excessive noise and similar conditions are common. In particular, the many curved texts in natural scenes, such as curved trademarks and seals, often carry very important information, so the requirement on recognition accuracy is high. However, in the prior art the accuracy of recognizing curved text in natural scenes is low, and how to improve it has become a technical problem to be solved urgently.
Disclosure of Invention
The invention provides a text recognition and model training method, system, equipment and readable storage medium, which are used to overcome the above technical problems in the prior art and improve the accuracy of curved-text recognition in natural scenes.
The invention provides a text recognition method, which comprises the following steps:
extracting image features of the picture to be identified through a dense convolutional neural network;
adding two-dimensional position coding information to the image features to generate image features containing position information;
and decoding the image features containing the position information through a transform decoding layer containing a two-dimensional attention mechanism to obtain a recognition result.
The present invention also provides a text recognition model training method, wherein the text recognition model comprises: an encoding module and a decoding module. The encoding module is configured to: extract image features of a picture to be identified through a dense convolutional neural network, add two-dimensional position coding information to the image features, and generate image features containing position information. The decoding module comprises a transformer decoding layer containing a two-dimensional attention mechanism, and the transformer decoding layer containing the two-dimensional attention mechanism is used for decoding the image features containing the position information to obtain a recognition result.
The training method comprises the following steps:
acquiring a training set for natural scene text recognition, wherein the training set comprises at least a plurality of pieces of curved-text training data, and each piece of curved-text training data comprises: a sample picture containing curved text and the corresponding text labeling information;
and training the text recognition model through the training set.
The present invention also provides a text recognition system comprising:
the coding module is used for extracting image features of the picture to be identified through a dense convolutional neural network, adding two-dimensional position coding information into the image features and generating image features containing position information;
and the decoding module is used for decoding the image features containing the position information through a transform decoding layer containing a two-dimensional attention mechanism to obtain an identification result.
The present invention also provides a text recognition apparatus, comprising:
a processor, a memory, and a computer program stored on the memory and executable on the processor; when the processor runs the computer program, the text recognition method and/or the text recognition model training method described above are implemented.
The present invention also provides a computer-readable storage medium storing a computer program which can be executed to execute the above-described text recognition method and/or text recognition model training method.
In the encoding stage of text recognition, the image features of the picture to be recognized are extracted through the dense convolutional neural network, so that the extracted features are more abstract and carry richer semantic information. Two-dimensional position coding information is added to the image features to generate image features containing position information; when the image features are decoded, the added two-dimensional position code locates the positions of characters in the image more accurately, so the corresponding text characters are recognized more accurately and the accuracy of curved-text recognition is improved. In the decoding stage, the image features containing the position information are decoded by a transformer decoding layer containing a two-dimensional attention mechanism, which makes full use of the two-dimensional spatial information of the image; training is carried out in a weakly supervised manner, which further improves the accuracy of curved-text recognition.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a flowchart of a text recognition method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a conventional transformer model according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a text recognition model according to an embodiment of the present invention;
fig. 4 is a flowchart of adding a two-dimensional position code according to a second embodiment of the present invention;
fig. 5 is a flow chart of two-dimensional attention vector determination according to a third embodiment of the present invention;
fig. 6 is a schematic diagram of a two-dimensional attention vector determination process according to a third embodiment of the present invention;
fig. 7 is a flowchart of a text recognition model training method according to a fourth embodiment of the present invention;
fig. 8 is a schematic structural diagram of a text recognition system according to a fifth embodiment of the present invention;
fig. 9 is a schematic structural diagram of a text recognition system according to a sixth embodiment of the present invention;
fig. 10 is a schematic structural diagram of a text recognition apparatus according to a seventh embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first", "second", etc. referred to in the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. In the description of the following examples, "plurality" means two or more unless specifically limited otherwise.
At present, the main difficulty in the field of scene curved-text recognition lies in the "alignment" (hereinafter referred to as "alignment") of each text character with the image text region, that is, in how to accurately recognize the text characters in the image text region. Compared with curved text, this "alignment" operation is relatively simple for conventional straight text. For this technical difficulty, the invention performs the "alignment" operation on the text region in the following four ways: extracting image features with a convolutional neural network; adding a two-dimensional position code to the extracted image features; extracting the correlation between characters with a transformer decoding layer (i.e., a transformer-decoder) and realizing the alignment operation with the image features; and adopting a two-dimensional attention module for the alignment of the character features and the image features. The convolutional-neural-network feature extraction and the transformer-decoder are basic modules; within the transformer-decoder, the two-dimensional attention module is the core of aligning text characters with image text regions, and the two-dimensional position coding is added specifically for the two-dimensional attention module, so that the alignment effect can be enhanced.
The text recognition method provided by this embodiment is implemented by using a text recognition model, the model architecture adopted is an encoding (encoder) -decoding (decoder) architecture, and the text recognition model includes an encoding module and a decoding module. In the encoder stage, firstly, the image characteristics of the picture to be identified are extracted through a convolutional neural network, and then two-dimensional position coding is added. And in the decoder stage, receiving the output from the encoder through a transform-decoder, and decoding to obtain an identification result by adopting a two-dimensional attention mechanism.
In order to make the technical solution of the present invention clearer, embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a text recognition method according to a first embodiment of the present invention, fig. 2 is a schematic structural diagram of a conventional transform model according to a first embodiment of the present invention, and fig. 3 is a schematic structural diagram of a text recognition model according to a first embodiment of the present invention. As shown in fig. 1, the text recognition method in this embodiment includes:
and step 10, extracting the image characteristics of the picture to be identified through a dense convolutional neural network.
Human perception of images is layered and abstract: color and brightness are understood first, then local detail features such as edges, corners and lines, then more complex information and structures such as textures and geometric shapes, and finally the concept of the whole object is formed. The study of the visual mechanism in visual neuroscience has verified that the visual cortex of the animal brain has a layered structure. A convolutional neural network can be regarded as a simulation of the human visual mechanism. It is composed of multiple convolutional layers; each convolutional layer scans the picture from left to right and from top to bottom with a convolution kernel and outputs a feature map, that is, it extracts local features of the picture. As convolutional layers are stacked, the receptive field (the size of the area of the input picture that each pixel in the feature map maps to) gradually increases, the extracted features become more abstract, and abstract representations of the image at different scales are finally obtained. Since convolutional neural networks came to prominence in the 2012 image recognition challenge, they have been continuously developed and widely used in various fields, achieve the best performance on many problems, and are now the mainstream approach for extracting image features.
In this embodiment, the image features of the picture to be recognized are extracted through a dense convolutional neural network (DenseNet), so as to obtain the image features of the picture to be recognized, that is, the feature map of the picture to be recognized.
Illustratively, before text recognition is performed, model training may be performed on a text recognition model in advance, training of the encoding module and the decoding module is realized in the model training process, a trained dense convolutional neural network for extracting image features in an encoding stage may be obtained, and a trained transform decoding layer including a two-dimensional attention mechanism is obtained.
In the step, the image characteristics of the picture to be identified are extracted through a pre-trained dense convolutional neural network.
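By way of a non-limiting sketch, the encoder-stage feature extraction might look like the following; the DenseNet-121 backbone from torchvision and the input resolution are illustrative assumptions, since the patent does not fix a particular DenseNet configuration:

```python
import torch
from torchvision.models import densenet121

# A minimal sketch of step 10, assuming a stock torchvision DenseNet-121
# stands in for the dense convolutional neural network; only its
# convolutional feature extractor is kept, so the output is a feature map
# rather than classification logits.
backbone = densenet121(weights=None).features

image = torch.randn(1, 3, 64, 256)   # picture to be identified, (N, C, H, W)
feature_map = backbone(image)        # e.g. (1, 1024, 2, 8): depth D, height H', width W'
print(feature_map.shape)
```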
Step 20, adding two-dimensional position code information to the image feature to generate an image feature including position information.
The transformer model is composed entirely of attention mechanisms. The attention mechanism was proposed by Bengio's team in 2014 and has been widely used in various fields of deep learning in recent years, for example to capture the receptive field on an image in computer vision, or to locate key tokens or features in natural language processing (NLP). The transformer discards the conventional convolutional neural network (CNN) and recurrent neural network (RNN): the entire network structure is composed of attention mechanisms alone. However, the attention mechanism itself contains no position information; that is, the transformer cannot distinguish the same word appearing at different positions in a sentence, which is certainly not practical.
In this embodiment, after the image features (that is, feature maps) of the picture to be recognized are extracted, two-dimensional position coding information is added to the image features, so that the position representation of a two-dimensional space in the image features can be enhanced, and the capability of "aligning" the image features and the character features can be further enhanced.
Step 30, decoding the image features containing the position information through a transformer decoding layer containing a two-dimensional attention mechanism to obtain a recognition result.
The entire network structure of the conventional transformer model is composed of attention mechanisms: a transformer encoding layer consists of a self-attention layer and a feedforward neural network (FNN). As shown in fig. 2, the conventional transformer encoding layer (Encoder #1 and Encoder #2 in fig. 2 denote two encoding layers) includes a self-attention layer and a feed-forward network layer (Feed Forward), while the conventional transformer decoding layer ("2X" in fig. 2 denotes a stack of two decoding layers, each shown in a dashed box of the decoding-module part) includes a self-attention layer, an encoder-decoder attention layer and a feed-forward network layer (Feed Forward). An "Add & Normal" layer follows each sub-layer (the self-attention layer, the encoder-decoder attention layer and the feed-forward network layer) in the transformer decoding layer, representing a residual connection followed by layer normalization. A trainable transformer-based neural network can be built by stacking transformer layers. The transformer overcomes two defects of the RNN: RNN-related algorithms must compute sequentially, from left to right or from right to left, which limits the parallel capability of the model, and information can be lost across long-term dependencies during sequential computation. The transformer is not an RNN-like sequential structure, so it parallelizes better, and the distance between any two positions in the sequence is reduced to a constant, which addresses the long-term dependence problem.
The encoding module in this embodiment adopts DenseNet to extract image features and adds two-dimensional position coding information. The decoding module in this embodiment is formed by stacking a plurality of transformer decoding layers, the output of a previous transformer decoding layer serving as the input of the next, and each transformer decoding layer comprises: a multi-head attention layer with a mask, a two-dimensional attention layer and a feedforward neural network layer.
For example, as shown in fig. 3, the transformer decoding module in this embodiment may be formed by stacking two or three transformer decoding layers; "3X" in fig. 3 represents a stack of 3 transformer decoding layers, "Masked Multi-Head Attention" represents a multi-head attention layer with a mask, "2D Attention" represents a two-dimensional attention layer, and "Feed Forward" represents a feedforward neural network layer.
As shown in fig. 3, an "Add & Normal" layer follows each sub-layer (the masked multi-head attention layer, the two-dimensional attention layer and the feedforward neural network layer) in the transformer decoding layer, representing the residual connection and layer normalization steps.
In addition, for model training, the transformer decoding module further comprises an embedding layer. In the training stage, the ground-truth characters are converted into vector representations in a high-dimensional space through the embedding layer and used as the input of the first transformer decoding layer.
In this embodiment, a transformer decoding module is used in the decoder stage to extract character features and "align" them with the image features extracted in the encoder stage. Unlike the conventional transformer decoding module, the encoder-decoder attention that performs the original "alignment" operation is replaced by a two-dimensional attention mechanism in this embodiment, to further enhance the alignment of the character features extracted in the decoding stage with the image features extracted in the encoding stage; the result is finally output through a feedforward neural network.
The traditional attention module used for the "alignment" operation in natural-scene text recognition needs to pool the image features extracted in the encoder stage vertically, so the spatial information in the image is lost and the two-dimensional spatial information cannot be well utilized. This embodiment uses a 2D attention mechanism designed for curved text, which can make full use of the spatial information of the image and aligns the character features with the image features in a weakly supervised manner to realize the recognition of curved text.
In the encoding stage of text recognition, the image features of the picture to be recognized are extracted through the dense convolutional neural network, so that the extracted features are more abstract and carry richer semantic information. Two-dimensional position coding information is added to the image features to generate image features containing position information; when the image features are decoded, the added two-dimensional position code locates the positions of characters in the image more accurately, so the accuracy of curved-text recognition is improved. In the decoding stage, the image features containing the position information are decoded by the transformer decoding layer containing the two-dimensional attention mechanism, which makes full use of the two-dimensional spatial information of the image; training is carried out in a weakly supervised manner, which further improves the accuracy of curved-text recognition.
Fig. 4 is a flowchart of adding a two-dimensional position code according to a second embodiment of the present invention. On the basis of the first embodiment, in this embodiment, as shown in fig. 4, step 20 of adding two-dimensional position coding information to the image features to generate image features containing position information can be implemented by the following steps 201 to 203:
step 201, generating a two-dimensional position code for each pixel in the image feature.
Specifically, this step can be implemented as follows:
determining position coding weights in the horizontal direction and the vertical direction according to the image characteristics; for any pixel in the image characteristics, respectively generating one-dimensional position codes of the pixel in the horizontal direction and the vertical direction; and according to the position coding weights in the horizontal direction and the vertical direction, carrying out weighted summation on the one-dimensional position codes of the pixel in the horizontal direction and the vertical direction to obtain the two-dimensional position code of the pixel.
Illustratively, for a pixel at a high h width w position in an image feature (i.e., a feature map), the two-dimensional position coding of the pixel may be PhwIndicating that one-dimensional position encoding of the pixel in the vertical direction can be used
Figure BDA0002447584510000081
Where the subscript h denotes the vertical direction, pos _ h is the arrangement position of the pixel in the vertical direction, and one-dimensional position coding of the pixel in the horizontal direction can be used
Figure BDA0002447584510000082
Where the subscript w represents the horizontal direction, pos _ w is the arrangement position of the pixel in the horizontal direction, and the depth of the feature map can be represented by D, then the one-dimensional position code of the pixel in the vertical direction can be calculated by using the following formula one or formula two:
Figure BDA0002447584510000083
Figure BDA0002447584510000084
where pos _ h is an arrangement position of the pixel in the vertical direction, for example, if there are 20 pixels in the vertical direction in the feature map, the value of pos _ h is [0, 19 ]];
Figure BDA0002447584510000085
Represents the 2 i-th component of the one-dimensional position-coding vector corresponding to the pos h-th pixel in the vertical direction,
Figure BDA0002447584510000086
represents a one-dimensional bit corresponding to the pos _ h pixel in the vertical directionSetting the 2i +1 th component in the coding vector, i is a non-negative integer, and subscripts 2i and 2i +1 take the values of [0, D-1 ]]Wherein D is the depth of the feature map, that is, the component value is calculated by the formula one for the even (2i) component of the one-dimensional position-coding vector corresponding to the pixel, and the component value is calculated by the formula two for the odd (2i +1) component of the one-dimensional position-coding vector corresponding to the pixel.
The reason for taking the trigonometric function is
Figure BDA0002447584510000087
And
Figure BDA0002447584510000088
the linear relationship given by k can represent the relative positional relationship of the pixel points in the vertical direction.
Similarly, the one-dimensional position code of the pixel in the horizontal direction can be calculated by the following formula three or formula four:

$$PE^{w}_{(pos\_w,\,2i)} = \sin\!\left(\frac{pos\_w}{10000^{2i/D}}\right) \qquad \text{(formula three)}$$

$$PE^{w}_{(pos\_w,\,2i+1)} = \cos\!\left(\frac{pos\_w}{10000^{2i/D}}\right) \qquad \text{(formula four)}$$

where pos_w is the arrangement position of the pixel in the horizontal direction (for example, if there are 48 pixels in the horizontal direction in the feature map, pos_w takes values in [0, 47]); $PE^{w}_{(pos\_w,\,2i)}$ denotes the 2i-th component of the one-dimensional position-coding vector corresponding to the pos_w-th pixel in the horizontal direction, and $PE^{w}_{(pos\_w,\,2i+1)}$ denotes the (2i+1)-th component of the same vector; i is a non-negative integer, the indices 2i and 2i+1 take values in [0, D-1], and D is the depth of the feature map. That is, the even (2i) components are calculated by formula three and the odd (2i+1) components by formula four.
For example, the following formula five may be used to determine the position-coding weight of the image feature in the vertical direction, and the following formula six the position-coding weight in the horizontal direction:

$$\alpha(E) = \sigma\big(W_2^{h}\,\max(0,\,W_1^{h}\,g(E))\big) \qquad \text{(formula five)}$$

$$\beta(E) = \sigma\big(W_2^{w}\,\max(0,\,W_1^{w}\,g(E))\big) \qquad \text{(formula six)}$$

where α(E) and β(E) respectively denote the position-coding weights in the vertical and horizontal directions, σ is the sigmoid function, $W_1^{h}$, $W_2^{h}$, $W_1^{w}$ and $W_2^{w}$ are learnable linear weights whose preferred values are obtained by model training, and g(E) is the result of average pooling over the whole feature map E.

Further, after the position-coding weights in the horizontal and vertical directions and the one-dimensional position codes of the pixel in the horizontal and vertical directions are determined, the following formula seven may be used to determine the two-dimensional position code of the pixel:

$$P_{hw} = \alpha(E)\,PE^{h}_{pos\_h} + \beta(E)\,PE^{w}_{pos\_w} \qquad \text{(formula seven)}$$

where $P_{hw}$ denotes the two-dimensional position code of the pixel, $PE^{h}_{pos\_h}$ denotes its one-dimensional position code in the vertical direction, and $PE^{w}_{pos\_w}$ denotes its one-dimensional position code in the horizontal direction.
Step 202, a position encoding tensor of the image feature is generated.
The two-dimensional position code of each pixel is a vector. After the two-dimensional position-code vectors of all pixels in the image feature are obtained, they are spliced into a tensor, in which the position of each pixel's two-dimensional position-code vector corresponds to the position of that pixel point in the image.
Step 203, adding the position coding tensor of the image feature and the image feature to obtain the image feature containing the position information.
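The following is a minimal sketch of steps 201 to 203, written against the reconstruction in formulas one to seven; the hidden width of the weighting network and the assumption of an even feature depth D are illustrative choices not fixed by this description:

```python
import torch
import torch.nn as nn

def sinusoid_table(length: int, depth: int) -> torch.Tensor:
    # One-dimensional sinusoidal position codes (formulas one to four);
    # assumes an even depth D. Rows are positions, columns are components.
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)   # (length, 1)
    i = torch.arange(0, depth, 2, dtype=torch.float32)             # even indices 2i
    angle = pos / torch.pow(10000.0, i / depth)                    # (length, depth/2)
    table = torch.zeros(length, depth)
    table[:, 0::2] = torch.sin(angle)   # even components: formula one / three
    table[:, 1::2] = torch.cos(angle)   # odd components: formula two / four
    return table

class PositionEncoding2D(nn.Module):
    # Sketch of steps 201 to 203: weight the vertical and horizontal
    # sinusoidal codes by alpha(E) and beta(E) computed from the
    # average-pooled feature map (formulas five to seven), splice the
    # per-pixel codes into a tensor, and add it to the image features.
    def __init__(self, depth: int, hidden: int = 256):  # `hidden` is an assumption
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # g(E)
        self.mlp_h = nn.Sequential(nn.Linear(depth, hidden), nn.ReLU(),
                                   nn.Linear(hidden, depth), nn.Sigmoid())  # alpha(E)
        self.mlp_w = nn.Sequential(nn.Linear(depth, hidden), nn.ReLU(),
                                   nn.Linear(hidden, depth), nn.Sigmoid())  # beta(E)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        n, d, h, w = feat.shape                       # feat: (N, D, H, W)
        g = self.pool(feat).flatten(1)                # (N, D)
        alpha = self.mlp_h(g).view(n, d, 1, 1)        # vertical weight
        beta = self.mlp_w(g).view(n, d, 1, 1)         # horizontal weight
        pe_h = sinusoid_table(h, d).t().unsqueeze(0).unsqueeze(-1).to(feat.device)
        pe_w = sinusoid_table(w, d).t().unsqueeze(0).unsqueeze(2).to(feat.device)
        p = alpha * pe_h + beta * pe_w                # formula seven, per pixel
        return feat + p                               # step 203
```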
In this embodiment, a two-dimensional position code is added to the image feature in the encoding stage for the two-dimensional attention module in the decoding stage, and the position representation of the two-dimensional space in the image feature is enhanced, so that the two-dimensional attention module can exert a better effect, thereby enhancing the "alignment" capability of text characters and image text regions.
On the basis of the first embodiment or the second embodiment, in this embodiment, as shown in fig. 3, the text recognition model includes at least one transformer decoding layer containing a two-dimensional attention mechanism, and each transformer decoding layer includes: a multi-head attention layer with a mask, a two-dimensional attention layer and a feedforward neural network layer.
The processing of each transformer decoding layer comprises the following steps: processing the input character features through the multi-head attention layer with a mask to obtain first character features; determining a two-dimensional attention vector through the two-dimensional attention layer according to the image features containing the position information and the first character features, and adding the two-dimensional attention vector to the first character features to obtain second character features; and inputting the second character features into the feedforward neural network layer.
Further, the first character feature may include one or more feature vectors, and if the first character feature includes a plurality of feature vectors, the two-dimensional attention layer determines, according to the image feature including the position information and the first character feature, a two-dimensional attention vector corresponding to each feature vector of the first character feature, and then adds the corresponding two-dimensional attention vector to each feature vector of the first character feature to obtain a second character feature.
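As an illustrative sketch only, one such decoding layer might be organized as follows; the two-dimensional attention sub-layer is assumed to return one attention vector per character feature vector, with the same width as the character features (a sketch of such a module is given in the third embodiment below), and all layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class DecoderLayer2D(nn.Module):
    # Sketch of one decoding layer of fig. 3: a masked multi-head attention
    # layer, a two-dimensional attention layer and a feedforward neural
    # network layer, each followed by the "Add & Normal" step (residual
    # connection plus layer normalization).
    def __init__(self, d_model: int, n_heads: int, d_ff: int, attn_2d: nn.Module):
        super().__init__()
        self.masked_mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_2d = attn_2d   # assumed: (chars, image_feats) -> (N, T, d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, chars: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # chars: (N, T, d_model); image_feats: (N, D, H, W), position info added
        t = chars.size(1)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=chars.device), 1)
        x, _ = self.masked_mha(chars, chars, chars, attn_mask=causal)
        first = self.norm1(chars + x)                      # first character features
        second = self.norm2(first + self.attn_2d(first, image_feats))
        return self.norm3(second + self.ffn(second))       # input of the next layer
```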
Fig. 5 is a flowchart of two-dimensional attention vector determination provided in a third embodiment of the present invention, and fig. 6 is a schematic diagram of the two-dimensional attention vector determination process provided in the third embodiment of the present invention. As shown in fig. 5 and fig. 6, the two-dimensional attention layer determines the two-dimensional attention vector according to the image features containing the position information and the first character features, which may specifically be implemented by the following steps 301 to 303:
step 301, performing a first convolution process on the image features including the position information to obtain a first tensor of H × W × d, where H, W and d respectively represent the height, width and depth of the first tensor.
Here, the first convolution processing refers to a 3 × 3 convolution of the input image features.
Step 302, the first character features comprise at least one feature vector; a weight value of each feature vector of the first character features with respect to the image features containing the position information is determined according to the first tensor, where the weight value of each feature vector with respect to the image features containing the position information is the weight value of each pixel point of the image features containing the position information.
Specifically, the first character features may include one or more feature vectors, and the weight value of each feature vector with respect to the image features containing the position information may be determined according to the first tensor.
In this embodiment, as shown in fig. 6, for each feature vector of the first character features, determining the weight value of the feature vector with respect to the image features containing the position information may specifically be implemented as follows:
the feature vector is subjected to second convolution processing (convolution through 1 × 1 as shown in fig. 6) to obtain a second tensor of 1 × 1 × 0d, the height and the width of the second tensor are expanded to obtain a third tensor of H × 1W × d, the third tensor is consistent with the height, the width and the depth of the first tensor, the third tensor is added with the first tensor and is processed by an activation function (exemplarily illustrated by adopting a tanh function in fig. 6) to obtain a fourth tensor of H × W × d, the fourth tensor is subjected to third convolution processing (convolution through 1 × 1 as shown in fig. 6) to obtain a fifth tensor of H × W × 1 (not shown in fig. 6), and the fifth tensor is subjected to two-dimensional softmax processing to obtain a weight value of the feature vector about the image feature containing the position information, wherein the weight value of the feature vector is the tensor of H × W × 1.
The activation function may be a hyperbolic tangent function (tanh function), a sigmoid function, or other similar activation functions, and this embodiment is not limited in particular.
Step 303, carrying out weighted summation on the image features containing the position information according to the weight value of each pixel point to obtain the two-dimensional attention vector corresponding to each feature vector of the first character features.
There may be one or more feature vectors of the first character feature, and generally, the number of feature vectors in the first character feature is consistent with the number of characters.
In this step, for each feature vector of the first character feature, a weight value of the feature vector with respect to the image feature including the position information is taken as a weight value of each pixel point of the image feature including the position information, and the image features including the position information are weighted and summed according to the weight value of each pixel point to obtain a two-dimensional attention vector corresponding to the feature vector.
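A minimal sketch of steps 301 to 303 follows, assuming tanh as the activation function (as in fig. 6); the intermediate depth d and all names are illustrative, and when this module is used inside the decoding layer sketched above, the image feature depth is assumed equal to the character feature width:

```python
import torch
import torch.nn as nn

class TwoDAttention(nn.Module):
    # Sketch of steps 301 to 303: a 3x3 convolution over the position-encoded
    # image features gives the first tensor; each character feature vector is
    # projected (the 1x1 "second convolution"), broadcast over H x W (third
    # tensor), passed through tanh (fourth tensor) and a 1x1 convolution down
    # to one channel (fifth tensor); a softmax over the H x W plane yields the
    # per-pixel weights, and the weighted sum of the image features gives the
    # two-dimensional attention vector for that character feature vector.
    def __init__(self, d_image: int, d_char: int, d: int):
        super().__init__()
        self.conv1 = nn.Conv2d(d_image, d, kernel_size=3, padding=1)  # first convolution
        self.proj = nn.Linear(d_char, d)                              # second convolution (1x1)
        self.conv3 = nn.Conv2d(d, 1, kernel_size=1)                   # third convolution

    def forward(self, chars: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # chars: (N, T, d_char); image_feats: (N, d_image, H, W)
        n, _, h, w = image_feats.shape
        first = self.conv1(image_feats)                               # (N, d, H, W)
        vectors = []
        for t in range(chars.size(1)):                                # one weight map per character
            second = self.proj(chars[:, t])                           # (N, d)
            third = second.view(n, -1, 1, 1).expand_as(first)         # broadcast over H x W
            fourth = torch.tanh(first + third)
            fifth = self.conv3(fourth)                                # (N, 1, H, W)
            weights = torch.softmax(fifth.flatten(2), -1).view(n, 1, h, w)  # 2D softmax
            vectors.append((weights * image_feats).sum(dim=(2, 3)))   # weighted sum, (N, d_image)
        return torch.stack(vectors, dim=1)                            # (N, T, d_image)
```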
This embodiment provides a specific implementation for determining the two-dimensional attention vector. The 2D attention mechanism for curved text used in this embodiment can make full use of the spatial information of the image and "aligns" the character features with the image features in a weakly supervised manner, further enhancing the alignment between the character features extracted in the decoding stage and the image features extracted in the encoding stage, and further improving the accuracy of curved-text recognition.
Fig. 7 is a flowchart of a text recognition model training method according to a fourth embodiment of the present invention. This embodiment provides a training method for a text recognition model, wherein the text recognition model comprises: an encoding module and a decoding module. The encoding module is configured to: extract image features of the picture to be identified through a dense convolutional neural network, add two-dimensional position coding information to the image features, and generate image features containing position information. The decoding module comprises a transformer decoding layer containing a two-dimensional attention mechanism, which is used for decoding the image features containing the position information to obtain the recognition result.
The text recognition model provided in this embodiment is used to implement the text recognition method provided in any of the above embodiments, and specific implementation manners of the text recognition method are described in detail in the above embodiments, which are not described herein again.
In this embodiment, before the text recognition method is executed, the text recognition model may be trained in advance, as shown in fig. 7, the text recognition model training method specifically includes the following steps:
Step 40, obtaining a training set for natural scene text recognition, wherein the training set comprises at least a plurality of pieces of curved-text training data, and each piece of curved-text training data comprises: a sample picture containing curved text and the corresponding text labeling information.
Illustratively, in this embodiment the training set includes sample pictures in natural scenes that are as rich as possible, together with their text labeling information.
Step 50, training the text recognition model through the training set.
The encoding module in this embodiment adopts DenseNet to extract image features and adds two-dimensional position coding information. The decoding module in this embodiment is formed by stacking a plurality of transformer decoding layers, the output of a previous transformer decoding layer serving as the input of the next, and each transformer decoding layer comprises: a multi-head attention layer with a mask, a two-dimensional attention layer and a feedforward neural network layer.
For example, as shown in fig. 3, the transformer decoding module in this embodiment may be formed by stacking two or three transformer decoding layers; "3X" in fig. 3 represents a stack of 3 transformer decoding layers, "Masked Multi-Head Attention" represents a multi-head attention layer with a mask, "2D Attention" represents a two-dimensional attention layer, and "Feed Forward" represents a feedforward neural network layer.
As shown in fig. 3, an "Add & Normal" layer follows each sub-layer (the masked multi-head attention layer, the two-dimensional attention layer and the feedforward neural network layer) in the transformer decoding layer, representing the residual connection and layer normalization steps.
In addition, the transformer decoding module further includes an embedding layer.
In the training stage, the ground-truth characters are converted into vector representations in a high-dimensional space through the embedding layer and used as the input of the first transformer decoding layer.
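For illustration only (the vocabulary size and embedding width are assumptions, not values fixed by the patent), the embedding step might look like this:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 6000, 512              # illustrative assumptions
embedding = nn.Embedding(vocab_size, d_model)

target_ids = torch.randint(0, vocab_size, (1, 12))  # ground-truth character ids
char_vectors = embedding(target_ids)                # (1, 12, d_model): input of layer 1
```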
The text recognition model provided in this embodiment is essentially a classification model, and the final output of the decoder is a probability tensor. Therefore, in this embodiment a multi-class cross-entropy loss function may be used to compute the model loss: $H(p, q) = -\sum_x p(x)\log q(x)$, where p(x) is 1 for the correct answer term and 0 otherwise, and q(x) denotes the predicted probability of the correct answer term.
In addition, in this embodiment Adam is finally used as the optimization method. Adam is a first-order optimization algorithm that can replace the conventional stochastic gradient descent (SGD) process and can iteratively update the neural network weights based on training data. The Adam algorithm differs from traditional stochastic gradient descent: stochastic gradient descent keeps a single learning rate (i.e., alpha) for updating all weights, and the learning rate does not change during training, whereas Adam designs independent adaptive learning rates for different parameters by computing first-order and second-order moment estimates of the gradient. Meanwhile, the Adam algorithm is easy to implement, computationally efficient, and has low memory requirements. Therefore, this embodiment preferably employs the Adam algorithm as the optimizer.
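A sketch of how the loss and optimizer described above might be wired together is given below; `model` is assumed to map a picture and the shifted ground-truth ids to per-step character logits, and `train_loader` and the padding id are likewise assumptions:

```python
import torch
import torch.nn as nn

def train_epoch(model: nn.Module, train_loader, pad_id: int) -> None:
    # Multi-class cross entropy H(p, q) = -sum p(x) log q(x); padding
    # positions are excluded from the loss.
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)
    # Adam: independent adaptive learning rates from first- and
    # second-order moment estimates of the gradient.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for images, targets in train_loader:
        logits = model(images, targets[:, :-1])   # teacher forcing with ground truth
        loss = criterion(logits.reshape(-1, logits.size(-1)),
                         targets[:, 1:].reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```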
Fig. 8 is a schematic structural diagram of a text recognition system according to a fifth embodiment of the present invention, and as shown in fig. 8, the text recognition system in the present embodiment includes: an encoding module 801 and a decoding module 802.
Specifically, the encoding module 801 is configured to extract image features of a picture to be identified through a dense convolutional neural network, add two-dimensional position encoding information to the image features, and generate image features including position information.
The decoding module 802 is configured to perform decoding processing on the image features including the position information through a transform decoding layer including a two-dimensional attention mechanism, so as to obtain an identification result.
The above functional modules are respectively used to perform the corresponding operational functions of the method embodiments of the present invention and achieve similar functional effects; detailed descriptions are omitted here.
On the basis of the fifth embodiment, in this embodiment, the encoding module 801 is further configured to: generating a two-dimensional position code for each pixel in the image feature and generating a position code tensor for the image feature; and adding the position coding tensor of the image features and the image features to obtain the image features containing the position information.
Optionally, the encoding module 801 is further configured to: determining position coding weights in the horizontal direction and the vertical direction according to the image characteristics; for any pixel in the image characteristics, respectively generating one-dimensional position codes of the pixel in the horizontal direction and the vertical direction; and according to the position coding weights in the horizontal direction and the vertical direction, carrying out weighted summation on the one-dimensional position codes of the pixel in the horizontal direction and the vertical direction to obtain the two-dimensional position code of the pixel.
In this embodiment, the text recognition model includes at least one transformer decoding layer containing a two-dimensional attention mechanism, and each transformer decoding layer includes: a multi-head attention layer with a mask, a two-dimensional attention layer and a feedforward neural network layer.
Specifically, the decoding module 802 is further configured to:
processing the input character features through a multi-head attention layer with a mask to obtain first character features; determining a two-dimensional attention vector through a two-dimensional attention layer according to the image feature containing the position information and the first character feature, and adding the two-dimensional attention vector to the first character feature to obtain a second character feature; the second character features are input into a feed-forward neural network layer.
Optionally, the decoding module 802 is further configured to: perform first convolution processing on the image features containing the position information to obtain a first tensor of H × W × d, where H, W and d respectively represent the height, width and depth of the first tensor; the first character features comprise at least one feature vector, and a weight value of each feature vector of the first character features with respect to the image features containing the position information is determined according to the first tensor, this weight value being the weight value of each pixel point of the image features containing the position information; and carry out weighted summation on the image features containing the position information according to the weight value of each pixel point to obtain the two-dimensional attention vector corresponding to each feature vector of the first character features.
Optionally, the decoding module 802 is further configured to:
perform second convolution processing on the feature vector to obtain a second tensor of 1 × 1 × d; expand the height and width of the second tensor to obtain a third tensor of H × W × d, consistent in height, width and depth with the first tensor; add the third tensor to the first tensor and process the sum with an activation function to obtain a fourth tensor of H × W × d; perform third convolution processing on the fourth tensor to obtain a fifth tensor of H × W × 1; and perform two-dimensional softmax processing on the fifth tensor to obtain the weight value of the feature vector with respect to the image features containing the position information.
The above functional modules are respectively used to perform the operational functions corresponding to the second and third embodiments of the method of the present invention and achieve similar functional effects; detailed descriptions are omitted here.
Fig. 9 is a schematic structural diagram of a text recognition system according to a sixth embodiment of the present invention. On the basis of the fifth embodiment, in another embodiment of the present invention, as shown in fig. 9, the text recognition system may further include a model training module 803. The text recognition model includes: an encoding module 801 and a decoding module 802. The encoding module 801 is configured to: extract image features of the picture to be identified through a dense convolutional neural network, add two-dimensional position coding information to the image features, and generate image features containing position information. The decoding module 802 includes a transformer decoding layer containing a two-dimensional attention mechanism, which is used for decoding the image features containing the position information to obtain a recognition result. The model training module 803 is configured to: acquire a training set for natural scene text recognition, wherein the training set comprises at least a plurality of pieces of curved-text training data, and each piece of curved-text training data comprises a sample picture containing curved text and the corresponding text labeling information; and train the text recognition model through the training set.
In addition, in another embodiment of the present invention, the model training module may be implemented as a separate system.
The encoding module 801 and the decoding module 802 are configured to perform the operational functions corresponding to any one of the first to third embodiments of the method of the present invention, and the model training module 803 is configured to perform the operational functions corresponding to the fourth embodiment of the method of the present invention; similar functional effects are achieved, and detailed descriptions are omitted here.
Fig. 10 is a schematic structural diagram of a text recognition apparatus according to a seventh embodiment of the present invention. As shown in fig. 10, the apparatus 100 includes: a processor 1001, a memory 1002, and computer programs stored on the memory 1002 and executable on the processor 1001.
When the processor 1001 runs the computer program, the text recognition method and/or the text recognition model training method provided by any one of the above method embodiments are implemented.
An embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes: ROM/RAM, magnetic disks, optical disks, etc., and the computer-readable storage medium stores a computer program that can be executed by a hardware device such as a terminal device, a computer, or a server to execute the text recognition method and/or the text recognition model training method.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (11)

1. A text recognition method, comprising:
extracting image features of the picture to be identified through a dense convolutional neural network;
adding two-dimensional position coding information to the image features to generate image features containing position information;
and decoding the image features containing the position information through a transformer decoding layer containing a two-dimensional attention mechanism to obtain a recognition result.
2. The method according to claim 1, wherein the adding two-dimensional position coding information to the image features to generate the image features containing the position information comprises:
generating a two-dimensional position code for each pixel in the image feature and generating a position code tensor for the image feature;
and adding the position coding tensor of the image features and the image features to obtain the image features containing the position information.
3. The method of claim 2, wherein generating a two-dimensional position code for each pixel in the image feature comprises:
determining position coding weights in the horizontal direction and the vertical direction according to the image features;
for any pixel in the image characteristics, respectively generating one-dimensional position codes of the pixel in the horizontal direction and the vertical direction;
and according to the position coding weights in the horizontal direction and the vertical direction, carrying out weighted summation on the one-dimensional position codes of the pixel in the horizontal direction and the vertical direction to obtain the two-dimensional position code of the pixel.
4. The method according to any one of claims 1 to 3, wherein the method comprises at least one said transformer decoding layer containing a two-dimensional attention mechanism, each said transformer decoding layer comprising: a multi-headed attention layer with a mask, a two-dimensional attention layer, and a feedforward neural network layer.
5. The method of claim 4, wherein the decoding the image features containing the position information through a transformer decoding layer containing a two-dimensional attention mechanism to obtain a recognition result comprises:
processing the input character features through a multi-head attention layer with a mask to obtain first character features;
determining a two-dimensional attention vector through a two-dimensional attention layer according to the image feature containing the position information and the first character feature, and adding the two-dimensional attention vector to the first character feature to obtain a second character feature;
inputting the second character feature into the feed-forward neural network layer.
6. The method of claim 5, wherein determining a two-dimensional attention vector from the image feature containing location information and the first character feature by a two-dimensional attention layer comprises:
performing first convolution processing on the image features containing the position information to obtain a first tensor of H × W × d, wherein H, W and d respectively represent the height, width and depth of the first tensor;
the first character features comprise at least one feature vector, and a weight value of each feature vector of the first character features with respect to the image features containing the position information is determined according to the first tensor, the weight value of each feature vector of the first character features with respect to the image features containing the position information being the weight value of each pixel point of the image features containing the position information;
and weighting and summing the image features containing the position information according to the weight value of each pixel point to obtain the two-dimensional attention vector corresponding to each feature vector of the first character features.
7. The method according to claim 6, wherein determining, according to the first tensor, the weight value of any one feature vector of the first character features with respect to the image features containing the position information comprises:
performing second convolution processing on the feature vector to obtain a second tensor of 1 × 1 × d;
expanding the height and width of the second tensor to obtain a third tensor of H × W × d, the height, width and depth of the third tensor being consistent with those of the first tensor;
adding the third tensor to the first tensor, and applying an activation function, to obtain a fourth tensor of H × W × d;
performing third convolution processing on the fourth tensor to obtain a fifth tensor of H × W × 1;
and performing two-dimensional softmax processing on the fifth tensor to obtain the weight value of the feature vector with respect to the image features containing the position information.
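A sketch of the per-feature-vector weight computation of claims 6-7. The claims fix the tensor shapes (H × W × d, 1 × 1 × d, H × W × 1) but not the kernel sizes or the activation; the 3×3/1×1 kernels and tanh below are assumptions of this sketch.

```python
# Sketch of the 2D attention layer: three convolutions, broadcast addition,
# activation, a softmax over all H*W pixels, and a per-pixel weighted sum.
import torch
import torch.nn as nn

class TwoDAttention(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.conv1 = nn.Conv2d(d_model, d_model, 3, padding=1)  # first convolution
        self.conv2 = nn.Conv2d(d_model, d_model, 1)             # second convolution
        self.conv3 = nn.Conv2d(d_model, 1, 1)                   # third convolution

    def forward(self, feat, chars):       # feat: B x d x H x W, chars: B x T x d
        b, d, h, w = feat.shape
        t = chars.size(1)
        first = self.conv1(feat)                              # first tensor
        # Second tensor: a 1 x 1 x d map for each of the T feature vectors.
        second = self.conv2(chars.reshape(b * t, d, 1, 1))
        third = second.expand(b * t, d, h, w)                 # third tensor
        # Pair every feature vector with the shared first tensor, add, activate.
        first = first.unsqueeze(1).expand(b, t, d, h, w).reshape(b * t, d, h, w)
        fourth = torch.tanh(first + third)                    # fourth tensor
        fifth = self.conv3(fourth).view(b, t, h * w)          # fifth tensor
        weights = fifth.softmax(dim=-1)    # 2D softmax -> weight per pixel point
        # Weighted sum over the pixels of the position-aware image features gives
        # one 2D attention vector per feature vector of the first character features.
        return torch.bmm(weights, feat.view(b, d, h * w).transpose(1, 2))
```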
8. A method for training a text recognition model, wherein the text recognition model comprises: an encoding module and a decoding module; the encoding module is configured to: extract image features of a picture to be recognized through a dense convolutional neural network, add two-dimensional position coding information to the image features, and generate image features containing position information; the decoding module comprises a transformer decoding layer containing a two-dimensional attention mechanism, and the transformer decoding layer containing the two-dimensional attention mechanism is used for decoding the image features containing the position information to obtain a recognition result;
the method comprises the following steps:
acquiring a training set for natural scene text recognition, wherein the training set comprises at least a plurality of pieces of curved text training data, and each piece of curved text training data comprises: a sample picture containing curved text and corresponding text annotation information;
and training the text recognition model through the training set.
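A minimal training-loop sketch for claim 8, assuming teacher forcing with a per-character cross-entropy loss and an Adam optimizer; the claim itself only requires a training set that includes curved-text sample pictures with text annotations.

```python
# Sketch of training the text recognition model on the acquired training set.
import torch
import torch.nn as nn

def train(model, loader, epochs=10, lr=1e-4, pad_id=0):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss(ignore_index=pad_id)
    model.train()
    for _ in range(epochs):
        for image, target in loader:     # sample picture + annotated char ids
            logits = model(image, target[:, :-1])     # shifted-right input
            loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                           target[:, 1:].reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
```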
9. A text recognition system, comprising:
an encoding module, configured to extract image features of a picture to be recognized through a dense convolutional neural network, add two-dimensional position coding information to the image features, and generate image features containing position information;
and a decoding module, configured to decode the image features containing the position information through a transformer decoding layer containing a two-dimensional attention mechanism to obtain a recognition result.
10. A text recognition apparatus, comprising:
a processor, a memory, and a computer program stored on the memory and executable on the processor;
wherein the processor, when running the computer program, implements the method of any one of claims 1 to 8.
11. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed, performs the method according to any one of claims 1 to 8.
CN202010270210.8A 2020-04-13 2020-04-13 Text recognition and model training method, system, equipment and readable storage medium Pending CN111507328A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010270210.8A CN111507328A (en) 2020-04-13 2020-04-13 Text recognition and model training method, system, equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN111507328A true CN111507328A (en) 2020-08-07

Family

ID=71875960

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010270210.8A Pending CN111507328A (en) 2020-04-13 2020-04-13 Text recognition and model training method, system, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111507328A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180005082A1 (en) * 2016-04-11 2018-01-04 A2Ia S.A.S. Systems and methods for recognizing characters in digitized documents
CN109389091A (en) * 2018-10-22 2019-02-26 重庆邮电大学 The character identification system and method combined based on neural network and attention mechanism
CN109492232A (en) * 2018-10-22 2019-03-19 内蒙古工业大学 A kind of illiteracy Chinese machine translation method of the enhancing semantic feature information based on Transformer
CN109543667A (en) * 2018-11-14 2019-03-29 北京工业大学 A kind of text recognition method based on attention mechanism
CN109726657A (en) * 2018-12-21 2019-05-07 万达信息股份有限公司 A kind of deep learning scene text recognition sequence method
CN109783827A (en) * 2019-01-31 2019-05-21 沈阳雅译网络技术有限公司 A kind of deep layer nerve machine translation method based on dynamic linear polymerization
CN109948604A (en) * 2019-02-01 2019-06-28 北京捷通华声科技股份有限公司 Recognition methods, device, electronic equipment and the storage medium of irregular alignment text
CN110378334A (en) * 2019-06-14 2019-10-25 华南理工大学 A kind of natural scene text recognition method based on two dimensional character attention mechanism
CN110598690A (en) * 2019-08-01 2019-12-20 达而观信息科技(上海)有限公司 End-to-end optical character detection and identification method and system
CN110765966A (en) * 2019-10-30 2020-02-07 哈尔滨工业大学 One-stage automatic recognition and translation method for handwritten characters

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HUI LI et al.: "Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition", Computer Vision and Pattern Recognition *
JUNYEOP LEE et al.: "On Recognizing Texts of Arbitrary Shapes with 2D Self-Attention", Computer Vision and Pattern Recognition *
XU QINGQUAN: "Research on Chinese Recognition Algorithms Based on an Attention Mechanism", China Master's Theses Full-text Database, Information Science and Technology *
YANG ZHICHENG: "An Irregular Scene Text Recognition Algorithm Based on a 2D Attention Mechanism", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021147569A1 (en) * 2020-08-27 2021-07-29 平安科技(深圳)有限公司 Neural network-based character recognition method and apparatus, and readable storage medium
CN112036292A (en) * 2020-08-27 2020-12-04 平安科技(深圳)有限公司 Character recognition method and device based on neural network and readable storage medium
CN112036292B (en) * 2020-08-27 2024-06-04 平安科技(深圳)有限公司 Word recognition method and device based on neural network and readable storage medium
CN114254071A (en) * 2020-09-23 2022-03-29 Sap欧洲公司 Querying semantic data from unstructured documents
WO2022068426A1 (en) * 2020-09-30 2022-04-07 京东方科技集团股份有限公司 Text recognition method and text recognition system
CN112560652A (en) * 2020-12-09 2021-03-26 第四范式(北京)技术有限公司 Text recognition method and system and text recognition model training method and system
CN112560652B (en) * 2020-12-09 2024-03-05 第四范式(北京)技术有限公司 Text recognition method and system and text recognition model training method and system
CN112489740A (en) * 2020-12-17 2021-03-12 北京惠及智医科技有限公司 Medical record detection method, training method of related model, related equipment and device
CN112686263B (en) * 2020-12-29 2024-04-16 科大讯飞股份有限公司 Character recognition method, character recognition device, electronic equipment and storage medium
CN112686263A (en) * 2020-12-29 2021-04-20 科大讯飞股份有限公司 Character recognition method and device, electronic equipment and storage medium
CN112926684B (en) * 2021-03-29 2022-11-29 中国科学院合肥物质科学研究院 Character recognition method based on semi-supervised learning
CN112926684A (en) * 2021-03-29 2021-06-08 中国科学院合肥物质科学研究院 Character recognition method based on semi-supervised learning
CN113221879A (en) * 2021-04-30 2021-08-06 北京爱咔咔信息技术有限公司 Text recognition and model training method, device, equipment and storage medium
CN113255645A (en) * 2021-05-21 2021-08-13 北京有竹居网络技术有限公司 Method, device and equipment for decoding text line picture
CN113255645B (en) * 2021-05-21 2024-04-23 北京有竹居网络技术有限公司 Text line picture decoding method, device and equipment
CN113536785A (en) * 2021-06-15 2021-10-22 合肥讯飞数码科技有限公司 Text recommendation method, intelligent terminal and computer readable storage medium
CN113361522B (en) * 2021-06-23 2022-05-17 北京百度网讯科技有限公司 Method and device for determining character sequence and electronic equipment
CN113361522A (en) * 2021-06-23 2021-09-07 北京百度网讯科技有限公司 Method and device for determining character sequence and electronic equipment
CN113343903A (en) * 2021-06-28 2021-09-03 成都恒创新星科技有限公司 License plate recognition method and system in natural scene
CN113343903B (en) * 2021-06-28 2024-03-26 成都恒创新星科技有限公司 License plate recognition method and system in natural scene
CN113283427A (en) * 2021-07-20 2021-08-20 北京世纪好未来教育科技有限公司 Text recognition method, device, equipment and medium
CN113283427B (en) * 2021-07-20 2021-10-01 北京世纪好未来教育科技有限公司 Text recognition method, device, equipment and medium
CN113793403B (en) * 2021-08-19 2023-09-22 西南科技大学 Text image synthesizing method for simulating painting process
CN113793403A (en) * 2021-08-19 2021-12-14 西南科技大学 Text image synthesis method for simulating drawing process
CN114387431A (en) * 2022-01-12 2022-04-22 杭州电子科技大学 Multi-line character paper form OCR method based on semantic analysis
CN114445808A (en) * 2022-01-21 2022-05-06 上海易康源医疗健康科技有限公司 Swin transform-based handwritten character recognition method and system
CN114462580A (en) * 2022-02-10 2022-05-10 腾讯科技(深圳)有限公司 Training method of text recognition model, text recognition method, device and equipment
CN114693814A (en) * 2022-03-31 2022-07-01 北京字节跳动网络技术有限公司 Model decoding method, text recognition method, device, medium and equipment
CN114693814B (en) * 2022-03-31 2024-04-30 北京字节跳动网络技术有限公司 Decoding method, text recognition method, device, medium and equipment for model
CN114973224A (en) * 2022-04-12 2022-08-30 北京百度网讯科技有限公司 Character recognition method and device, electronic equipment and storage medium
CN116030471A (en) * 2022-12-29 2023-04-28 北京百度网讯科技有限公司 Text recognition method, training method, device and equipment for text recognition model
CN117710986A (en) * 2024-02-01 2024-03-15 长威信息科技发展股份有限公司 Method and system for identifying interactive enhanced image text based on mask
CN117710986B (en) * 2024-02-01 2024-04-30 长威信息科技发展股份有限公司 Method and system for identifying interactive enhanced image text based on mask

Similar Documents

Publication Publication Date Title
CN111507328A (en) Text recognition and model training method, system, equipment and readable storage medium
US11881038B2 (en) Multi-directional scene text recognition method and system based on multi-element attention mechanism
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN112766158B (en) Multi-task cascading type face shielding expression recognition method
CN113343707B (en) Scene text recognition method based on robustness characterization learning
CN111160343B (en) Off-line mathematical formula symbol identification method based on Self-Attention
CN113591546B (en) Semantic enhancement type scene text recognition method and device
CN111783705B (en) Character recognition method and system based on attention mechanism
US20190180154A1 (en) Text recognition using artificial intelligence
CN113158862B (en) Multitasking-based lightweight real-time face detection method
He et al. Visual semantics allow for textual reasoning better in scene text recognition
CN114973222B (en) Scene text recognition method based on explicit supervision attention mechanism
CN113065550B (en) Text recognition method based on self-attention mechanism
CN112818850B (en) Cross-posture face recognition method and system based on progressive neural network and attention mechanism
CN113159023A (en) Scene text recognition method based on explicit supervision mechanism
CN113221879A (en) Text recognition and model training method, device, equipment and storage medium
CN110969089A (en) Lightweight face recognition system and recognition method under noise environment
CN115761757A (en) Multi-mode text page classification method based on decoupling feature guidance
CN114387641A (en) False video detection method and system based on multi-scale convolutional network and ViT
CN113935899B (en) Ship board image super-resolution method based on semantic information and gradient supervision
CN113313127A (en) Text image recognition method and device, computer equipment and storage medium
CN115527064A (en) Toxic mushroom fine-grained image classification method based on multi-stage ViT and contrast learning
Viswanathan et al. Text to image translation using generative adversarial networks
CN115797949A (en) Optimized scene text recognition system and method
CN116702876B (en) Image countermeasure defense method based on preprocessing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination