CN112633422B - Training method of text recognition model, text recognition method, device and equipment


Info

Publication number
CN112633422B
Authority
CN
China
Prior art keywords
neural network, text, model, convolutional neural, image data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110258666.7A
Other languages
Chinese (zh)
Other versions
CN112633422A (en)
Inventor
李自荐
秦勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yizhen Xuesi Education Technology Co Ltd
Original Assignee
Beijing Yizhen Xuesi Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yizhen Xuesi Education Technology Co Ltd
Priority to CN202110258666.7A
Publication of CN112633422A
Application granted
Publication of CN112633422B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G06F18/214 Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Pattern recognition: classification techniques
    • G06N3/044 Neural networks: recurrent networks, e.g. Hopfield networks
    • G06N3/045 Neural networks: combinations of networks
    • G06N3/08 Neural networks: learning methods
    • G06V30/40 Character recognition: document-oriented image-based pattern recognition

Abstract

The invention provides a training method for a text recognition model, together with a text recognition method, a device, and equipment. The training method comprises: constructing an initial model; training the initial model to convergence, with the output of first text image data through the recurrent neural network serving as the character-string input of the word embedding module and a first feature map, obtained after the first text image data passes through the second convolutional neural network of the second part, serving as the other input of the word embedding module; and obtaining a text recognition model based on the converged initial model. The initial model includes a first part for recognizing the text content of an image, the first part having a first convolutional neural network and a recurrent neural network, and a second part for judging whether a given text is in a given image, the second part having a second convolutional neural network and a word embedding module. The device is used for executing the method. The training method yields a text recognition model that recognizes quickly while maintaining high recognition accuracy.

Description

Training method of text recognition model, text recognition method, device and equipment
Technical Field
The invention relates to text recognition technology, and in particular to a training method for a text recognition model, a text recognition method, a device, and equipment.
Background
Text detection and recognition have a wide range of applications and are preliminary steps of many computer vision tasks, such as image search, identity authentication, and visual navigation. The main purpose of text detection is to locate text lines or characters in an image, while text recognition transcribes an image containing a text line into a character string (that is, it recognizes the content of the string). Accurate location and accurate recognition of text are both very important and challenging: compared with general object detection and recognition, characters vary in orientation, have irregular shapes and extreme aspect ratios, and differ widely in font, color, and background, so algorithms that succeed at general object detection and recognition cannot be transferred directly to character detection.
The recognition performance of existing text recognition models and methods is affected by many factors; it is difficult to achieve both high recognition speed and high recognition accuracy, which cannot meet the needs of rapidly developing computer vision tasks.
Disclosure of Invention
In order to solve at least one of the above technical problems, the present invention provides a training method for a text recognition model, a text recognition method, a device, and an apparatus.
The technical scheme of the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides a method for training a text recognition model, including:
constructing an initial model, the initial model comprising:
a first portion for identifying textual content in an image, the first portion having a first convolutional neural network and a recurrent neural network;
a second part for determining whether a given text is in a given image, the second part having a second convolutional neural network and a word embedding module;
training the initial model to obtain a converged initial model, with the output of the first text image data through the recurrent neural network serving as the character-string input of the word embedding module and a first feature map, obtained after the first text image data passes through the second convolutional neural network of the second part, serving as the other input of the word embedding module;
based on the converged initial model, a text recognition model is obtained.
In one embodiment, training the initial model to obtain a converged initial model, with the output of the first text image data through the recurrent neural network as the character-string input of the word embedding module and the first feature map obtained after the first text image data passes through the second convolutional neural network of the second part as the other input of the word embedding module, includes:
inputting the first text image data into the first convolutional neural network and the recurrent neural network of the first part to obtain a character coding matrix;
inputting the first text image data into the second convolutional neural network of the second part to obtain a first feature map;
and inputting the character coding matrix and the first feature map into the word embedding module of the second part, and training the initial model according to a first loss function to obtain a converged initial model.
In one embodiment, the parameters of the second convolutional neural network are parameters obtained based on the parameters of the first convolutional neural network.
In one embodiment, the first part is a model obtained by training as follows:
training a pre-constructed first model according to a second loss function, based on second text image data and text annotation information of the second text image data, to obtain a recognition model capable of recognizing text content in an image, which serves as the first part;
the pre-constructed first model is a model constructed based on the first convolutional neural network and the recurrent neural network, and the output of the first convolutional neural network is the input of the recurrent neural network.
In one embodiment, the second part is a model obtained by training as follows:
training a pre-constructed second model according to a first loss function based on third text image data and a word vector corresponding to text labeling information of the third text image data, and obtaining a model capable of judging whether a given text is in a given image as a second part;
the pre-constructed second model is a model constructed based on a second convolutional neural network and the word embedding module, and the initial parameters of the second convolutional neural network are the parameters of the first convolutional neural network.
In one embodiment, training a pre-constructed second model according to a first loss function based on third text image data and a word vector corresponding to text annotation information of the third text image data includes:
inputting the third text image data into the second convolutional neural network to obtain a second feature map;
inputting the word vector as the character-string input of the word embedding module and the second feature map as the other input of the word embedding module, and training the pre-constructed second model based on the first loss function, where the first loss function is a binary (two-class) cross-entropy loss function.
In one embodiment, the word embedding module includes an encoder-decoder based on an attention mechanism.
In one embodiment, the first convolutional neural network comprises four Blocks arranged in sequence;
the outputs of the four Blocks are respectively down-sampled or up-sampled so that they share the same first size;
and the output of the first convolutional neural network is formed by adding the outputs of the four Blocks element-wise at corresponding positions.
In one embodiment, the second convolutional neural network comprises four Blocks arranged in sequence;
the outputs of the four Blocks are respectively down-sampled or up-sampled so that they share the same second size;
and the output of the second convolutional neural network is formed by adding the outputs of the four Blocks element-wise at corresponding positions.
In one embodiment, the obtaining a text recognition model based on the converged initial model comprises:
building the text recognition model based on the first part of the converged initial model; or,
taking the converged initial model as the text recognition model.
In a second aspect, an embodiment of the present invention provides a text recognition method, including:
inputting an image to be recognized into a pre-obtained text recognition model for text recognition, and outputting a text recognition result of the image to be recognized;
wherein the pre-obtained text recognition model comprises: a text recognition model obtained based on the method of any one of the embodiments of the first aspect.
In a third aspect, an embodiment of the present invention provides a training apparatus for a text recognition model, including:
the initial model building module is used for building an initial model;
the initial model training module is used for training the initial model to obtain a converged initial model, with the output of the first text image data through the recurrent neural network as the character-string input of the word embedding module and a first feature map, obtained after the first text image data passes through the second convolutional neural network of the second part, as the other input of the word embedding module;
a text recognition model obtaining module, configured to obtain a text recognition model based on the converged initial model;
wherein the initial model comprises:
a first portion for identifying textual content in an image, the first portion having a first convolutional neural network and a recurrent neural network;
a second portion for determining whether a given text is in a given image, the second portion having a second convolutional neural network and a word embedding module.
In one embodiment, the initial model training module is specifically configured to:
inputting the first text image data into the first convolutional neural network and the recurrent neural network of the first part to obtain a character coding matrix;
inputting the first text image data into the second convolutional neural network of the second part to obtain a first feature map;
and inputting the character coding matrix and the first feature map into the word embedding module of the second part, and training the initial model according to a first loss function to obtain a converged initial model.
In one embodiment, the parameters of the second convolutional neural network are parameters obtained based on the parameters of the first convolutional neural network.
In one embodiment, the initial model building module comprises: a first part training module, used for training a pre-constructed first model according to a second loss function, based on second text image data and text annotation information of the second text image data, to obtain a recognition model capable of recognizing text content in an image, which serves as the first part;
the pre-constructed first model is a model constructed based on the first convolutional neural network and the recurrent neural network, and the output of the first convolutional neural network is the input of the recurrent neural network.
In one embodiment, the initial model building module comprises:
the second part training module is used for training a pre-constructed second model according to a first loss function based on third text image data and word vectors corresponding to text labeling information of the third text image data, and obtaining a model capable of judging whether a given text is in a given image as a second part;
the pre-constructed second model is a model constructed based on a second convolutional neural network and the word embedding module, and the initial parameters of the second convolutional neural network are the parameters of the first convolutional neural network.
In one embodiment, training a pre-constructed second model according to a first loss function based on third text image data and a word vector corresponding to text annotation information of the third text image data includes:
inputting the third text image data into the second convolutional neural network to obtain a second feature map;
the word vector is used as the character-string input of the word embedding module and the second feature map as the other input of the word embedding module, and the pre-constructed second model is trained based on the first loss function, where the first loss function is a binary (two-class) cross-entropy loss function.
In one embodiment, the word embedding module includes an encoder-decoder based on an attention mechanism.
In one embodiment, the first convolutional neural network comprises four Blocks arranged in sequence;
the outputs of the four Blocks are respectively down-sampled or up-sampled so that they share the same first size;
and the output of the first convolutional neural network is formed by adding the outputs of the four Blocks element-wise at corresponding positions.
In one embodiment, the second convolutional neural network comprises four Blocks arranged in sequence;
the outputs of the four Blocks are respectively down-sampled or up-sampled so that they share the same second size;
and the output of the second convolutional neural network is formed by adding the outputs of the four Blocks element-wise at corresponding positions.
In an embodiment, the text recognition model obtaining module is specifically configured to:
building the text recognition model based on the first part of the converged initial model; or,
taking the converged initial model as the text recognition model.
In a fourth aspect, an embodiment of the present invention provides a text recognition apparatus, including:
the text recognition module is used for inputting the image to be recognized into a text recognition model obtained in advance for text recognition and outputting a text recognition result of the image to be recognized;
wherein the pre-obtained text recognition model comprises: a text recognition model obtained based on the method of any one of the embodiments of the above aspects.
In a fifth aspect, an embodiment of the present invention provides a readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the method in any one of the above-mentioned aspects.
In a sixth aspect, an embodiment of the present invention provides an electronic device, including: a processor and a memory, the memory having stored therein instructions that are loaded and executed by the processor to implement the method of any of the above aspects.
The advantages or beneficial effects of the above technical solution include at least the following:
The technical solution of the invention constructs an initial model from a first part capable of recognizing the text content of an image and a second part capable of judging whether a given text is in a given image, and takes the output of the recurrent neural network of the first part as an input of the word embedding module of the second part, thereby combining the first part and the second part; training this initial model achieves joint training. After training, a text recognition model is obtained based on the converged initial model, so the text recognition model indirectly gains the ability to model the relationships between characters, improving recognition speed while achieving higher recognition accuracy.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the invention and together with the description serve to explain the principles of the invention.
FIG. 1 is a flow chart of a method of training a text recognition model according to an embodiment of the present invention;
FIG. 2 is a flow chart of a training method of a first part of an embodiment of the present invention;
FIG. 3 is a flow chart of a training method of a second part of the embodiment of the present invention;
FIG. 4 is a flow chart of a method of training an initial model in an embodiment of the invention;
FIG. 5 is a logic diagram of an apparatus for training a text recognition model according to an embodiment of the present invention;
FIG. 6 is a flow chart of a text recognition method of an embodiment of the present invention;
FIG. 7 is a logic diagram of a text recognition apparatus according to an embodiment of the present invention;
fig. 8 is a schematic diagram of an electronic device of an embodiment of the invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not to be construed as limiting the invention. It should also be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
In addition, the embodiments of the present invention and the features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
It should be noted that the step numbers in the text are only for convenience of describing the specific embodiments and do not limit the execution order of the steps. The method provided by the embodiments can be executed by a relevant server; the following description takes an electronic device such as a server or a computer as the execution subject.
Example one
Referring to fig. 1, an embodiment of the present invention provides a method for training a text recognition model, including:
step S1: constructing an initial model, wherein the initial model is composed of basic parts such as a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN) and a word embedding module, the Convolutional Neural Network comprises a first Convolutional Neural Network and a second Convolutional Neural Network, the initial model comprises a first part and a second part, and the first part and the second part are respectively composed of a plurality of basic parts, wherein:
a first portion for identifying text content in an image, the first portion having a first convolutional neural network and a recurrent neural network;
a second part for determining whether a given text is in a given image, the second part having a word embedding module and a second convolutional neural network, where the second convolutional neural network is obtained based on the parameters of the first convolutional neural network;
Step S2: training the initial model to obtain a converged initial model, with the output of the first text image data through the recurrent neural network as the character-string input of the word embedding module and the feature map obtained after the first text image data passes through the second convolutional neural network of the second part as the other input of the word embedding module;
step S3: based on the converged initial model, a text recognition model is obtained.
According to the embodiment of the invention, an initial model is constructed from a first part capable of recognizing the text content of an image and a second part capable of judging whether a given text is in a given image, and the output of the recurrent neural network of the first part serves as an input of the word embedding module of the second part, combining the two parts. The initial model gains, through the second part, the ability to judge whether a given text is in a given image; through joint training the first part indirectly acquires this judgment ability as well, and with it the ability to model the relationships between characters, so recognition speed is improved while higher recognition accuracy is achieved.
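For concreteness, the two-part layout described above can be sketched in PyTorch as follows. This is a minimal illustration under assumed shapes and interfaces: the values of L and N, the `cnn1`/`cnn2` backbones, and the `word_embed` call signature are all assumptions for illustration, not the patented architecture itself.

```python
import torch
import torch.nn as nn

L, N = 32, 4000  # assumed: L = longest recognizable string, N = dictionary size

class InitialModel(nn.Module):
    """Sketch of the two-part initial model: the first part (cnn1 + rnn)
    recognizes text content; the second part (cnn2 + word_embed) judges
    whether a given text is in a given image."""
    def __init__(self, cnn1, cnn2, word_embed):
        super().__init__()
        self.cnn1 = cnn1  # first CNN; assumed to emit the L x N matrix described below
        self.rnn = nn.LSTM(N, N // 2, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.cnn2 = cnn2              # second CNN, initialized from cnn1's parameters
        self.word_embed = word_embed  # attention-based encoder-decoder

    def forward(self, image):
        q, _ = self.rnn(self.cnn1(image))  # character coding matrix (B, L, N)
        v = self.cnn2(image)               # feature map from the second part
        return self.word_embed(q, v)       # second part's output/judgment
```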
Fig. 2 illustrates a flowchart of a method for obtaining the first part in an embodiment of the present invention; the first part can be obtained as follows:
Step S11: construct a first model in advance based on the first convolutional neural network and the recurrent neural network, where the output of the first convolutional neural network is the input of the recurrent neural network;
Step S12: train the pre-constructed first model according to a second loss function, based on second text image data and the text annotation information of the second text image data, to obtain a recognition model capable of recognizing text content in an image.
In this embodiment, the second loss function is a Connectionist Temporal Classification (CTC) loss function, enabling supervised training of the first part based on the CTC loss; with a CTC loss function, the structure of the first part of this embodiment resembles the CRNN (Convolutional Recurrent Neural Network) structure.
A CRNN consists, from bottom to top, of a convolutional neural network, a recurrent neural network, and a transcription layer: the convolutional network extracts features from a picture containing characters, the recurrent network performs sequence prediction on those features, and the transcription layer translates the sequence produced by the recurrent network into a character sequence, with the CTC loss function as the training objective. One advantage of CRNN is that it can be trained end to end despite combining different types of network architectures; accordingly, the first part of this embodiment can also be trained end to end.
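As a hedged sketch of this CTC-supervised, end-to-end training step (the tensor shapes, the blank index, and the optimizer are assumptions, not fixed by the patent):

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)  # assumed: blank symbol at index 0

def train_first_part_step(log_probs, targets, input_lens, target_lens, optimizer):
    """log_probs: (L, B, N) log-softmax outputs of the recurrent network;
    targets: concatenated label indices for the batch."""
    loss = ctc_loss(log_probs, targets, input_lens, target_lens)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```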
As an embodiment, a two-layer bidirectional LSTM network is used to construct the recurrent neural network, and its input is an L × N matrix, where L represents the number of characters of the longest string that can be recognized and N represents the size of the dictionary; the output of the convolutional neural network is likewise an L × N matrix. Unlike conventional models, the first convolutional neural network in this embodiment comprises four Blocks arranged in sequence.
In this embodiment, a first text image is input into the first convolutional neural network, and each Block performs feature extraction to produce a corresponding feature map. The feature maps of the four Blocks are then down-sampled or up-sampled so that the processed feature maps share the same first size, which makes the output of the first convolutional neural network match the input length expected by the recurrent neural network. Specifically, the feature maps of the first three of the four Blocks are down-sampled, and the first size is 1/32 of the original input image. The outputs of the four Blocks are then added element-wise at corresponding positions, so that the output of the first convolutional neural network forms an L × N matrix serving as the input of the recurrent neural network.
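A minimal sketch of this four-Block fusion, assuming `F.interpolate` for the down-/up-sampling and leaving the Block definitions abstract; the same pattern applies to the second convolutional neural network described further below, with a different target size:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FourBlockCNN(nn.Module):
    """Four Blocks in sequence; every Block output is resampled to one
    common size and the four maps are added element-wise, position by
    position. Assumed: all Blocks share one channel count so the maps
    can be summed; the head reshaping the sum to an L x N matrix is omitted."""
    def __init__(self, blocks, target_size):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)  # assumed: 4 convolutional stages
        self.target_size = target_size       # e.g. 1/32 of the input for cnn1

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)  # feature extraction per Block
            feats.append(F.interpolate(x, size=self.target_size,
                                       mode='bilinear', align_corners=False))
        return torch.stack(feats, dim=0).sum(dim=0)  # point-by-point addition
```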
Fig. 3 is a flowchart of a method for obtaining the second part in an embodiment of the present invention; the second part can be obtained as follows:
s13: pre-constructing a second model based on a second convolutional neural network and a word embedding module; wherein the output of the second convolutional neural network is the other input of the word embedding module;
s14: and training a pre-constructed second model according to the first loss function based on the third text image data and the word vectors corresponding to the text labeling information of the third text image data, so as to obtain a model capable of judging whether the given text is in the given image.
In this embodiment, the word embedding module of the second part includes an encoder-decoder based on an attention (Attention) mechanism. The decoder can be regarded as a simple single-layer RNN, and the attention component can be regarded as sandwiched between two RNNs: the RNN before attention is the recurrent neural network of the first part, and the RNN after attention is the decoder. The attention component involves q, x, and v, where v is the set of the vectors x and q is the query vector.
In this embodiment, when the second part is trained separately, the word vector obtained by encoding the text information of a text image serves as the query vector q, and the matrix output by the second convolutional neural network serves as v, i.e., the other input of the word embedding module. When the first and second parts are trained in fusion, i.e., when the initial model is trained, the output of the recurrent neural network of the first part serves as the character-string input of the word embedding module, i.e., the query vector q. The similarity between q and v is computed to obtain a weight for each x in v; the weighted result serves as the input of the decoder and is decoded, and a loss value is obtained according to the first loss function in order to train the initial model.
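The similarity-and-weighting step can be sketched as scaled dot-product attention; the patent does not fix the similarity function, so the dot product here is an assumption:

```python
import math
import torch
import torch.nn.functional as F

def attention(q, v):
    """q: query vectors (B, L, D); v: the set of x vectors (B, T, D).
    The similarity between q and v yields a weight for each x in v;
    the weighted sum is what the decoder receives as input."""
    scores = torch.bmm(q, v.transpose(1, 2)) / math.sqrt(q.size(-1))
    weights = F.softmax(scores, dim=-1)  # one weight per x in v
    return torch.bmm(weights, v)         # decoder input (B, L, D)
```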
The text information is encoded with either a smoothed One-Hot scheme or Word2vec, so that each character in the text information is encoded as a vector of a specified length and one character string becomes a fixed-size matrix. Compared with ordinary One-Hot encoding, the smoothed scheme places 0.9 at the position of the corresponding element and 0.1 at all other positions.
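A sketch of the smoothed One-Hot encoding just described; the 0.9/0.1 values come from the description, while the dictionary indexing convention is an assumption:

```python
import torch

def smooth_one_hot(char_indices, num_classes, on_value=0.9, off_value=0.1):
    """Encode each character as a vector of a specified length, so one
    character string becomes a fixed-size matrix: the corresponding
    element is 0.9 and every other element is 0.1."""
    idx = torch.as_tensor(char_indices)
    vecs = torch.full((idx.numel(), num_classes), off_value)
    vecs[torch.arange(idx.numel()), idx] = on_value
    return vecs  # (string_length, num_classes)
```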
In this embodiment of the invention, the second part is trained in advance, before the initial model is trained. When the second part is trained separately, its parameters (including those of the second convolutional neural network and the word embedding module) are adjusted according to the text image data and the word vectors corresponding to the annotation information of that data, yielding the trained second part. That is, in step S14, training the pre-constructed second model according to the first loss function, based on the third text image data and the word vectors corresponding to its text annotation information, includes:
inputting the third text image data into the second convolutional neural network to obtain a second feature map;
inputting the word vector as the character-string input of the word embedding module and the second feature map as the other input, and training the pre-constructed second model based on the first loss function, with the output of the word embedding module connected to a Softmax function module; the first loss function is a binary (two-class) cross-entropy loss function.
Ideally, the predicted value and the true value of the word embedding module agree. The embodiment of the invention supervises the training of the second part so that it approaches this ideal state as closely as possible, with the Softmax function that follows the module used in obtaining the loss value from the binary cross-entropy loss.
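A hedged sketch of this supervision step, combining a two-class head, Softmax, and the binary cross-entropy loss; the head shape and the 0/1 label convention ("is this text in this image?") are assumptions:

```python
import torch
import torch.nn as nn

# CrossEntropyLoss over two classes combines the Softmax stage with the
# two-class cross-entropy loss used to supervise the second part.
criterion = nn.CrossEntropyLoss()

def train_second_part_step(second_part, image, word_vectors, label, optimizer):
    feat_map = second_part.cnn2(image)                       # second feature map
    logits = second_part.word_embed(word_vectors, feat_map)  # assumed shape (B, 2)
    loss = criterion(logits, label)  # label: 1 = text is in image, 0 = it is not
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```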
As an embodiment, the second convolutional neural network comprises four Blocks arranged in sequence.
In this embodiment, a third text image is input into the second convolutional neural network and each Block performs feature extraction to produce a feature map. The feature maps of the four Blocks are then down-sampled or up-sampled so that the processed maps share the same second size, which makes the output of the second convolutional neural network match the input length expected by the word embedding module. Specifically, the feature maps of the first two of the four Blocks are down-sampled and the feature map of the last Block is up-sampled, so the second size is 1/16 of the original input image. The outputs of the four Blocks are then added element-wise at corresponding positions, so that the output of the second convolutional neural network forms an L × N matrix, giving the feature map that serves as the input of the word embedding module.
Fig. 4 is a flowchart of the training method of the initial model in an embodiment of the present invention. In one embodiment, step S2 (training the initial model to obtain a converged initial model, with the output of the first text image data through the recurrent neural network as the character-string input of the word embedding module and the first feature map obtained after the first text image data passes through the second convolutional neural network of the second part as the other input of the word embedding module) includes the following steps:
S21: inputting the first text image data into the first convolutional neural network and the recurrent neural network of the first part to obtain a character coding matrix;
S22: inputting the first text image data into the second convolutional neural network of the second part to obtain a first feature map;
S23: inputting the character coding matrix and the first feature map into the word embedding module of the second part, obtaining a loss value according to the first loss function, and adjusting the parameters of the first part and the second part according to the loss value to obtain a converged initial model.
In one embodiment, after the training of the second part is completed and before step S2 is performed, the parameters of the recurrent neural network are fixed, so that when the initial model is trained, the parameters of the first and second convolutional neural networks are fine-tuned while the parameters of the recurrent neural network remain unchanged. In addition, when the initial model is trained, the CTC loss function of the first part can be masked or removed, and the loss value for supervised training is obtained from the first loss function of the second part, so that the first and second parts are trained in fusion.
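Continuing the earlier sketches, freezing the recurrent network and training only under the second part's loss might look like this; the attribute names follow the `InitialModel` sketch above and, together with the learning rate, remain assumptions:

```python
import torch

def prepare_fusion_training(model, lr=1e-4):
    """Fix the recurrent network's parameters so that fusion training
    fine-tunes only the two convolutional networks (and the word
    embedding module); the CTC loss is simply not computed."""
    for p in model.rnn.parameters():
        p.requires_grad = False
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)

def fusion_step(model, criterion, image, label, optimizer):
    logits = model(image)            # q from part 1, v from part 2, fused
    loss = criterion(logits, label)  # only the first loss supervises here
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```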
In this embodiment, the character coding matrix obtained from the first part and the first feature map obtained from the second part serve as the two inputs of the word embedding module of the second part, and the whole initial model is trained under the supervision of the first loss function. This achieves the goal of fusion training and finally yields a converged initial model, so that the resulting model recognizes both faster and more accurately.
As an optional implementation of this embodiment, obtaining the text recognition model based on the converged initial model includes constructing the text recognition model from the first part of the converged initial model. After fusion training, the first part can indirectly model the relationships between characters, so it can output recognition results of higher accuracy; and because the second part is no longer needed for discrimination, recognition speed is improved.
As can be seen from the above, the training method of this embodiment forms an initial model from a first part capable of recognizing the text content of an image and a second part capable of judging whether a given text is in a given image. The parameters of the first convolutional neural network of the first part serve as the initial parameters of the second convolutional neural network of the second part, so the two convolutional networks share parameters, and the output of the recurrent neural network of the first part serves as an input of the word embedding module of the second part, fusing the two parts. This initial model is trained with the whole model (both parts) supervised by the same loss function, finally achieving fusion training: the second part judges whether the text given by the first part is in the given image, i.e., whether the recognition result given by the first part is correct, and thus plays a discriminative role. After training, a text recognition model is constructed based on the first part, so the text recognition model indirectly models the relationships between characters, improving recognition speed while achieving higher recognition accuracy.
As another optional implementation of this embodiment, obtaining the text recognition model based on the converged initial model includes taking the converged initial model as the text recognition model, i.e., using the whole initial model for recognition. During text recognition, the output of the recurrent neural network of the first part (an L × N matrix) serves as one input of the word embedding module, and the feature map from the second convolutional neural network of the second part serves as the other input. The word embedding module outputs another L × N matrix; compared with the matrix output by the recurrent neural network, the matrix output by the word embedding module carries better estimates of the probability that each text belongs to the image. Finally, the output of the word embedding module is decoded and the entry with the maximum probability is selected as the final text content, further improving recognition accuracy.
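Decoding the word embedding module's L × N output then reduces to picking the maximum-probability dictionary entry at each of the L positions (a greedy sketch; `idx_to_char` is an assumed index-to-character table):

```python
import torch

def decode(prob_matrix, idx_to_char):
    """prob_matrix: (L, N) output of the word embedding module.
    Select the highest-probability entry per position as the final text."""
    indices = prob_matrix.argmax(dim=-1)
    return ''.join(idx_to_char[int(i)] for i in indices)
```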
Example two
Referring to fig. 5, an embodiment of the present invention provides a training apparatus for a text recognition model, including:
the initial model building module is used for building an initial model;
the initial model training module is used for training the initial model to obtain a converged initial model, with the output of the first text image data through the recurrent neural network as the character-string input of the word embedding module and a first feature map, obtained after the first text image data passes through the second convolutional neural network of the second part, as the other input of the word embedding module;
a text recognition model obtaining module, configured to obtain a text recognition model based on the converged initial model;
wherein the initial model comprises:
a first portion for identifying textual content in an image, the first portion having a first convolutional neural network and a recurrent neural network;
a second portion for determining whether a given text is in a given image, the second portion having a second convolutional neural network and a word embedding module.
In one embodiment, the initial model training module is specifically configured to:
inputting the first text image data into the first convolutional neural network and the recurrent neural network of the first part to obtain a character coding matrix;
inputting the first text image data into the second convolutional neural network of the second part to obtain a first feature map;
and inputting the character coding matrix and the first feature map into the word embedding module of the second part, and training the initial model according to a first loss function to obtain a converged initial model.
As an optional implementation manner of this embodiment, the parameter of the second convolutional neural network is a parameter obtained based on the parameter of the first convolutional neural network.
As an optional implementation manner of this embodiment, the initial model building module includes: the first part training module is used for training a pre-constructed first model according to a second loss function based on second text image data and text labeling information of the second text image data to obtain an identification model capable of identifying text contents in an image as a first part;
the pre-constructed first model is a model constructed based on the first convolutional neural network and the recurrent neural network, and the output of the first convolutional neural network is the input of the recurrent neural network.
As an optional implementation manner of this embodiment, the initial model building module includes:
the second part training module is used for training a pre-constructed second model according to a first loss function based on third text image data and word vectors corresponding to text labeling information of the third text image data, and obtaining a model capable of judging whether a given text is in a given image as a second part;
the pre-constructed second model is a model constructed based on a second convolutional neural network and the word embedding module, and the initial parameters of the second convolutional neural network are the parameters of the first convolutional neural network.
As an optional implementation manner of this embodiment, training a pre-constructed second model according to a first loss function based on third text image data and a word vector corresponding to text annotation information of the third text image data includes:
inputting the third text image data into the second convolutional neural network to obtain a second feature map;
the word vector is used as the character-string input of the word embedding module and the second feature map as the other input of the word embedding module, and the pre-constructed second model is trained based on the first loss function, where the first loss function is a binary (two-class) cross-entropy loss function.
As an optional implementation of this embodiment, the word embedding module includes an encoder-decoder based on an attention mechanism.
As an optional implementation of this embodiment, the first convolutional neural network comprises four Blocks arranged in sequence;
the outputs of the four Blocks are respectively down-sampled or up-sampled so that they share the same first size;
and the output of the first convolutional neural network is formed by adding the outputs of the four Blocks element-wise at corresponding positions.
As an optional implementation of this embodiment, the second convolutional neural network comprises four Blocks arranged in sequence;
the outputs of the four Blocks are respectively down-sampled or up-sampled so that they share the same second size;
and the output of the second convolutional neural network is formed by adding the outputs of the four Blocks element-wise at corresponding positions.
As an optional implementation manner of this embodiment, the text recognition model obtaining module is specifically configured to:
building the text recognition model based on the first part of the converged initial model; or,
taking the converged initial model as the text recognition model.
The principle and function of each module in the device of the present embodiment are the same as those in the first embodiment, and the description of the present embodiment is not repeated.
EXAMPLE III
Referring to fig. 6, an embodiment of the present invention provides a text recognition method, including:
inputting the image to be recognized into a pre-obtained text recognition model for text recognition, and outputting a text recognition result of the image to be recognized;
wherein the pre-obtained text recognition model comprises: a text recognition model obtained based on the method of any one of the preceding embodiments.
This embodiment uses the text recognition model obtained by the method of any of the foregoing embodiments; taking a text image to be recognized as input, the model can accurately recognize the text information in the image.
Example four
Referring to fig. 7, an embodiment of the present invention provides a text recognition apparatus, including:
the text recognition module is used for inputting the image to be recognized into a text recognition model obtained in advance for text recognition and outputting a text recognition result of the image to be recognized;
wherein the pre-obtained text recognition model comprises: a text recognition model obtained based on the method of any one of the preceding embodiments.
The principle and function of each module in the apparatus of the present embodiment are the same as those in the foregoing embodiments, and the description of the present embodiment is not repeated.
EXAMPLE five
The present embodiment provides a readable storage medium, in which a computer program is stored, and when being executed by a processor, the computer program implements the method in any one of the above embodiments. The storage medium may include: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
EXAMPLE six
Referring to fig. 8, the present embodiment provides an electronic apparatus including: a processor and a memory, the memory storing instructions therein, the instructions being loaded and executed by the processor to implement the method of any of the above embodiments.
It should be understood that the processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, and so on. A general-purpose processor may be a microprocessor or any conventional processor. Note that the processor may be a processor supporting the Advanced RISC Machine (ARM) architecture.
The memory may include read-only memory and random access memory, and may also include non-volatile random access memory; it may be volatile, non-volatile, or both. Non-volatile memory may include Read-Only Memory (ROM), Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), or flash memory. Volatile memory may include Random Access Memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. All or part of the steps of the above method embodiments may be implemented by hardware instructed by a program; the program may be stored in a computer-readable storage medium and, when executed, performs one of, or a combination of, the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module may also be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps in the process. And the scope of the preferred embodiments of the present invention includes additional implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved.
The logic and/or steps represented in the flowcharts or otherwise described herein, which may be regarded as a sequenced list of executable instructions for implementing logical functions, may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device, such as a computer-based system, a system including a processor, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them.
Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
It will be understood by those skilled in the art that the foregoing embodiments are merely for clarity of description and are not intended to limit the scope of the invention. Other variations or modifications will occur to those skilled in the art based on the foregoing disclosure and are within the scope of the invention.

Claims (24)

1. A training method for a text recognition model, characterized by comprising:
constructing an initial model, the initial model comprising:
a first portion for identifying textual content in an image, the first portion having a first convolutional neural network and a recurrent neural network;
a second part for determining whether a given text is in a given image, the second part having a second convolutional neural network and a word embedding module;
training the initial model to obtain a converged initial model, with the output of the first text image data through the first convolutional neural network and the recurrent neural network serving as the character-string input of the word embedding module and a first feature map, obtained after the first text image data passes through the second convolutional neural network of the second part, serving as the other input of the word embedding module;
based on the converged initial model, a text recognition model is obtained.
2. The method of claim 1,
the method for acquiring the character string of the word embedding module by using the output of the first text image data through the first convolutional neural network and the recurrent neural network as the character string input of the word embedding module, and training the initial model to acquire a converged initial model based on a first feature map acquired by the first text image data through a second convolutional neural network of a second part as the other input of the word embedding module, includes:
inputting the first text image data into a first convolution neural network and a first circulation neural network of a first part to obtain a character coding matrix;
inputting the first text image data into a second convolution neural network of a second part to obtain a first feature map;
and inputting the character coding matrix and the first characteristic diagram into a word embedding module of a second part, and training the initial model according to a first loss function to obtain a converged initial model.
3. The method of claim 1,
the parameters of the second convolutional neural network are parameters obtained based on the parameters of the first convolutional neural network.
4. The method of claim 1,
the first part is a model obtained by training as follows:
training a pre-constructed first model according to a second loss function, based on second text image data and text annotation information of the second text image data, to obtain a recognition model capable of recognizing text content in an image, which serves as the first part;
the pre-constructed first model is a model constructed based on the first convolutional neural network and the recurrent neural network, and the output of the first convolutional neural network is the input of the recurrent neural network.
5. The method of claim 1,
the second part is a model obtained by training in the following way:
training a pre-constructed second model according to a first loss function based on third text image data and a word vector corresponding to text labeling information of the third text image data, and obtaining a model capable of judging whether a given text is in a given image as a second part;
the pre-constructed second model is a model constructed based on a second convolutional neural network and the word embedding module, and the initial parameters of the second convolutional neural network are the parameters of the first convolutional neural network.
6. The method of claim 5,
training a pre-constructed second model according to a first loss function based on third text image data and a word vector corresponding to text annotation information of the third text image data, including:
inputting the third text image data into the second convolutional neural network to obtain a second feature map;
the word vector is used as the character-string input of the word embedding module and the second feature map as the other input of the word embedding module, and the pre-constructed second model is trained based on the first loss function, wherein the first loss function is a binary (two-class) cross-entropy loss function.
7. The method of claim 1, wherein the word embedding module comprises an encoder-decoder based on an attention mechanism.
8. The method of claim 1,
the first convolutional neural network comprises four Blocks arranged in sequence;
the outputs of the four Blocks are respectively down-sampled or up-sampled so that they share the same first size;
and the output of the first convolutional neural network is formed by adding the outputs of the four Blocks element-wise at corresponding positions.
9. The method of claim 1,
the second convolutional neural network comprises four Blocks arranged in sequence;
the outputs of the four Blocks are respectively down-sampled or up-sampled so that they share the same second size;
and the output of the second convolutional neural network is formed by adding the outputs of the four Blocks element-wise at corresponding positions.
10. The method of claim 1, wherein obtaining a text recognition model based on the converged initial model comprises:
building the text recognition model based on the first part of the converged initial model; or, alternatively,
taking the converged initial model as the text recognition model.
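With hypothetical attribute names, the two options in this claim amount to the following sketch:

```python
# Option 1: deploy only the recognition branch of the converged initial model.
text_recognition_model = converged_initial_model.first_part
# Option 2: deploy the whole converged initial model, verification branch included.
# text_recognition_model = converged_initial_model
```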
11. A text recognition method, comprising:
inputting an image to be recognized into a pre-obtained text recognition model for text recognition, and outputting a text recognition result of the image to be recognized;
wherein the pre-obtained text recognition model comprises a text recognition model obtained by the method of any one of claims 1 to 10.
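A usage sketch of this recognition method, with assumed names and a greedy decoding step (the real decoding depends on how the character encoding matrix is defined):

```python
import torch

with torch.no_grad():                        # inference only, no gradients needed
    logits = text_recognition_model(image)   # assumed (1, seq_len, num_classes)
    char_ids = logits.argmax(dim=-1)         # greedy per-position character indices
```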
12. An apparatus for training a text recognition model, comprising:
an initial model building module, configured to build an initial model;
an initial model training module, configured to take the output of first text image data, after passing through the first convolutional neural network and the recurrent neural network, as the character-string input of the word embedding module, take the first feature map, obtained after the first text image data passes through the second convolutional neural network of the second part, as the other input of the word embedding module, and train the initial model to obtain a converged initial model;
a text recognition model obtaining module, configured to obtain a text recognition model based on the converged initial model;
wherein the initial model comprises:
a first part for recognizing the text content in an image, the first part having a first convolutional neural network and a recurrent neural network;
a second part for determining whether a given text appears in a given image, the second part having a second convolutional neural network and a word embedding module.
13. The apparatus of claim 12, wherein the initial model training module is specifically configured to:
input the first text image data into the first convolutional neural network and the recurrent neural network of the first part to obtain a character encoding matrix;
input the first text image data into the second convolutional neural network of the second part to obtain a first feature map;
and input the character encoding matrix and the first feature map into the word embedding module of the second part, and train the initial model according to the first loss function to obtain a converged initial model.
14. The apparatus of claim 12, wherein the parameters of the second convolutional neural network are obtained based on the parameters of the first convolutional neural network.
15. The apparatus of claim 12, wherein the initial model building module comprises:
a first part training module, configured to train a pre-constructed first model according to the second loss function, based on second text image data and the text annotation information of the second text image data, to obtain a recognition model capable of recognizing the text content in an image, which serves as the first part;
wherein the pre-constructed first model is constructed based on the first convolutional neural network and the recurrent neural network, and the output of the first convolutional neural network is the input of the recurrent neural network.
16. The apparatus of claim 12, wherein the initial model building module comprises:
a second part training module, configured to train a pre-constructed second model according to the first loss function, based on third text image data and word vectors corresponding to the text annotation information of the third text image data, to obtain a model capable of determining whether a given text appears in a given image, which serves as the second part;
wherein the pre-constructed second model is constructed based on the second convolutional neural network and the word embedding module, and the initial parameters of the second convolutional neural network are the parameters of the first convolutional neural network.
17. The apparatus of claim 16,
wherein training the pre-constructed second model according to the first loss function, based on the third text image data and the word vectors corresponding to the text annotation information of the third text image data, comprises:
inputting the third text image data into the second convolutional neural network to obtain a second feature map;
taking the word vectors as the character-string input of the word embedding module and the second feature map as the other input of the word embedding module, and training the pre-constructed second model based on the first loss function; wherein the first loss function is a binary cross-entropy loss function.
18. The apparatus of claim 12, wherein the word embedding module comprises an attention-based encoder-decoder.
19. The apparatus of claim 12, wherein the first convolutional neural network comprises four blocks arranged in sequence;
the outputs of the four blocks are each down-sampled or up-sampled so that all four outputs have the same first size;
and the output of the first convolutional neural network is formed by point-by-point (element-wise) addition of the four blocks' outputs at corresponding positions.
20. The apparatus of claim 12, wherein the second convolutional neural network comprises four blocks arranged in sequence;
the outputs of the four blocks are each down-sampled or up-sampled so that all four outputs have the same second size;
and the output of the second convolutional neural network is formed by point-by-point (element-wise) addition of the four blocks' outputs at corresponding positions.
21. The apparatus of claim 12, wherein the text recognition model obtaining module is specifically configured to:
build the text recognition model based on the first part of the converged initial model; or, alternatively,
take the converged initial model as the text recognition model.
22. A text recognition apparatus, comprising:
a text recognition module, configured to input an image to be recognized into a pre-obtained text recognition model for text recognition, and output a text recognition result of the image to be recognized;
wherein the pre-obtained text recognition model comprises a text recognition model obtained by the method of any one of claims 1 to 10.
23. A readable storage medium having stored therein a computer program which, when executed by a processor, implements the method of any one of claims 1 to 11.
24. An electronic device, comprising: a processor and a memory, the memory storing instructions that are loaded and executed by the processor to implement the method of any one of claims 1 to 11.
CN202110258666.7A 2021-03-10 2021-03-10 Training method of text recognition model, text recognition method, device and equipment Active CN112633422B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110258666.7A CN112633422B (en) 2021-03-10 2021-03-10 Training method of text recognition model, text recognition method, device and equipment

Publications (2)

Publication Number Publication Date
CN112633422A (en) 2021-04-09
CN112633422B (en) 2021-06-22

Family

ID=75297825

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113269189B (en) * 2021-07-20 2021-10-08 北京世纪好未来教育科技有限公司 Construction method of text recognition model, text recognition method, device and equipment
CN113963358B (en) * 2021-12-20 2022-03-04 北京易真学思教育科技有限公司 Text recognition model training method, text recognition device and electronic equipment
CN115050014A (en) * 2022-06-15 2022-09-13 河北农业大学 Small sample tomato disease identification system and method based on image text learning

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948604A (en) * 2019-02-01 2019-06-28 北京捷通华声科技股份有限公司 Recognition methods, device, electronic equipment and the storage medium of irregular alignment text
CN111405360A (en) * 2020-03-25 2020-07-10 腾讯科技(深圳)有限公司 Video processing method and device, electronic equipment and storage medium
WO2020146119A1 (en) * 2019-01-11 2020-07-16 Microsoft Technology Licensing, Llc Compositional model for text recognition
CN111461105A (en) * 2019-01-18 2020-07-28 顺丰科技有限公司 Text recognition method and device
CN111723789A (en) * 2020-02-19 2020-09-29 王春宝 Image text coordinate positioning method based on deep learning
CN111860389A (en) * 2020-07-27 2020-10-30 北京易真学思教育科技有限公司 Data processing method, electronic device and computer readable medium
CN112016315A (en) * 2020-10-19 2020-12-01 北京易真学思教育科技有限公司 Model training method, text recognition method, model training device, text recognition device, electronic equipment and storage medium
CN112381079A (en) * 2019-07-29 2021-02-19 富士通株式会社 Image processing method and information processing apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant