CN112861739A

CN112861739A - End-to-end text recognition method, model training method and device

Info

Publication number: CN112861739A
Application number: CN202110186700.4A
Authority: CN
Inventors: 张勇东; 周宇; 谢洪涛
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2021-02-10
Filing date: 2021-02-10
Publication date: 2021-05-28
Anticipated expiration: 2041-02-10
Also published as: CN112861739B

Abstract

An end-to-end text recognition method, a model training method and a device are provided, wherein the model training method comprises the following steps: constructing an initial end-to-end text recognition model, wherein the initial end-to-end text recognition model comprises an initial text detection module and an initial text recognition module; acquiring a training sample data set; processing training samples in the training sample data set by using a sample generation algorithm, and generating an augmented training sample data set so as to increase the number of training samples for training the initial text recognition module; and training the initial end-to-end text recognition model by utilizing the training sample data set and the augmented training sample data set to obtain an end-to-end text recognition model. According to the technical scheme, a large number of training samples for training the text recognition module are generated by using the sample generation algorithm, so that the problems of over-fitting of the text detection module and under-fitting of the text recognition module are effectively solved, and the recognition accuracy of the end-to-end text recognition model is improved.

Description

End-to-end text recognition method, model training method and device

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to an end-to-end text recognition method, and an end-to-end text recognition model training method and device.

Background

An end-to-end text recognition method generally integrates a text detection module, which detects the location of text, and a text recognition module, which recognizes the content of the text, into one network model. End-to-end text recognition is widely used in fields such as autonomous driving, machine translation and commodity retrieval. For each application field, however, the text recognition module and the text detection module in the model need to be trained in order to achieve good recognition accuracy. The number of samples required to train the text recognition module is typically more than 100 times the number required to train the text detection module. In current training methods for end-to-end text recognition models, however, this ratio is less than 10, so the demand for training samples is not met. In these methods, the shortage of samples for training the text recognition module therefore leads to over-fitting of the text detection module and under-fitting of the text recognition module, which greatly limits the accuracy of end-to-end text recognition.

Disclosure of Invention

In view of the above, the present invention provides an end-to-end text recognition method, an end-to-end text recognition model training method and an end-to-end text recognition model training device, so as to at least partially solve at least one of the above-mentioned technical problems.

In order to achieve the purpose, the technical scheme of the invention comprises the following steps:

As an aspect of the present invention, there is provided a training method for an end-to-end text recognition model, including:

constructing an initial end-to-end text recognition model, wherein the initial end-to-end text recognition model comprises an initial text detection module and an initial text recognition module;

acquiring a training sample data set;

processing training samples in the training sample data set by using a sample generation algorithm, and generating an augmented training sample data set so as to increase the number of training samples for training the initial text recognition module; and

training the initial end-to-end text recognition model by utilizing the training sample data set and the augmented training sample data set to obtain an end-to-end text recognition model.

As another aspect of the present invention, an end-to-end text recognition method is further provided, where the recognition method is implemented based on an end-to-end text recognition model obtained by training with a training method, the end-to-end text recognition model includes a text detection module and a text recognition module, and the method includes:

inputting a text image to be detected into the text detection module of the end-to-end text recognition model to obtain word-level features;

and inputting the word-level features into the text recognition module of the end-to-end text recognition model to obtain sequence information, wherein the sequence information is used for representing the content of the text in the text image.

As yet another aspect of the present invention, there is also provided a training apparatus for an end-to-end text recognition model, including:

a construction module, used for constructing an initial end-to-end text recognition model, wherein the initial end-to-end text recognition model comprises an initial text detection module and an initial text recognition module;

an acquisition module, used for acquiring a training sample data set;

a sample generation module, used for processing the training sample data set by utilizing a sample generation algorithm and generating an augmented training sample data set so as to increase the number of training samples for training the initial text recognition module; and

a training module, used for training the initial end-to-end text recognition model by utilizing the training sample data set and the augmented training sample data set to obtain an end-to-end text recognition model, wherein the end-to-end text recognition model comprises a text detection module and a text recognition module.

Based on the technical scheme, the end-to-end text recognition method has the following positive effects:

a large number of training samples for training the text recognition module are generated by using a sample generation algorithm, so that the problems of over-fitting of the text detection module and under-fitting of the text recognition module are effectively solved, and the recognition accuracy of the end-to-end text recognition model is improved;

character-level labels are generated from word-level labels by using a weakly supervised learning algorithm, and the text recognition module is trained by using the generated labels;

the multi-level feature enhancement module is utilized to fuse the multi-level features and enhance the features so that the text detection module has stronger feature representation capability, and therefore the text can be detected more accurately.

Drawings

The above and other objects, features and advantages of the present invention will become more apparent from the following description of the embodiments of the present invention with reference to the accompanying drawings, in which:

FIG. 1 schematically illustrates a flow diagram of an end-to-end text recognition model training method of an embodiment of the invention;

FIG. 2 schematically illustrates a flow diagram of an end-to-end text recognition method of an embodiment of the present invention;

FIG. 3 schematically illustrates a model framework diagram of an end-to-end text recognition model of an embodiment of the present invention;

FIG. 4 schematically illustrates a network architecture diagram of a multi-level feature enhancement module of an embodiment of the present invention;

FIG. 5 is a schematic diagram illustrating the recognition effect of an end-to-end text recognition method of an embodiment of the present invention on a data set ICDAR 2013;

FIG. 6 is a schematic diagram showing the recognition effect of the end-to-end text recognition method of the present invention on a data set ICDAR 2015;

FIG. 7 is a diagram schematically illustrating the recognition effect of the end-to-end Text recognition method according to the embodiment of the present invention on the data set Total-Text;

FIG. 8 schematically shows a block diagram of a training apparatus of an end-to-end text recognition model according to an embodiment of the present invention.

Detailed Description

Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. It is to be understood that such description is merely illustrative and not intended to limit the scope of the present invention. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the invention. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.

All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs, unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.

Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.). Where a convention analogous to "at least one of A, B or C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B or C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).

The invention provides a sample generation algorithm that generates samples for training a text recognition module online according to character positions and label information. However, many existing data sets do not provide character-level labels. The present invention therefore proposes a weakly supervised learning strategy for generating character-level labels from word-level labels; the sample generation algorithm then uses the character-level labels generated by the weakly supervised learning strategy to generate an augmented word-level feature data set, and the text recognition module is trained using this augmented data set.

As an aspect of the present invention, a method for training an end-to-end text recognition model is provided. Referring to FIG. 1, the method includes operations S101 to S104.

In operation S101, an initial end-to-end text recognition model is constructed, where the initial end-to-end text recognition model includes an initial text detection module and an initial text recognition module.

According to an embodiment of the present invention, the initial text detection module may include a feature extraction branch, a global text segmentation branch, a detection branch, and a word/character segmentation branch.

According to embodiments of the invention, the feature extraction branch may include a ResNet50 network and an FPN network.

In operation S102, a training sample data set is acquired;

in operation S103, processing the training samples in the training sample data set by using a sample generation algorithm, and generating an augmented training sample data set, so as to increase the number of training samples used for training the initial text recognition module; and

in operation S104, an initial end-to-end text recognition model is trained by using the training sample data set and the augmented training sample data set to obtain an end-to-end text recognition model.

According to the implementation of the invention, a large number of training samples for training the text recognition module are generated by using the sample generation algorithm, so that the problems of over-fitting of the text detection module and under-fitting of the text recognition module are effectively solved, and the recognition accuracy of the end-to-end text recognition model is improved.

According to the embodiment of the invention, the training sample data set comprises a pre-training sample data set, wherein pre-training samples in the pre-training sample data set comprise character-level marking information, and the character-level marking information comprises character-level position information and character-level label information;

processing training samples in the training sample data set by using a sample generation algorithm and generating an augmented training sample data set comprises:

processing the character-level marking information in the pre-training samples by using the sample generation algorithm to generate an augmented pre-training sample data set, wherein the augmented pre-training sample data set comprises a plurality of augmented pre-training samples.

According to the embodiment of the invention, training the initial end-to-end text recognition model by utilizing the training sample data set and the augmented training sample data set to obtain an end-to-end text recognition model comprises the following steps:

training the initial text detection module by utilizing the pre-training sample data set to obtain a pre-training text detection module;

and training the initial text recognition module by using the augmented pre-training sample data set to obtain a pre-training text recognition module.

According to the embodiment of the invention, the training samples in the training sample data set provide only a few word-level features (words) per text image, which is not enough for the text recognition module to converge; the invention therefore uses the sample generation algorithm to generate more word-level features for training the text recognition module.

According to the embodiment of the invention, the input of the text recognition module is a word-level feature. When the training sample data set provides word-level labels (each comprising word-level label information, namely the content of the word, and word-level position information), these labels can be used directly; that is, the features corresponding to the word-level labels can be cropped according to the word-level labels provided by the training sample data set. However, the method is not limited to this: the number of features cropped by using the word-level labels provided by the sample data set is insufficient, so the sample generation algorithm SGA may be used to generate more word-level features from the pre-training samples, so as to obtain the augmented pre-training sample data set.

According to an optional embodiment of the invention, to generate word-level features, the SGA may crop the character-level feature corresponding to each character according to the label information of the character and the corresponding character position information, and then randomly combine the character-level features to obtain word-level features. For example, for the word world, the character-level features corresponding to w, o, r, l and d are cropped and then combined randomly, possibly yielding word-level features such as world, wldor and oldwr. These generated word-level features are then input to the initial text recognition module to train it.

According to the embodiment of the invention, the augmented pre-training sample data set may comprise 100 augmented pre-training samples, so that the number of training samples for training the initial text recognition module is significantly increased, and the problems of over-fitting of the text detection module and under-fitting of the text recognition module are effectively solved.

According to the embodiment of the invention, the training sample data set further comprises a weak supervision training sample data set, wherein the training samples in the weak supervision training sample data set comprise word-level marking information, wherein the word-level marking information comprises word-level position information and word-level label information;

processing the training samples in the training sample data set by using the sample generation algorithm and generating the augmented training sample data set further comprises:

processing training samples in the weakly supervised training sample data set by using the pre-training text detection module to generate predicted character-level marking information, wherein the predicted character-level marking information comprises predicted character-level position information and predicted character-level label information;

and processing the predicted character-level marking information by using the sample generation algorithm to generate an augmented weakly supervised training sample data set, wherein the augmented weakly supervised training sample data set comprises a plurality of augmented weakly supervised training samples, so that the pre-training text recognition module can be trained by using the augmented weakly supervised training sample data set.

According to the embodiment provided by the invention, a weak supervision method is used in the training task of an end-to-end text recognition model and is combined with a sample generation algorithm that generates samples online for training the initial text recognition module.

The method generates character-level labels from word-level labels in a weakly supervised manner for training the end-to-end text recognition model, and uses the sample generation algorithm SGA to generate samples online from the character-level labels for training the text recognition module, so that the problems of over-fitting of the text detection module and under-fitting of the text recognition module are effectively solved, and the accuracy of end-to-end text recognition is improved.

According to an embodiment of the invention, for word-level labeling information in a weakly supervised training sample data set that does not provide character-level labels, the pre-training text detection module may be used to generate predicted character-level labels from the word-level labels. The character-level labels (character classes and character position information) that the sample generation algorithm SGA needs in order to generate samples online can thus come from two sources: one is originally provided by the training sample data set and can be used directly; the other is predicted online from the word-level labels by the pre-training text detection module.

The following explains the process of generating samples by the sample generation algorithm SGA with specific examples, and it should be noted that the following examples are only used for exemplifying the process of generating samples by the sample generation algorithm SGA, and do not limit the present invention in any way.

In accordance with an embodiment of the present invention, assume that M character-level labels are generated from the word-level labels using weakly supervised learning. The sample generation algorithm SGA randomly selects N character-level labels from the M character-level labels, where N ≤ M. Then, the N character-level features corresponding to the N selected labels are determined, and these N character-level features are spliced into a new word-level feature X, so that a new sample is generated online and used for training the text recognition module.

Assume that one text image contains two texts, cat and zoo. After the text image is processed by using the weak supervision algorithm, a character-level label set { z, o, o, c, a, t } is obtained. The sample generation algorithm SGA randomly selects N character-level labels from this set, and the N character-level features corresponding to the N labels are cropped and combined to form a word. For example, when N is 3, a character-level label, e.g., o, is first randomly selected from the set { z, o, o, c, a, t }; a second character-level label, e.g., t, is then randomly selected from the same set; finally, a third character-level label, e.g., t, is randomly selected from the same set. Through the above operations, three character-level labels o, t and t are obtained, and the corresponding character-level features are cropped and combined to form the word ott. It should be noted that, when a sample is generated by using the sample generation algorithm SGA, it is not necessary to consider whether the generated sample is a correct word: incorrect words have a disordered character order, which increases the diversity of the samples and helps make the text recognition module more robust. At this point, the sample generation algorithm SGA has generated one word, i.e., one sample, from the character-level label set { z, o, o, c, a, t }. The character-level label set obtained by the detection branch may include M characters. Because too small a value of N is not favorable for training the text recognition module, the value of N may satisfy 3 ≤ N ≤ M.

In the above example, although there are two character-level labels o in the character-level label set { z, o, o, c, a, t }, their coordinates in the text image are different, so the two labels o can be regarded as two different individuals; after the first character-level label o is selected, either the same label o or the other label o can be selected next. Since the set includes 6 character-level labels and the sample generation algorithm SGA selects one of all 6 labels at each draw, a generated sample may consist entirely of the same character, e.g., oooooo. The specific process can be as follows: a random number N is generated each time a sample is created. For example, if the random number 3 is generated the first time, 3 characters are randomly selected to form a new word; because each of the three characters is selected from all the characters, all 3 may be the same character. The second time, a random number is generated again, which may be 3, 4, 5 or 6, and the corresponding features are cropped again to form a new word. By analogy, in an embodiment of the invention, 100 new samples are generated for each text image by using the sample generation algorithm SGA. Among these 100 new samples there may be several identical ones, for example, two generated samples may both be world; in that case, each sample is still input into the text recognition module as a separate sample and none is discarded, so as to ensure that the number of samples generated by the sample generation algorithm SGA is sufficient.
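The sampling procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the function name `generate_samples`, the array shapes, the assumption that all cropped character-level features share the same size, and the choice to splice features along the width axis are all assumptions introduced here.

```python
import numpy as np

def generate_samples(char_labels, char_feats, num_samples=100, min_len=3, rng=None):
    """Build word-level samples from character-level labels and features.

    char_labels: list of M single-character strings.
    char_feats:  array of shape (M, C, H, W), one cropped feature per character
                 (equal sizes are an assumption of this sketch).
    Returns a list of (word, feature) pairs; each feature has shape (C, H, N * W).
    """
    rng = np.random.default_rng() if rng is None else rng
    m = len(char_labels)
    if m == 0:
        return []
    low = min(min_len, m)
    samples = []
    for _ in range(num_samples):
        n = int(rng.integers(low, m + 1))   # word length N with low <= N <= M
        idx = rng.integers(0, m, size=n)    # each draw picks from all M characters
        word = "".join(char_labels[i] for i in idx)
        # Splice the N character-level features side by side (assumed layout).
        feat = np.concatenate([char_feats[i] for i in idx], axis=-1)
        samples.append((word, feat))        # duplicates are kept, not discarded
    return samples

# Character-level label set from the "cat" / "zoo" example, with dummy features.
labels = list("zoocat")
feats = np.random.default_rng(0).normal(size=(6, 8, 4, 4)).astype(np.float32)
samples = generate_samples(labels, feats, num_samples=100, rng=np.random.default_rng(1))
```

Because every draw selects from the full label set, repeated characters and repeated words can occur, which matches the behaviour described in the text.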

According to the embodiment of the invention, after the samples are generated by using the SGA, K features corresponding to the K word-level labels provided by the training sample data set are also obtained. Both the K features obtained from the K word-level labels provided by the sample data set and the 100 samples generated by the sample generation algorithm SGA may be input to the text recognition module for training the text recognition module.

According to an embodiment of the present invention, the text recognition module may be a text recognition module based on a 2-dimensional spatial attention mechanism. To prevent attention drift, so that attention is better aligned at each decoding step, the present invention uses a position embedding mechanism so that the text recognition module knows which character is to be processed at each processing step. The output layer of the text recognition module may include 63 neurons: 10 neurons for outputting Arabic numerals, 26 neurons for outputting English letters, 26 neurons for outputting punctuation marks, and 1 neuron for outputting a sequence terminator, wherein the sequence terminator indicates the end of each word recognition process.
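As a quick check of the output-layer arithmetic, the following sketch lays out the 63 classes as index ranges. The ordering of the groups, the use of lowercase letters, and the helper name `class_group` are assumptions for illustration; the text does not specify which 26 punctuation marks are used, so they are represented only by a count.

```python
import string

DIGITS = string.digits              # 10 Arabic numerals
LETTERS = string.ascii_lowercase    # 26 English letters (case handling is an assumption)
NUM_PUNCT = 26                      # the specific punctuation marks are not listed in the text
EOS_INDEX = len(DIGITS) + len(LETTERS) + NUM_PUNCT  # last index, reserved for the terminator
NUM_CLASSES = EOS_INDEX + 1

def class_group(index):
    """Return which group of output neurons a class index falls into (assumed ordering)."""
    if not 0 <= index < NUM_CLASSES:
        raise ValueError(f"class index out of range: {index}")
    if index < len(DIGITS):
        return "digit"
    if index < len(DIGITS) + len(LETTERS):
        return "letter"
    if index < EOS_INDEX:
        return "punctuation"
    return "end-of-sequence"
```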

Since text recognition models in the prior art that are based on CTC (Connectionist Temporal Classification) or on a one-dimensional attention mechanism cannot effectively recognize irregular text, such as curved text, embodiments of the present invention recognize text using a 2-D attention-based text recognition module, which uses a convolutional neural network as an encoder to extract features and a 2-D attention-based decoder to decode the character sequence.

The principle of the position embedding mechanism is that the position of each character in a single word is calculated through a sine function and a cosine function, so that the 2-dimensional space attention mechanism focuses on the currently processed character instead of the character which is not processed, and the accuracy of text recognition is improved.
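The text states only that character positions are encoded with sine and cosine functions. One widely used scheme of this kind is the sinusoidal position encoding from Transformer models; the sketch below uses that scheme as an assumption, and the function name, the base constant 10000 and the embedding dimension are illustrative choices rather than values given in the text.

```python
import numpy as np

def sinusoidal_position_embedding(max_len, dim):
    """Return a (max_len, dim) table; row t encodes character position t in a word."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    positions = np.arange(max_len)[:, None]                          # (max_len, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)    # (dim / 2,)
    angles = positions * freqs                                       # (max_len, dim / 2)
    emb = np.zeros((max_len, dim))
    emb[:, 0::2] = np.sin(angles)   # even channels use the sine function
    emb[:, 1::2] = np.cos(angles)   # odd channels use the cosine function
    return emb

# One embedding vector per decoding step; the decoder would combine row t
# with its input at step t so that attention can tell the steps apart.
pe = sinusoidal_position_embedding(max_len=32, dim=64)
```

Each position receives a distinct vector, which is what lets a decoder distinguish the character currently being processed from those not yet processed.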

According to an alternative embodiment of the invention, the loss function used to train the parameters in the initial end-to-end text recognition model may be as follows:

$$L = L_{rpn} + L_{gseg} + L_{rcnn} + L_{mask} + L_{recog}$$

where $L_{rpn}$ and $L_{rcnn}$ are the loss functions of the RPN and the detection branch, respectively; $L_{gseg}$ is the loss function of the global text segmentation branch, which uses the dice loss; $L_{mask}$ is the loss function of the word/character segmentation branch, which uses the binary cross-entropy loss; and $L_{recog}$ is the loss function of the text recognition module.
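To make the composition of the total loss concrete, the following sketch implements one common form of the dice loss and of the binary cross-entropy loss and sums the five terms with equal weights, as in the formula above. The exact dice variant, the smoothing constants and the function names are assumptions; the text names the loss types but does not give their formulas.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """One common dice loss: 1 - 2|P∩T| / (|P| + |T|), on probabilities in [0, 1]."""
    inter = np.sum(pred * target)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps))

def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    p = np.clip(pred, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))))

def total_loss(l_rpn, l_gseg, l_rcnn, l_mask, l_recog):
    """Unweighted sum of the five loss terms."""
    return l_rpn + l_gseg + l_rcnn + l_mask + l_recog

# Toy segmentation target: text occupies the top row.
target = np.array([[1.0, 1.0], [0.0, 0.0]])
perfect = target.copy()
inverted = 1.0 - target
```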

where the loss function $L_{recog}$ of the text recognition module is as follows:

$$p(y_t) = \mathrm{softmax}(W_o x_t + b_o)$$

$$(x_t, s_t) = \mathrm{RNN}(s_{t-1}, r_t)$$

where $p(y_t)$ is the conditional probability of the prediction result at time step $t$, $t$ denotes a time step, $T$ denotes the total number of time steps in processing each character string, $W_o$ denotes the weight, $x_t$ denotes the output of the RNN at time $t$, $b_o$ denotes the offset, $s_t$ denotes the hidden state of the RNN at time $t$, and $r_t$ denotes the current input together with the output of the previous time step.
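The per-step prediction $p(y_t)$ can be illustrated with a small numerical sketch. The softmax projection follows the formula above; the final aggregation over the $T$ steps is not shown in the text, so the averaged negative log-likelihood used here is an assumption (a common choice for attention-based recognizers), and all names and shapes are illustrative.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def step_probabilities(x, w_o, b_o):
    """p(y_t) = softmax(W_o x_t + b_o) for all steps at once.

    x: (T, D) RNN outputs, w_o: (K, D) weights, b_o: (K,) offsets. Returns (T, K).
    """
    return softmax(x @ w_o.T + b_o)

def average_nll(probs, targets):
    """Assumed aggregation: mean over T steps of -log p(y_t) for the true class."""
    steps = np.arange(len(targets))
    return float(-np.mean(np.log(probs[steps, targets])))

T, D, K = 4, 16, 63                      # 63 output classes as described in the text
x = np.random.default_rng(0).normal(size=(T, D))
w_o, b_o = np.zeros((K, D)), np.zeros(K)  # untrained weights give a uniform distribution
probs = step_probabilities(x, w_o, b_o)
targets = np.array([3, 10, 36, 62])
loss = average_nll(probs, targets)
```

With all-zero weights every class is equally likely, so the loss equals log(63); training would drive it below that baseline.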

As another aspect of the present invention, referring to FIG. 2, there is also provided an end-to-end text recognition method, where the text recognition method is implemented based on an end-to-end text recognition model trained by the training method, the end-to-end text recognition model includes a text detection module and a text recognition module, and the method includes operations S201 and S202.

In operation S201, inputting a text image to be detected into a text detection module of an end-to-end text recognition model to obtain a word-level feature;

in operation S202, the word-level features are input into a text recognition module of the end-to-end text recognition model, so as to obtain sequence information, where the sequence information is used to represent the content of the text in the text image.

According to the embodiment of the invention, the text detection module comprises a detection branch, a word/character segmentation branch and a global text segmentation branch; the word/character segmentation branch comprises a convolution unit and a multi-level feature enhancement unit;

inputting the text image to be detected into the text detection module of the end-to-end text recognition model to obtain the word-level features comprises:

processing the text image to be detected by using the detection branch to obtain a classification and regression feature map;

processing the text image to be detected by using the global text segmentation branch to obtain a global information feature map;

processing the text image to be detected by using the convolution unit in the word/character segmentation branch to obtain a word-level information feature map;

processing the global information feature map and the word-level information feature map by using the multi-level feature enhancement unit to obtain a word/character segmentation map;

and fusing the classification and regression feature map with the word/character segmentation map to obtain the word-level features.

According to other embodiments of the invention, the text image to be detected can be processed only by using the detection branch to obtain the word-level labeling information, namely, the word-level label information and the word-level position information.

However, it should be noted that the word-level tag information and the word-level position information obtained by using the detection branch are not accurate enough and include noise.

According to the embodiment of the invention, accurate word-level features can instead be output by combining the detection branch, the global text segmentation branch and the word/character segmentation branch.

According to the embodiment of the invention, as shown in FIG. 3, the global information feature map F4 and the word-level information feature map F3 are processed by the multi-level feature enhancement unit MFE in the word/character segmentation branch to obtain the word/character segmentation map, which can determine the outline of the text in the text image to be detected. The value inside the contour can be predefined to be 1 and the value outside the contour can be predefined to be 0.

According to the embodiment of the invention, the word/character segmentation map obtained by the word/character segmentation branch and the classification and regression feature map obtained by the detection branch are fused by element-wise multiplication to obtain the word-level features. The word/character segmentation map contains an accurate outline of the text in the text image to be detected. After it is multiplied element-wise with the noisy classification and regression feature map, the region of the feature map that lies inside the outline is retained and the region outside the outline is filtered out, so that the noise in the classification and regression feature map is removed and more accurate word-level features are obtained.
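The filtering effect of the element-wise multiplication can be shown with a small sketch. The array shapes and the function name `filter_with_mask` are assumptions for illustration; the segmentation map is taken to be 1 inside the text outline and 0 outside, as described above.

```python
import numpy as np

def filter_with_mask(features, seg_map):
    """Keep feature values inside the text outline and zero out everything else.

    features: (C, H, W) feature map; seg_map: (H, W) with 1 inside the outline, 0 outside.
    """
    return features * seg_map[None, :, :]   # broadcast the mask over all channels

# A 2-channel 3x3 feature map with non-zero values everywhere (noise included).
features = np.arange(18, dtype=np.float32).reshape(2, 3, 3) + 1.0
seg_map = np.zeros((3, 3), dtype=np.float32)
seg_map[1, 1:] = 1.0                         # the text occupies two pixels of the middle row
filtered = filter_with_mask(features, seg_map)
```

Only the values at positions where the segmentation map equals 1 survive, which is how noise outside the text outline is suppressed.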

According to the embodiment of the invention, the text detection module further comprises a feature extraction branch, a regional suggestion network and a candidate box feature extraction network;

before the text image to be detected is processed by the detection branch, the word/character segmentation branch and the global text segmentation branch, inputting the text image to be detected into the text detection module of the end-to-end text recognition model to obtain the word-level features further comprises the following steps:

processing the text image to be detected by utilizing the feature extraction branch to obtain a multi-scale feature map;

processing the multi-scale feature map by using the global text segmentation branch to obtain a global information feature map;

processing the multi-scale feature map by using the regional suggestion network to generate at least one candidate frame;

extracting the features corresponding to the at least one candidate frame in the multi-scale feature map by using the candidate frame feature extraction network to obtain a second feature map;

and based on the second feature map, obtaining a word-level information feature map by using a convolution unit in the word/character segmentation branch.

According to the embodiment of the invention, as shown in fig. 3, the feature extraction branch may include a ResNet50 network and an FPN network, and after the text image to be detected is processed by using the ResNet50 network and the FPN network in the feature extraction branch, a multi-scale feature map F is obtained.

According to an embodiment of the present invention, the multi-scale feature map F is processed by using the regional suggestion network RPN, and the generated candidate frame R may be as shown by the black rectangular frame in the regional suggestion network RPN in fig. 3.

According to the embodiment of the present invention, after processing the multi-scale feature map F by using the regional suggestion network RPN, one, two or more candidate boxes may be generated, and the number of candidate boxes in fig. 3 is merely an example.

According to the embodiment of the invention, assume that the text image contains a text, and that the smallest horizontal rectangle framing the text, determined from the annotation information, is A. Assume further that a candidate box B is generated after the multi-scale feature map F is processed by the regional suggestion network RPN. The candidate box B is retained if it satisfies the following condition: the area of the intersection of A and B, divided by (the area of A + the area of B - the area of the intersection of A and B), is greater than a preset threshold, for example 0.7. Satisfying this condition indicates that the candidate box B preliminarily frames the text, so B can be kept and its position further adjusted subsequently so that it just frames the text. It should be noted that the preset threshold value of 0.7 is only used as an example and does not limit the present invention.
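The retention condition above is the usual intersection-over-union ratio. A minimal sketch follows; the box format `(x1, y1, x2, y2)` and the function names `iou` and `keep_candidate` are assumptions chosen for illustration and are not prescribed by the embodiment.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def keep_candidate(rect_a, candidate_b, threshold=0.7):
    """Retain candidate box B when its IoU with rectangle A exceeds the threshold."""
    return iou(rect_a, candidate_b) > threshold


rect_a = (0, 0, 10, 10)        # smallest horizontal rectangle A
good_box = (1, 0, 10, 10)      # overlaps A almost completely
poor_box = (5, 0, 15, 10)      # overlaps only half of A
```

With these toy boxes, `good_box` has an overlap ratio of 0.9 and is retained, while `poor_box` has a ratio of one third and is discarded under the example threshold of 0.7.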

The candidate boxes retained after this processing have a high confidence. According to an embodiment of the present invention, the confidence may be compared with a preset threshold; for example, a confidence greater than 0.6 may be determined as high.

According to the embodiment of the invention, for each reserved candidate frame, a second feature map F2 is obtained by extracting features corresponding to the candidate frame from the multi-scale feature map F by using the candidate frame feature extraction network RoIAlign.

According to the embodiment of the invention, the second feature map F2 can also be processed by the detection branch for word/character classification and bounding box regression.

According to an embodiment of the present invention, the detection branch may include a 3 × 3 convolution kernel and a 1 × 1 convolution kernel, which are sequentially stacked.

According to the embodiment of the invention, after the second feature map F2 is subjected to convolution processing of 3 × 3 and 1 × 1, the classification and regression feature map is obtained by performing the classification of words/characters and the regression processing of word/character borders through parallel branches.

According to the embodiment of the invention, classification can be used to determine the content framed by the candidate box, namely the character-level label information. If the content framed by the candidate box is a whole text (word level), it is roughly judged to be text and is then input to the text recognition module for recognition. Bounding box regression may be used to predict the coordinates of the top left corner of a candidate box and the width and height of the candidate box, thereby determining the location of the text framed by the candidate box.
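As a small illustration of how the regressed quantities determine a location, the sketch below converts a top-left corner plus width and height into the two opposite corners of the box. The function name `box_corners` and the pixel coordinate convention (origin at the top left, y increasing downwards) are assumptions.

```python
def box_corners(x, y, width, height):
    """Return (x1, y1, x2, y2) from the top-left corner and the box size."""
    if width < 0 or height < 0:
        raise ValueError("width and height must be non-negative")
    return (x, y, x + width, y + height)


# Hypothetical regression output for one candidate box.
corners = box_corners(10, 20, 30, 5)
```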

According to an embodiment of the present invention, obtaining the word-level information feature map by using the convolution unit in the word/character segmentation branch, based on the second feature map F2, may include the following operation.

The second feature map F2 is subjected to 3 × 3 convolution kernel processing by convolution units in a word/character segmentation branch to obtain a word-level information feature map F3.

According to the embodiment of the present invention, after the word-level information feature map F3 is obtained, the word-level information feature map F3 and the global information feature map F4 may be input into the multi-level feature enhancing unit MFE, and the word-level information feature map F3 and the global information feature map F4 are processed by the multi-level feature enhancing unit MFE to obtain the word/character segmentation map.

As shown in fig. 4, the multi-level feature enhancing unit may include a first fusion layer 1, a first convolution layer 2, a second convolution layer 3, a third convolution layer 4, a fourth convolution layer 5, a fifth convolution layer 6, a sixth convolution layer 7, and a second fusion layer 8.

The word-level information feature map F3 and the global information feature map F4 are processed by the multi-level feature enhancement unit, and the word/character segmentation map is obtained by the following operations.

Processing the word-level information feature map F3 and the global information feature map F4 by using the first fusion layer 1 to obtain an initial fusion feature map;

processing the initial fusion feature map by using the first convolution layer 2 to obtain a first fused feature map;

processing the first fused feature map by using the second convolution layer 3 and outputting a second fused feature map;

processing the first fused feature map by using the third convolution layer 4 to obtain a third fused feature map;

processing the third fused feature map by using the fourth convolution layer 5 to obtain a fourth fused feature map;

processing the third fused feature map by using the fifth convolution layer 6 to obtain a fifth fused feature map;

processing the fifth fused feature map by using the sixth convolution layer 7 to obtain a sixth fused feature map; and

processing the second fused feature map, the fourth fused feature map and the sixth fused feature map by using the second fusion layer 8 to obtain the word/character segmentation map.

According to an embodiment of the present invention, the first convolution layer 2 includes a 1 × 1 convolution kernel; the second convolution layer 3 includes a 3 × 3 convolution kernel with a dilation rate (void rate) of 1; the third convolution layer 4 includes a 1 × 1 convolution kernel; the fourth convolution layer 5 includes a 3 × 3 convolution kernel with a dilation rate of 2; the fifth convolution layer 6 includes a 1 × 1 convolution kernel; and the sixth convolution layer 7 includes a 3 × 3 convolution kernel with a dilation rate of 3.

According to an embodiment of the present invention, convolution kernels with dilation rates of 1, 2 and 3 sample the input once every 1, 2 and 3 pixels, respectively.
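The sampling pattern described above can be made concrete with a short sketch. It assumes that the "void rate" is the dilation rate of a standard dilated (atrous) convolution, in which adjacent kernel taps are spaced by the dilation rate; the helper names `tap_offsets` and `effective_size` are hypothetical.

```python
def tap_offsets(kernel_size=3, dilation=1):
    """Offsets, relative to the centre pixel, sampled along one axis."""
    half = kernel_size // 2
    return [k * dilation for k in range(-half, half + 1)]


def effective_size(kernel_size=3, dilation=1):
    """Span in pixels covered by the dilated kernel along one axis."""
    return dilation * (kernel_size - 1) + 1


for rate in (1, 2, 3):
    print(rate, tap_offsets(3, rate), effective_size(3, rate))
```

Under this assumption, the three 3 × 3 kernels cover 3, 5 and 7 pixels per axis while each still uses nine weights, which is how the unit gathers context at several scales.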

According to an embodiment of the present invention, the first fusion layer 1 may include a first element-level addition module.

According to an embodiment of the present invention, the second fusion layer 8 may include a second element-level addition module.
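The data flow through the eight layers can be sketched as follows. This is a simplified illustration rather than the implementation of the embodiment: it assumes single-channel feature maps of equal size, zero padding, a shared 3 × 3 kernel and unit 1 × 1 weights, and all function names are hypothetical.

```python
def add_maps(*maps):
    """Element-level addition of equally shaped 2-D maps."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*maps)]


def conv1x1(fmap, weight=1.0):
    """Single-channel 1x1 convolution: scale every pixel by one weight."""
    return [[weight * v for v in row] for row in fmap]


def conv3x3(fmap, kernel, dilation):
    """Single-channel 3x3 convolution with zero padding and a dilation rate."""
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for ki in range(3):
                for kj in range(3):
                    y = i + (ki - 1) * dilation
                    x = j + (kj - 1) * dilation
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[ki][kj] * fmap[y][x]
            out[i][j] = acc
    return out


def mfe(f3, f4, kernel):
    """Follow the layer order of fig. 4 (layers numbered 1 to 8)."""
    initial = add_maps(f3, f4)              # first fusion layer 1
    first = conv1x1(initial)                # first convolution layer 2
    second = conv3x3(first, kernel, 1)      # second convolution layer 3
    third = conv1x1(first)                  # third convolution layer 4
    fourth = conv3x3(third, kernel, 2)      # fourth convolution layer 5
    fifth = conv1x1(third)                  # fifth convolution layer 6
    sixth = conv3x3(fifth, kernel, 3)       # sixth convolution layer 7
    return add_maps(second, fourth, sixth)  # second fusion layer 8


# Identity kernel (hypothetical weights) so the flow is easy to follow.
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
fused = mfe([[1, 2], [3, 4]], [[1, 1], [1, 1]], identity)
```

With the identity kernel every branch passes its input through unchanged, so the output is three times the sum of the two inputs; with learned kernels the three branches would contribute context gathered at three different dilation rates.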

According to the embodiment of the invention, the multi-level feature enhancement unit MFE is used for fusing multi-level features and enhancing the features so that the text detection module has stronger feature representation capability, and the text detection module can detect the text more accurately.

According to the embodiment of the invention, processing the multi-scale feature map F by using the global text segmentation branch to obtain the global information feature map F4 may include the following operations.

Sequentially performing upsampling and element-level addition processing on the multi-scale feature map F to obtain a feature map F1 comprising multi-scale features and global information; and

extracting the features of the feature map F1 according to the candidate frame by using the candidate frame feature extraction network RoIAlign to obtain a global information feature map F4.
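The upsampling and element-level addition step can be sketched for two scales as follows. The sketch assumes nearest-neighbour upsampling by an integer factor and single-channel maps; the embodiment does not state the interpolation method, and the names `upsample_nearest` and `fuse_scales` are hypothetical.

```python
def upsample_nearest(fmap, factor):
    """Repeat every pixel `factor` times along both axes."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def fuse_scales(fine, coarse, factor):
    """Upsample the coarse map to the fine resolution, then add element-wise."""
    up = upsample_nearest(coarse, factor)
    if len(up) != len(fine) or len(up[0]) != len(fine[0]):
        raise ValueError("shapes do not match after upsampling")
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(fine, up)]


fine_map = [[0, 0, 0, 0], [1, 1, 1, 1]]   # higher-resolution level (toy values)
coarse_map = [[1, 2]]                      # lower-resolution level (toy values)
f1 = fuse_scales(fine_map, coarse_map, 2)
```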

According to an optional embodiment of the invention, the feature map F1 is processed by using a 1 × 1 convolution kernel of the global text segmentation branch and a sigmoid function, so that a text segmentation map is obtained.
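A 1 × 1 convolution followed by a sigmoid amounts to a per-pixel weighted sum over channels that is squashed into the range 0 to 1. The sketch below assumes a height × width × channel nested-list layout and hypothetical weights; the names `sigmoid` and `text_segmentation` are illustrative.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def text_segmentation(fmap, weights, bias=0.0):
    """Apply a 1x1 convolution (one output channel) and a sigmoid per pixel.

    fmap is a nested list indexed as [row][column][channel];
    weights holds one value per input channel.
    """
    return [
        [sigmoid(sum(w * c for w, c in zip(weights, pixel)) + bias) for pixel in row]
        for row in fmap
    ]


# One row, two pixels, two channels (toy values).
toy_f1 = [[[0.0, 0.0], [10.0, 10.0]]]
seg = text_segmentation(toy_f1, weights=[1.0, 1.0])
```

Each output value can be read as the probability that the pixel belongs to text; in this toy case the first pixel is undecided and the second is almost certainly text.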

According to the embodiment of the present invention, processing the global information feature map F4 and the word-level information feature map F3 by using the multi-level feature enhancement unit MFE to obtain the word/character segmentation map may further include the following operations.

Processing the global information feature map F4 and the word-level information feature map F3 by using the multi-level feature enhancement unit MFE to obtain an intermediate feature map F5;

the intermediate feature map F5 is processed using the second convolution unit and the third convolution unit of the word/character segmentation branch to obtain a word/character segmentation map.

According to an embodiment of the present invention, the second convolution unit may include a 3 × 3 convolution kernel; the third convolution unit may include a 63-channel 1 × 1 convolution kernel. According to an embodiment of the present invention, the 63 channels may include 10 channels for segmenting digits, 52 channels for segmenting English characters, and 1 channel for whole-word segmentation.
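The channel count can be checked with a small sketch. The ordering used here (whole-word channel first, then digits, then lower-case and upper-case letters) is purely an assumption for illustration; the embodiment specifies only the number of channels in each group.

```python
import string

WORD_CHANNEL = "<word>"                 # hypothetical name for the whole-word channel
DIGITS = list(string.digits)            # 10 digit channels
LETTERS = list(string.ascii_letters)    # 52 channels: a-z followed by A-Z

# Assumed channel order; only the group sizes come from the description.
CHANNELS = [WORD_CHANNEL] + DIGITS + LETTERS


def channel_to_symbol(index):
    """Map a channel index of the 63-channel output to its symbol."""
    return CHANNELS[index]
```

Under this assumed layout, index 0 is the whole-word map, indices 1 to 10 are the digits, and indices 11 to 62 are the English letters.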

According to the embodiment of the invention, the initial end-to-end text recognition model can first be pre-trained for two rounds on two data sets, the synthetic data set SynthText and ICDAR2013. The pre-trained end-to-end text recognition model is then used to make predictions on the ICDAR2015 and Total-Text data sets, which have only word-level labels, so that character-level labels are obtained. Finally, the obtained character-level labels and the provided character-level labels are used for fine-tuning for 300k iterations on 10000 combined data sets consisting of SynthText, ICDAR2013, ICDAR2015, Total-Text and SCUT, so that a trained end-to-end text recognition model is obtained.

Compared with existing methods, such as the recognition method of Jaderberg et al., the end-to-end text recognition method of the invention achieves the best effect on a plurality of data sets; the recognition effect of the invention on the data set ICDAR2013 is shown in Table 1.

TABLE 1

Here, Detection is the result output by the text detection module, and R, P and F are recall, precision and F-measure, respectively. End-to-End is the end-to-end recognition result, that is, the result output by the end-to-end text recognition model. S, W and G denote the strong, weak and general reference dictionaries, respectively; using a dictionary means that the recognized word must be among the words of the dictionary, for example 100 words. Assuming that a word is world and the recognition result is world, and the strong dictionary contains world, wrold, wolrd, etc., the best match can be found in the dictionary by code, rather than manually, and taken as the final result. FPS is the speed, that is, the number of images that can be processed per second.
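The two quantities mentioned above can be sketched in a few lines. The F-measure is the harmonic mean of precision and recall. The dictionary matching shown here uses the minimum edit distance, which is an assumption: the description states only that the best match is found by code, not which similarity measure is used. All function names are hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def correct_with_lexicon(word, lexicon):
    """Return the dictionary entry closest to the recognized word."""
    return min(lexicon, key=lambda entry: edit_distance(word, entry))


# Toy dictionary (hypothetical entries) and a misrecognized word.
best = correct_with_lexicon("wrold", ["hotel", "world", "street"])
```

In this toy case the misrecognized string is replaced by the nearest dictionary entry, which shows why results obtained with a small dictionary are usually higher than results obtained without one.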

The detection result uses F as a comprehensive index, and as can be seen from Table 1, F of the invention is the highest, namely, the detection effect is better than that of other methods. For end-to-end recognition results, the text recognition method of the present invention works best when using a G dictionary, meaning that it does not rely on using a dictionary to correct the recognition results. The effect of the invention on the identification of the data set ICDAR2013 is shown in figure 5.

For the inclined-text data set ICDAR2015, F of the present invention is only 0.7% lower than the best result, but the end-to-end recognition result is the best with each of the three dictionaries S, W and G, which shows that the invention has good end-to-end text recognition capability on inclined text. The recognition effect of the invention on the data set ICDAR2015 is shown in fig. 6, and the specific recognition results are shown in Table 2.

TABLE 2

For the curved-text data set Total-Text, F of the invention is only 0.3% lower than the best result, but the end-to-end recognition result is the best both when no dictionary is used (None) and when all texts in a test image are used as the dictionary, which shows that the invention has good end-to-end text recognition capability on curved text. The recognition effect of the invention on the data set Total-Text is shown in fig. 7, and the specific recognition results are shown in Table 3:

TABLE 3

Referring to fig. 8, as another aspect of the present invention, there is also provided an apparatus 800 for training an end-to-end text recognition model, including a constructing module 810, an obtaining module 820, a sample generating module 830, and a training module 840.

A constructing module 810, configured to construct an initial end-to-end text recognition model, where the initial end-to-end text recognition model includes an initial text detecting module and an initial text recognition module;

an obtaining module 820, configured to obtain a training sample data set;

the sample generation module 830 is configured to process the training sample data set by using a sample generation algorithm, and generate an augmented training sample data set, so as to increase the number of training samples used for training the initial text recognition module; and

the training module 840 is configured to train an initial end-to-end text recognition model by using the training sample data set and the augmented training sample data set to obtain an end-to-end text recognition model, where the end-to-end text recognition model includes a text detection module and a text recognition module.

As still another aspect of the present invention, an end-to-end text recognition apparatus is further provided, and the end-to-end text recognition apparatus includes a detection module and a recognition module.

The detection module is used for inputting the text image to be detected into the text detection module of the end-to-end text recognition model to obtain word-level features; and

the recognition module is used for inputting the word-level features into the text recognition module of the end-to-end text recognition model to obtain sequence information, and the sequence information is used for representing the content of the text in the text image.

It should be noted that, in the embodiment of the present invention, the training device portion of the end-to-end text recognition model corresponds to the training method portion of the end-to-end text recognition model in the embodiment of the present invention, and the description of the training device portion of the end-to-end text recognition model specifically refers to the training method portion of the end-to-end text recognition model, which is not described herein again.

It should be noted that the end-to-end text recognition device portion in the embodiment of the present invention corresponds to the end-to-end text recognition method portion in the embodiment of the present invention, and the description of the end-to-end text recognition device portion specifically refers to the end-to-end text recognition method portion, which is not described herein again.

The embodiments of the present invention have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the invention is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the invention, and these alternatives and modifications are intended to fall within the scope of the invention.

Claims

1. A training method of an end-to-end text recognition model comprises the following steps:

acquiring a training sample data set;

processing training samples in the training sample data set by using a sample generation algorithm, and generating an augmented training sample data set so as to increase the number of training samples for training the initial text recognition module; and

training the initial end-to-end text recognition model by utilizing the training sample data set and the augmented training sample data set to obtain the end-to-end text recognition model.

2. The method of claim 1, wherein,

the training sample data set comprises a pre-training sample data set, wherein pre-training samples in the pre-training sample data set comprise character-level marking information, and the character-level marking information comprises character-level position information and character-level label information;

processing the training samples in the training sample data set by using a sample generation algorithm, and generating an augmented training sample data set comprises:

processing the character-level marking information in the pre-training samples by using the sample generation algorithm to generate an augmented pre-training sample data set, wherein the augmented pre-training sample data set comprises a plurality of augmented pre-training samples.

3. The method of claim 2, wherein,

the training the initial end-to-end text recognition model by using the training sample data set and the augmented training sample data set to obtain the end-to-end text recognition model comprises:

training the initial text detection module by using the pre-training sample data set to obtain a pre-training text detection module;

4. The method of claim 3, wherein,

the training sample data set further comprises a weak supervision training sample data set, wherein training samples in the weak supervision training sample data set comprise word-level marking information, and the word-level marking information comprises word-level position information and word-level label information;

processing training samples in the weakly supervised training sample data set by using the pre-training text detection module to generate predicted character-level labeling information; the predicted character-level marking information comprises predicted character-level position information and predicted character-level label information;

and processing the predicted character-level labeling information by using the sample generation algorithm to generate an augmented weak supervision training sample data set, wherein the augmented weak supervision training sample data set comprises a plurality of augmented weak supervision training samples, so that the pre-training text recognition module is trained by using the augmented weak supervision training sample data set.

5. An end-to-end text recognition method, wherein the method is implemented based on an end-to-end text recognition model trained by the training method according to any one of claims 1 to 4, the end-to-end text recognition model comprises a text detection module and a text recognition module, and the method comprises the following steps:

inputting a text image to be detected into a text detection module of the end-to-end text recognition model to obtain word-level features;

and inputting the word-level features into a text recognition module of the end-to-end text recognition model to obtain sequence information, wherein the sequence information is used for representing the content of the text in the text image.

6. The method of claim 5, wherein the text detection module comprises a detection branch, a word/character segmentation branch, and a global text segmentation branch; wherein the word/character segmentation branch comprises a convolution unit and a multi-level feature enhancement unit;

the step of inputting the text image to be detected into the text detection module of the end-to-end text recognition model to obtain the word-level features comprises the following steps:

processing the text image to be detected by using the global text segmentation branch to obtain a global information feature map;

processing the text image to be detected by using a convolution unit in the word/character segmentation branch to obtain a word-level information feature map;

processing the global information feature map and the word-level information feature map by utilizing the multi-level feature enhancement unit to obtain a word/character segmentation map;

and fusing the classification and regression feature map and the word/character segmentation map to obtain word-level features.

7. The method of claim 6, wherein the text detection module further comprises a feature extraction branch, a regional suggestion network, and a candidate box feature extraction network;

before the text image to be detected is processed by the detection branch, the word/character segmentation branch and the global text segmentation branch, the step of inputting the text image to be detected into the text detection module of the end-to-end text recognition model to obtain the word-level features further comprises the steps of:

processing the text image to be detected by using the feature extraction branch to obtain a multi-scale feature map;

processing the multi-scale feature map by using the global text segmentation branch to obtain a global information feature map;

processing the multi-scale feature map by using the regional suggestion network to generate at least one candidate box;

extracting the features corresponding to the at least one candidate frame in the multi-scale feature map by using the candidate frame feature extraction network to obtain a second feature map;

and obtaining a word-level information feature map by utilizing a convolution unit in the word/character segmentation branch based on the second feature map.

8. The method of claim 6, wherein the multi-level feature enhancement unit comprises a first fusion layer, a first convolutional layer, a second convolutional layer, a third convolutional layer, a fourth convolutional layer, a fifth convolutional layer, a sixth convolutional layer, and a second fusion layer;

the processing the word-level information feature map and the global information feature map by using the multi-level feature enhancement unit to obtain the word/character segmentation map comprises:

processing the word-level information feature map and the global information feature map by using the first fusion layer to obtain an initial fusion feature map;

processing the initial fusion feature map by using the first convolution layer to obtain a first fusion feature map;

processing the first fused feature map by using the second convolutional layer, and outputting a second fused feature map;

processing the first fused feature map by using the third convolutional layer to obtain a third fused feature map;

processing the third fused feature map by using the fourth convolutional layer to obtain a fourth fused feature map;

processing the third fused feature map by using the fifth convolutional layer to obtain a fifth fused feature map;

processing the fifth fused feature map by using the sixth convolutional layer to obtain a sixth fused feature map; and

processing the second fused feature map, the fourth fused feature map and the sixth fused feature map by using the second fusion layer to obtain the word/character segmentation map.

9. The method of claim 8, wherein:

the first convolution layer includes a 1x1 convolution kernel;

the second convolutional layer comprises a 3x3 convolutional kernel, wherein the dilation rate of the 3x3 convolutional kernel is 1;

the third convolutional layer comprises a 1x1 convolutional kernel;

the fourth convolutional layer comprises a 3x3 convolutional kernel, wherein the dilation rate of the 3x3 convolutional kernel is 2;

the fifth convolutional layer comprises a 1x1 convolutional kernel;

the sixth convolutional layer comprises a 3x3 convolutional kernel, wherein the dilation rate of the 3x3 convolutional kernel is 3.

10. An apparatus for training an end-to-end text recognition model, comprising:

the acquisition module acquires a training sample data set;

and the training module is used for training the initial end-to-end text recognition model by utilizing the training sample data set and the amplification training sample data set to obtain the end-to-end text recognition model, wherein the end-to-end text recognition model comprises a text detection module and a text recognition module.