CN112633422B - Training method of text recognition model, text recognition method, device and equipment


Info

Publication number
CN112633422B
Authority
CN
China
Prior art keywords
neural network, text, model, convolutional neural, image data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110258666.7A
Other languages
Chinese (zh)
Other versions
CN112633422A (en)
Inventor
李自荐
秦勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yizhen Xuesi Education Technology Co Ltd
Original Assignee
Beijing Yizhen Xuesi Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yizhen Xuesi Education Technology Co Ltd
Priority to CN202110258666.7A
Publication of CN112633422A
Application granted
Publication of CN112633422B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G06F18/214 Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Pattern recognition: classification techniques
    • G06N3/044 Neural networks: recurrent networks, e.g. Hopfield networks
    • G06N3/045 Neural networks: combinations of networks
    • G06N3/08 Neural networks: learning methods
    • G06V30/40 Character recognition: document-oriented image-based pattern recognition

Abstract

The invention provides a training method for a text recognition model, together with a text recognition method, a device, and equipment. The training method comprises: constructing an initial model; training the initial model to convergence, with the output of first text image data through the recurrent neural network serving as the character-string input of the word embedding module and a first feature map, obtained after the first text image data passes through the second convolutional neural network of the second part, serving as the other input of the word embedding module; and obtaining a text recognition model based on the converged initial model. The initial model includes a first part for recognizing the text content of an image, the first part having a first convolutional neural network and a recurrent neural network, and a second part for judging whether a given text is in a given image, the second part having a second convolutional neural network and a word embedding module. The device is used for executing the method. The training method yields a text recognition model that recognizes quickly while maintaining high recognition accuracy.

Description

Training method of text recognition model, text recognition method, device and equipment
Technical Field
The invention relates to text recognition technology, and in particular to a training method for a text recognition model, a text recognition method, a device, and equipment.
Background
Text detection and recognition have a wide range of applications and are preliminary steps of many computer vision tasks, such as image search, identity authentication, and visual navigation. The main purpose of text detection is to locate text lines or characters in an image, while text recognition transcribes an image containing a text line into a character string (that is, it recognizes the content of the string). Accurate location and accurate recognition of text are both very important and challenging: compared with general object detection and recognition, characters vary in orientation, have irregular shapes and extreme aspect ratios, and differ widely in font, color, and background, so algorithms that succeed at general object detection and recognition cannot be transferred directly to character detection.
The recognition performance of existing text recognition models and methods is affected by many factors; it is difficult to achieve both high recognition speed and high recognition accuracy, which cannot meet the needs of rapidly developing computer vision tasks.
Disclosure of Invention
In order to solve at least one of the above technical problems, the present invention provides a training method for a text recognition model, a text recognition method, a device, and an apparatus.
The technical scheme of the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides a method for training a text recognition model, including:
constructing an initial model, the initial model comprising:
a first portion for identifying textual content in an image, the first portion having a first convolutional neural network and a recurrent neural network;
a second part for determining whether a given text is in a given image, the second part having a second convolutional neural network and a word embedding module;
training the initial model to obtain a converged initial model, with the output of the first text image data through the recurrent neural network serving as the character-string input of the word embedding module and a first feature map, obtained after the first text image data passes through the second convolutional neural network of the second part, serving as the other input of the word embedding module;
based on the converged initial model, a text recognition model is obtained.
In one embodiment, training the initial model to obtain a converged initial model, with the output of the first text image data through the recurrent neural network as the character-string input of the word embedding module and the first feature map obtained after the first text image data passes through the second convolutional neural network of the second part as the other input of the word embedding module, includes:
inputting the first text image data into the first convolutional neural network and the recurrent neural network of the first part to obtain a character coding matrix;
inputting the first text image data into the second convolutional neural network of the second part to obtain a first feature map;
and inputting the character coding matrix and the first feature map into the word embedding module of the second part, and training the initial model according to a first loss function to obtain a converged initial model.
In one embodiment, the parameters of the second convolutional neural network are parameters obtained based on the parameters of the first convolutional neural network.
In one embodiment, the first part is a model obtained by training as follows:
training a pre-constructed first model according to a second loss function, based on second text image data and text annotation information of the second text image data, to obtain a recognition model capable of recognizing text content in an image, which serves as the first part;
the pre-constructed first model is a model constructed based on the first convolutional neural network and the recurrent neural network, and the output of the first convolutional neural network is the input of the recurrent neural network.
In one embodiment, the second part is a model obtained by training as follows:
training a pre-constructed second model according to a first loss function based on third text image data and a word vector corresponding to text labeling information of the third text image data, and obtaining a model capable of judging whether a given text is in a given image as a second part;
the pre-constructed second model is a model constructed based on a second convolutional neural network and the word embedding module, and the initial parameters of the second convolutional neural network are the parameters of the first convolutional neural network.
In one embodiment, training a pre-constructed second model according to a first loss function based on third text image data and a word vector corresponding to text annotation information of the third text image data includes:
inputting the third text image data into the second convolutional neural network to obtain a second feature map;
inputting the word vector as the character-string input of the word embedding module and the second feature map as the other input of the word embedding module, and training the pre-constructed second model based on the first loss function, where the first loss function is a binary (two-class) cross-entropy loss function.
In one embodiment, the word embedding module includes an encoder-decoder based on an attention mechanism.
In one embodiment, the first convolutional neural network comprises four Blocks arranged in sequence;
the outputs of the four Blocks are respectively down-sampled or up-sampled so that they share the same first size;
and the output of the first convolutional neural network is formed by adding the outputs of the four Blocks element-wise at corresponding positions.
In one embodiment, the second convolutional neural network comprises four Blocks arranged in sequence;
the outputs of the four Blocks are respectively down-sampled or up-sampled so that they share the same second size;
and the output of the second convolutional neural network is formed by adding the outputs of the four Blocks element-wise at corresponding positions.
In one embodiment, the obtaining a text recognition model based on the converged initial model comprises:
building the text recognition model based on the first part of the converged initial model; or,
taking the converged initial model as the text recognition model.
In a second aspect, an embodiment of the present invention provides a text recognition method, including:
inputting an image to be recognized into a pre-obtained text recognition model for text recognition, and outputting a text recognition result of the image to be recognized;
wherein the pre-obtained text recognition model comprises: a text recognition model obtained based on the method of any one of the embodiments of the first aspect.
In a third aspect, an embodiment of the present invention provides a training apparatus for a text recognition model, including:
the initial model building module is used for building an initial model;
the initial model training module is used for training the initial model to obtain a converged initial model, with the output of the first text image data through the recurrent neural network as the character-string input of the word embedding module and a first feature map, obtained after the first text image data passes through the second convolutional neural network of the second part, as the other input of the word embedding module;
a text recognition model obtaining module, configured to obtain a text recognition model based on the converged initial model;
wherein the initial model comprises:
a first portion for identifying textual content in an image, the first portion having a first convolutional neural network and a recurrent neural network;
a second portion for determining whether a given text is in a given image, the second portion having a second convolutional neural network and a word embedding module.
In one embodiment, the initial model training module is specifically configured to:
inputting the first text image data into the first convolutional neural network and the recurrent neural network of the first part to obtain a character coding matrix;
inputting the first text image data into the second convolutional neural network of the second part to obtain a first feature map;
and inputting the character coding matrix and the first feature map into the word embedding module of the second part, and training the initial model according to a first loss function to obtain a converged initial model.
In one embodiment, the parameters of the second convolutional neural network are parameters obtained based on the parameters of the first convolutional neural network.
In one embodiment, the initial model building module comprises: a first part training module, used for training a pre-constructed first model according to a second loss function, based on second text image data and text annotation information of the second text image data, to obtain a recognition model capable of recognizing text content in an image, which serves as the first part;
the pre-constructed first model is a model constructed based on the first convolutional neural network and the recurrent neural network, and the output of the first convolutional neural network is the input of the recurrent neural network.
In one embodiment, the initial model building module comprises:
the second part training module is used for training a pre-constructed second model according to a first loss function based on third text image data and word vectors corresponding to text labeling information of the third text image data, and obtaining a model capable of judging whether a given text is in a given image as a second part;
the pre-constructed second model is a model constructed based on a second convolutional neural network and the word embedding module, and the initial parameters of the second convolutional neural network are the parameters of the first convolutional neural network.
In one embodiment, training a pre-constructed second model according to a first loss function based on third text image data and a word vector corresponding to text annotation information of the third text image data includes:
inputting the third text image data into the second convolutional neural network to obtain a second feature map;
the word vector is used as the character-string input of the word embedding module and the second feature map as the other input of the word embedding module, and the pre-constructed second model is trained based on the first loss function, where the first loss function is a binary (two-class) cross-entropy loss function.
In one embodiment, the word embedding module includes an encoder-decoder based on an attention mechanism.
In one embodiment, the first convolutional neural network comprises four Blocks arranged in sequence;
the outputs of the four Blocks are respectively down-sampled or up-sampled so that they share the same first size;
and the output of the first convolutional neural network is formed by adding the outputs of the four Blocks element-wise at corresponding positions.
In one embodiment, the second convolutional neural network comprises four Blocks arranged in sequence;
the outputs of the four Blocks are respectively down-sampled or up-sampled so that they share the same second size;
and the output of the second convolutional neural network is formed by adding the outputs of the four Blocks element-wise at corresponding positions.
In an embodiment, the text recognition model obtaining module is specifically configured to:
building the text recognition model based on the first part of the converged initial model; or,
taking the converged initial model as the text recognition model.
In a fourth aspect, an embodiment of the present invention provides a text recognition apparatus, including:
the text recognition module is used for inputting the image to be recognized into a text recognition model obtained in advance for text recognition and outputting a text recognition result of the image to be recognized;
wherein the pre-obtained text recognition model comprises: a text recognition model obtained based on the method of any one of the embodiments of the above aspects.
In a fifth aspect, an embodiment of the present invention provides a readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the method in any one of the above-mentioned aspects.
In a sixth aspect, an embodiment of the present invention provides an electronic device, including: a processor and a memory, the memory having stored therein instructions that are loaded and executed by the processor to implement the method of any of the above aspects.
The advantages or beneficial effects of the above technical solution include at least the following:
The technical solution of the invention constructs an initial model from a first part capable of recognizing the text content of an image and a second part capable of judging whether a given text is in a given image, and takes the output of the recurrent neural network of the first part as an input of the word embedding module of the second part, thereby combining the first part and the second part; training this initial model achieves joint training. After training, a text recognition model is obtained based on the converged initial model, so the text recognition model indirectly gains the ability to model the relationships between characters, improving recognition speed while achieving higher recognition accuracy.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the invention and together with the description serve to explain the principles of the invention.
FIG. 1 is a flow chart of a method of training a text recognition model according to an embodiment of the present invention;
FIG. 2 is a flow chart of a training method of a first part of an embodiment of the present invention;
FIG. 3 is a flow chart of a training method of a second part of the embodiment of the present invention;
FIG. 4 is a flow chart of a method of training an initial model in an embodiment of the invention;
FIG. 5 is a logic diagram of an apparatus for training a text recognition model according to an embodiment of the present invention;
FIG. 6 is a flow chart of a text recognition method of an embodiment of the present invention;
FIG. 7 is a logic diagram of a text recognition apparatus according to an embodiment of the present invention;
fig. 8 is a schematic diagram of an electronic device of an embodiment of the invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not to be construed as limiting the invention. It should also be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
In addition, the embodiments of the present invention and the features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
It should be noted that the step numbers in the text are only for convenience of describing the specific embodiments and do not limit the execution order of the steps. The method provided by the embodiments can be executed by a relevant server; the following description takes an electronic device such as a server or a computer as the execution subject.
Example one
Referring to fig. 1, an embodiment of the present invention provides a method for training a text recognition model, including:
step S1: constructing an initial model, wherein the initial model is composed of basic parts such as a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN) and a word embedding module, the Convolutional Neural Network comprises a first Convolutional Neural Network and a second Convolutional Neural Network, the initial model comprises a first part and a second part, and the first part and the second part are respectively composed of a plurality of basic parts, wherein:
a first portion for identifying text content in an image, the first portion having a first convolutional neural network and a recurrent neural network;
a second part for determining whether a given text is in a given image, the second part having a word embedding module and a second convolutional neural network, where the second convolutional neural network is obtained based on the parameters of the first convolutional neural network;
Step S2: training the initial model to obtain a converged initial model, with the output of the first text image data through the recurrent neural network as the character-string input of the word embedding module and the feature map obtained after the first text image data passes through the second convolutional neural network of the second part as the other input of the word embedding module;
step S3: based on the converged initial model, a text recognition model is obtained.
According to the embodiment of the invention, an initial model is constructed from a first part capable of recognizing the text content of an image and a second part capable of judging whether a given text is in a given image, and the output of the recurrent neural network of the first part serves as an input of the word embedding module of the second part, combining the two parts. The initial model gains, through the second part, the ability to judge whether a given text is in a given image; through joint training the first part indirectly acquires this judgment ability as well, and with it the ability to model the relationships between characters, so recognition speed is improved while higher recognition accuracy is achieved.
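For concreteness, the two-part layout described above can be sketched in PyTorch as follows. This is a minimal illustration under assumed shapes and interfaces: the values of L and N, the `cnn1`/`cnn2` backbones, and the `word_embed` call signature are all assumptions for illustration, not the patented architecture itself.

```python
import torch
import torch.nn as nn

L, N = 32, 4000  # assumed: L = longest recognizable string, N = dictionary size

class InitialModel(nn.Module):
    """Sketch of the two-part initial model: the first part (cnn1 + rnn)
    recognizes text content; the second part (cnn2 + word_embed) judges
    whether a given text is in a given image."""
    def __init__(self, cnn1, cnn2, word_embed):
        super().__init__()
        self.cnn1 = cnn1  # first CNN; assumed to emit the L x N matrix described below
        self.rnn = nn.LSTM(N, N // 2, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.cnn2 = cnn2              # second CNN, initialized from cnn1's parameters
        self.word_embed = word_embed  # attention-based encoder-decoder

    def forward(self, image):
        q, _ = self.rnn(self.cnn1(image))  # character coding matrix (B, L, N)
        v = self.cnn2(image)               # feature map from the second part
        return self.word_embed(q, v)       # second part's output/judgment
```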
Fig. 2 illustrates a flowchart of a method for obtaining the first part in an embodiment of the present invention; the first part can be obtained as follows:
Step S11: construct a first model in advance based on the first convolutional neural network and the recurrent neural network, where the output of the first convolutional neural network is the input of the recurrent neural network;
Step S12: train the pre-constructed first model according to a second loss function, based on second text image data and the text annotation information of the second text image data, to obtain a recognition model capable of recognizing text content in an image.
In this embodiment, the second loss function is a Connectionist Temporal Classification (CTC) loss function, enabling supervised training of the first part based on the CTC loss; with a CTC loss function, the structure of the first part of this embodiment resembles the CRNN (Convolutional Recurrent Neural Network) structure.
A CRNN consists, from bottom to top, of a convolutional neural network, a recurrent neural network, and a transcription layer: the convolutional network extracts features from a picture containing characters, the recurrent network performs sequence prediction on those features, and the transcription layer translates the sequence produced by the recurrent network into a character sequence, with the CTC loss function as the training objective. One advantage of CRNN is that it can be trained end to end despite combining different types of network architectures; accordingly, the first part of this embodiment can also be trained end to end.
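As a hedged sketch of this CTC-supervised, end-to-end training step (the tensor shapes, the blank index, and the optimizer are assumptions, not fixed by the patent):

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)  # assumed: blank symbol at index 0

def train_first_part_step(log_probs, targets, input_lens, target_lens, optimizer):
    """log_probs: (L, B, N) log-softmax outputs of the recurrent network;
    targets: concatenated label indices for the batch."""
    loss = ctc_loss(log_probs, targets, input_lens, target_lens)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```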
As an embodiment, a two-layer bidirectional LSTM network is used to construct the recurrent neural network, and its input is an L × N matrix, where L represents the number of characters of the longest string that can be recognized and N represents the size of the dictionary; the output of the convolutional neural network is likewise an L × N matrix. Unlike conventional models, the first convolutional neural network in this embodiment comprises four Blocks arranged in sequence.
In this embodiment, a first text image is input into the first convolutional neural network, and each Block performs feature extraction to produce a corresponding feature map. The feature maps of the four Blocks are then down-sampled or up-sampled so that the processed feature maps share the same first size, which makes the output of the first convolutional neural network match the input length expected by the recurrent neural network. Specifically, the feature maps of the first three of the four Blocks are down-sampled, and the first size is 1/32 of the original input image. The outputs of the four Blocks are then added element-wise at corresponding positions, so that the output of the first convolutional neural network forms an L × N matrix serving as the input of the recurrent neural network.
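A minimal sketch of this four-Block fusion, assuming `F.interpolate` for the down-/up-sampling and leaving the Block definitions abstract; the same pattern applies to the second convolutional neural network described further below, with a different target size:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FourBlockCNN(nn.Module):
    """Four Blocks in sequence; every Block output is resampled to one
    common size and the four maps are added element-wise, position by
    position. Assumed: all Blocks share one channel count so the maps
    can be summed; the head reshaping the sum to an L x N matrix is omitted."""
    def __init__(self, blocks, target_size):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)  # assumed: 4 convolutional stages
        self.target_size = target_size       # e.g. 1/32 of the input for cnn1

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)  # feature extraction per Block
            feats.append(F.interpolate(x, size=self.target_size,
                                       mode='bilinear', align_corners=False))
        return torch.stack(feats, dim=0).sum(dim=0)  # point-by-point addition
```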
Fig. 3 is a flowchart of a method for obtaining the second part in an embodiment of the present invention; the second part can be obtained as follows:
s13: pre-constructing a second model based on a second convolutional neural network and a word embedding module; wherein the output of the second convolutional neural network is the other input of the word embedding module;
s14: and training a pre-constructed second model according to the first loss function based on the third text image data and the word vectors corresponding to the text labeling information of the third text image data, so as to obtain a model capable of judging whether the given text is in the given image.
In this embodiment, the word embedding module of the second part includes an encoder-decoder based on an attention (Attention) mechanism. The decoder can be regarded as a simple single-layer RNN, and the attention component can be regarded as sandwiched between two RNNs: the RNN before attention is the recurrent neural network of the first part, and the RNN after attention is the decoder. The attention component involves q, x, and v, where v is the set of the vectors x and q is the query vector.
In this embodiment, when the second part is trained separately, the word vector obtained by encoding the text information of a text image serves as the query vector q, and the matrix output by the second convolutional neural network serves as v, i.e., the other input of the word embedding module. When the first and second parts are trained in fusion, i.e., when the initial model is trained, the output of the recurrent neural network of the first part serves as the character-string input of the word embedding module, i.e., the query vector q. The similarity between q and v is computed to obtain a weight for each x in v; the weighted result serves as the input of the decoder and is decoded, and a loss value is obtained according to the first loss function in order to train the initial model.
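The similarity-and-weighting step can be sketched as scaled dot-product attention; the patent does not fix the similarity function, so the dot product here is an assumption:

```python
import math
import torch
import torch.nn.functional as F

def attention(q, v):
    """q: query vectors (B, L, D); v: the set of x vectors (B, T, D).
    The similarity between q and v yields a weight for each x in v;
    the weighted sum is what the decoder receives as input."""
    scores = torch.bmm(q, v.transpose(1, 2)) / math.sqrt(q.size(-1))
    weights = F.softmax(scores, dim=-1)  # one weight per x in v
    return torch.bmm(weights, v)         # decoder input (B, L, D)
```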
The text information is encoded with either a smoothed One-Hot scheme or Word2vec, so that each character in the text information is encoded as a vector of a specified length and one character string becomes a fixed-size matrix. Compared with ordinary One-Hot encoding, the smoothed scheme places 0.9 at the position of the corresponding element and 0.1 at all other positions.
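A sketch of the smoothed One-Hot encoding just described; the 0.9/0.1 values come from the description, while the dictionary indexing convention is an assumption:

```python
import torch

def smooth_one_hot(char_indices, num_classes, on_value=0.9, off_value=0.1):
    """Encode each character as a vector of a specified length, so one
    character string becomes a fixed-size matrix: the corresponding
    element is 0.9 and every other element is 0.1."""
    idx = torch.as_tensor(char_indices)
    vecs = torch.full((idx.numel(), num_classes), off_value)
    vecs[torch.arange(idx.numel()), idx] = on_value
    return vecs  # (string_length, num_classes)
```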
In this embodiment of the invention, the second part is trained in advance, before the initial model is trained. When the second part is trained separately, its parameters (including those of the second convolutional neural network and the word embedding module) are adjusted according to the text image data and the word vectors corresponding to the annotation information of that data, yielding the trained second part. That is, in step S14, training the pre-constructed second model according to the first loss function, based on the third text image data and the word vectors corresponding to its text annotation information, includes:
inputting the third text image data into the second convolutional neural network to obtain a second feature map;
inputting the word vector as the character-string input of the word embedding module and the second feature map as the other input, and training the pre-constructed second model based on the first loss function, with the output of the word embedding module connected to a Softmax function module; the first loss function is a binary (two-class) cross-entropy loss function.
Ideally, the predicted value and the true value of the word embedding module agree. The embodiment of the invention supervises the training of the second part so that it approaches this ideal state as closely as possible, with the Softmax function that follows the module used in obtaining the loss value from the binary cross-entropy loss.
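A hedged sketch of this supervision step, combining a two-class head, Softmax, and the binary cross-entropy loss; the head shape and the 0/1 label convention ("is this text in this image?") are assumptions:

```python
import torch
import torch.nn as nn

# CrossEntropyLoss over two classes combines the Softmax stage with the
# two-class cross-entropy loss used to supervise the second part.
criterion = nn.CrossEntropyLoss()

def train_second_part_step(second_part, image, word_vectors, label, optimizer):
    feat_map = second_part.cnn2(image)                       # second feature map
    logits = second_part.word_embed(word_vectors, feat_map)  # assumed shape (B, 2)
    loss = criterion(logits, label)  # label: 1 = text is in image, 0 = it is not
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```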
As an embodiment, the second convolutional neural network comprises four Blocks arranged in sequence.
In this embodiment, a third text image is input into the second convolutional neural network and each Block performs feature extraction to produce a feature map. The feature maps of the four Blocks are then down-sampled or up-sampled so that the processed maps share the same second size, which makes the output of the second convolutional neural network match the input length expected by the word embedding module. Specifically, the feature maps of the first two of the four Blocks are down-sampled and the feature map of the last Block is up-sampled, so the second size is 1/16 of the original input image. The outputs of the four Blocks are then added element-wise at corresponding positions, so that the output of the second convolutional neural network forms an L × N matrix, giving the feature map that serves as the input of the word embedding module.
Fig. 4 is a flowchart of the training method of the initial model in an embodiment of the present invention. In one embodiment, step S2 (training the initial model to obtain a converged initial model, with the output of the first text image data through the recurrent neural network as the character-string input of the word embedding module and the first feature map obtained after the first text image data passes through the second convolutional neural network of the second part as the other input of the word embedding module) includes the following steps:
S21: inputting the first text image data into the first convolutional neural network and the recurrent neural network of the first part to obtain a character coding matrix;
S22: inputting the first text image data into the second convolutional neural network of the second part to obtain a first feature map;
S23: inputting the character coding matrix and the first feature map into the word embedding module of the second part, obtaining a loss value according to the first loss function, and adjusting the parameters of the first part and the second part according to the loss value to obtain a converged initial model.
In one embodiment, after the training of the second part is completed and before step S2 is performed, the parameters of the recurrent neural network are fixed, so that when the initial model is trained, the parameters of the first and second convolutional neural networks are fine-tuned while the parameters of the recurrent neural network remain unchanged. In addition, when the initial model is trained, the CTC loss function of the first part can be masked or removed, and the loss value for supervised training is obtained from the first loss function of the second part, so that the first and second parts are trained in fusion.
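Continuing the earlier sketches, freezing the recurrent network and training only under the second part's loss might look like this; the attribute names follow the `InitialModel` sketch above and, together with the learning rate, remain assumptions:

```python
import torch

def prepare_fusion_training(model, lr=1e-4):
    """Fix the recurrent network's parameters so that fusion training
    fine-tunes only the two convolutional networks (and the word
    embedding module); the CTC loss is simply not computed."""
    for p in model.rnn.parameters():
        p.requires_grad = False
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)

def fusion_step(model, criterion, image, label, optimizer):
    logits = model(image)            # q from part 1, v from part 2, fused
    loss = criterion(logits, label)  # only the first loss supervises here
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```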
In this embodiment, the character coding matrix obtained from the first part and the first feature map obtained from the second part serve as the two inputs of the word embedding module of the second part, and the whole initial model is trained under the supervision of the first loss function. This achieves the goal of fusion training and finally yields a converged initial model, so that the resulting model recognizes both faster and more accurately.
As an optional implementation of this embodiment, obtaining the text recognition model based on the converged initial model includes constructing the text recognition model from the first part of the converged initial model. After fusion training, the first part can indirectly model the relationships between characters, so it can output recognition results of higher accuracy; and because the second part is no longer needed for discrimination, recognition speed is improved.
As can be seen from the above, the training method of this embodiment forms an initial model from a first part capable of recognizing the text content of an image and a second part capable of judging whether a given text is in a given image. The parameters of the first convolutional neural network of the first part serve as the initial parameters of the second convolutional neural network of the second part, so the two convolutional networks share parameters, and the output of the recurrent neural network of the first part serves as an input of the word embedding module of the second part, fusing the two parts. This initial model is trained with the whole model (both parts) supervised by the same loss function, finally achieving fusion training: the second part judges whether the text given by the first part is in the given image, i.e., whether the recognition result given by the first part is correct, and thus plays a discriminative role. After training, a text recognition model is constructed based on the first part, so the text recognition model indirectly models the relationships between characters, improving recognition speed while achieving higher recognition accuracy.
As another optional implementation of this embodiment, obtaining the text recognition model based on the converged initial model includes taking the converged initial model as the text recognition model, i.e., using the whole initial model for recognition. During text recognition, the output of the recurrent neural network of the first part (an L × N matrix) serves as one input of the word embedding module, and the feature map from the second convolutional neural network of the second part serves as the other input. The word embedding module outputs another L × N matrix; compared with the matrix output by the recurrent neural network, the matrix output by the word embedding module carries better estimates of the probability that each text belongs to the image. Finally, the output of the word embedding module is decoded and the entry with the maximum probability is selected as the final text content, further improving recognition accuracy.
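Decoding the word embedding module's L × N output then reduces to picking the maximum-probability dictionary entry at each of the L positions (a greedy sketch; `idx_to_char` is an assumed index-to-character table):

```python
import torch

def decode(prob_matrix, idx_to_char):
    """prob_matrix: (L, N) output of the word embedding module.
    Select the highest-probability entry per position as the final text."""
    indices = prob_matrix.argmax(dim=-1)
    return ''.join(idx_to_char[int(i)] for i in indices)
```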
Example two
Referring to fig. 5, an embodiment of the present invention provides a training apparatus for a text recognition model, including:
the initial model building module is used for building an initial model;
the initial model training module is used for training the initial model to obtain a converged initial model, with the output of the first text image data through the recurrent neural network as the character-string input of the word embedding module and a first feature map, obtained after the first text image data passes through the second convolutional neural network of the second part, as the other input of the word embedding module;
a text recognition model obtaining module, configured to obtain a text recognition model based on the converged initial model;
wherein the initial model comprises:
a first portion for identifying textual content in an image, the first portion having a first convolutional neural network and a recurrent neural network;
a second portion for determining whether a given text is in a given image, the second portion having a second convolutional neural network and a word embedding module.
In one embodiment, the initial model training module is specifically configured to:
inputting the first text image data into the first convolutional neural network and the recurrent neural network of the first part to obtain a character coding matrix;
inputting the first text image data into the second convolutional neural network of the second part to obtain a first feature map;
and inputting the character coding matrix and the first feature map into the word embedding module of the second part, and training the initial model according to a first loss function to obtain a converged initial model.
As an optional implementation manner of this embodiment, the parameter of the second convolutional neural network is a parameter obtained based on the parameter of the first convolutional neural network.
As an optional implementation manner of this embodiment, the initial model building module includes: the first part training module is used for training a pre-constructed first model according to a second loss function based on second text image data and text labeling information of the second text image data to obtain an identification model capable of identifying text contents in an image as a first part;
the pre-constructed first model is a model constructed based on the first convolutional neural network and the recurrent neural network, and the output of the first convolutional neural network is the input of the recurrent neural network.
As an optional implementation manner of this embodiment, the initial model building module includes:
the second part training module is used for training a pre-constructed second model according to a first loss function based on third text image data and word vectors corresponding to text labeling information of the third text image data, and obtaining a model capable of judging whether a given text is in a given image as a second part;
the pre-constructed second model is a model constructed based on a second convolutional neural network and the word embedding module, and the initial parameters of the second convolutional neural network are the parameters of the first convolutional neural network.
As an optional implementation manner of this embodiment, training a pre-constructed second model according to a first loss function based on third text image data and a word vector corresponding to text annotation information of the third text image data includes:
inputting the third text image data into the second convolutional neural network to obtain a second feature map;
the word vector is used as the character-string input of the word embedding module and the second feature map as the other input of the word embedding module, and the pre-constructed second model is trained based on the first loss function, where the first loss function is a binary (two-class) cross-entropy loss function.
As an optional implementation of this embodiment, the word embedding module includes an encoder-decoder based on an attention mechanism.
As an optional implementation of this embodiment, the first convolutional neural network comprises four Blocks arranged in sequence;
the outputs of the four Blocks are respectively down-sampled or up-sampled so that they share the same first size;
and the output of the first convolutional neural network is formed by adding the outputs of the four Blocks element-wise at corresponding positions.
As an optional implementation of this embodiment, the second convolutional neural network comprises four Blocks arranged in sequence;
the outputs of the four Blocks are respectively down-sampled or up-sampled so that they share the same second size;
and the output of the second convolutional neural network is formed by adding the outputs of the four Blocks element-wise at corresponding positions.
As an optional implementation manner of this embodiment, the text recognition model obtaining module is specifically configured to:
building the text recognition model based on the first part of the converged initial model; or,
taking the converged initial model as the text recognition model.
The principle and function of each module in the device of the present embodiment are the same as those in the first embodiment, and the description of the present embodiment is not repeated.
EXAMPLE III
Referring to fig. 6, an embodiment of the present invention provides a text recognition method, including:
inputting the image to be recognized into a pre-obtained text recognition model for text recognition, and outputting a text recognition result of the image to be recognized;
wherein the pre-obtained text recognition model comprises: a text recognition model obtained based on the method of any one of the preceding embodiments.
This embodiment uses the text recognition model obtained by the method of any of the foregoing embodiments; taking a text image to be recognized as input, the model can accurately recognize the text information in the image.
Example four
Referring to fig. 7, an embodiment of the present invention provides a text recognition apparatus, including:
the text recognition module is used for inputting the image to be recognized into a text recognition model obtained in advance for text recognition and outputting a text recognition result of the image to be recognized;
wherein the pre-obtained text recognition model comprises: a text recognition model obtained based on the method of any one of the preceding embodiments.
The principle and function of each module in the apparatus of the present embodiment are the same as those in the foregoing embodiments, and the description of the present embodiment is not repeated.
EXAMPLE five
The present embodiment provides a readable storage medium, in which a computer program is stored, and when being executed by a processor, the computer program implements the method in any one of the above embodiments. The storage medium may include: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
EXAMPLE six
Referring to fig. 8, the present embodiment provides an electronic apparatus including: a processor and a memory, the memory storing instructions therein, the instructions being loaded and executed by the processor to implement the method of any of the above embodiments.
It should be understood that the processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, and so on. A general-purpose processor may be a microprocessor or any conventional processor. Note that the processor may be a processor supporting the Advanced RISC Machine (ARM) architecture.
The memory may include read-only memory and random access memory, and may also include non-volatile random access memory; it may be volatile, non-volatile, or both. Non-volatile memory may include Read-Only Memory (ROM), Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), or flash memory. Volatile memory may include Random Access Memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. All or part of the steps of the above method embodiments may be implemented by hardware instructed by a program; the program may be stored in a computer-readable storage medium and, when executed, performs one of, or a combination of, the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module may also be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps in the process. And the scope of the preferred embodiments of the present invention includes additional implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved.
The logic and/or steps represented in the flowcharts or otherwise described herein, which may be regarded as a sequenced list of executable instructions for implementing logical functions, may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device, such as a computer-based system, a system including a processor, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them.
Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
It will be understood by those skilled in the art that the foregoing embodiments are merely for clarity of description and are not intended to limit the scope of the invention. Other variations or modifications will occur to those skilled in the art based on the foregoing disclosure and are within the scope of the invention.

Claims (24)

1. A training method for a text recognition model, characterized by comprising:
constructing an initial model, the initial model comprising:
a first portion for identifying textual content in an image, the first portion having a first convolutional neural network and a recurrent neural network;
a second part for determining whether a given text is in a given image, the second part having a second convolutional neural network and a word embedding module;
training the initial model to obtain a converged initial model, with the output of the first text image data through the first convolutional neural network and the recurrent neural network serving as the character-string input of the word embedding module and a first feature map, obtained after the first text image data passes through the second convolutional neural network of the second part, serving as the other input of the word embedding module;
based on the converged initial model, a text recognition model is obtained.
2. The method of claim 1,
the method for acquiring the character string of the word embedding module by using the output of the first text image data through the first convolutional neural network and the recurrent neural network as the character string input of the word embedding module, and training the initial model to acquire a converged initial model based on a first feature map acquired by the first text image data through a second convolutional neural network of a second part as the other input of the word embedding module, includes:
inputting the first text image data into a first convolution neural network and a first circulation neural network of a first part to obtain a character coding matrix;
inputting the first text image data into a second convolution neural network of a second part to obtain a first feature map;
and inputting the character coding matrix and the first characteristic diagram into a word embedding module of a second part, and training the initial model according to a first loss function to obtain a converged initial model.
3. The method of claim 1,
the parameters of the second convolutional neural network are parameters obtained based on the parameters of the first convolutional neural network.
4. The method of claim 1,
the first part is a model obtained by training as follows:
training a pre-constructed first model according to a second loss function, based on second text image data and text annotation information of the second text image data, to obtain a recognition model capable of recognizing text content in an image, which serves as the first part;
the pre-constructed first model is a model constructed based on the first convolutional neural network and the recurrent neural network, and the output of the first convolutional neural network is the input of the recurrent neural network.
5. The method of claim 1,
the second part is a model obtained by training in the following way:
training a pre-constructed second model according to a first loss function based on third text image data and a word vector corresponding to text labeling information of the third text image data, and obtaining a model capable of judging whether a given text is in a given image as a second part;
the pre-constructed second model is a model constructed based on a second convolutional neural network and the word embedding module, and the initial parameters of the second convolutional neural network are the parameters of the first convolutional neural network.
6. The method of claim 5,
training a pre-constructed second model according to a first loss function based on third text image data and a word vector corresponding to text annotation information of the third text image data, including:
inputting the third text image data into the second convolutional neural network to obtain a second feature map;
the word vector is used as the character-string input of the word embedding module and the second feature map as the other input of the word embedding module, and the pre-constructed second model is trained based on the first loss function, wherein the first loss function is a binary (two-class) cross-entropy loss function.
7. The method of claim 1, wherein the word embedding module comprises an encoder-decoder based on an attention mechanism.
8. The method of claim 1,
the first convolutional neural network comprises four Blocks arranged in sequence;
the outputs of the four Blocks are respectively down-sampled or up-sampled so that they share the same first size;
and the output of the first convolutional neural network is formed by adding the outputs of the four Blocks element-wise at corresponding positions.
9. The method of claim 1,
the second convolutional neural network comprises four Blocks arranged in sequence;
the outputs of the four Blocks are respectively down-sampled or up-sampled so that they share the same second size;
and the output of the second convolutional neural network is formed by adding the outputs of the four Blocks element-wise at corresponding positions.
10. The method of claim 1, wherein obtaining a text recognition model based on the converged initial model comprises:
building the text recognition model based on the first part of the converged initial model; or, alternatively,
taking the converged initial model as the text recognition model.
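With hypothetical attribute names, the two options in this claim amount to the following sketch:

```python
# Option 1: deploy only the recognition branch of the converged initial model.
text_recognition_model = converged_initial_model.first_part
# Option 2: deploy the whole converged initial model, verification branch included.
# text_recognition_model = converged_initial_model
```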
11. A text recognition method, comprising:
inputting an image to be recognized into a pre-obtained text recognition model for text recognition, and outputting a text recognition result of the image to be recognized;
wherein the pre-obtained text recognition model comprises a text recognition model obtained by the method of any one of claims 1 to 10.
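A usage sketch of this recognition method, with assumed names and a greedy decoding step (the real decoding depends on how the character encoding matrix is defined):

```python
import torch

with torch.no_grad():                        # inference only, no gradients needed
    logits = text_recognition_model(image)   # assumed (1, seq_len, num_classes)
    char_ids = logits.argmax(dim=-1)         # greedy per-position character indices
```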
12. An apparatus for training a text recognition model, comprising:
an initial model building module, configured to build an initial model;
an initial model training module, configured to take the output of first text image data, after passing through the first convolutional neural network and the recurrent neural network, as the character-string input of the word embedding module, take the first feature map, obtained after the first text image data passes through the second convolutional neural network of the second part, as the other input of the word embedding module, and train the initial model to obtain a converged initial model;
a text recognition model obtaining module, configured to obtain a text recognition model based on the converged initial model;
wherein the initial model comprises:
a first part for recognizing the text content in an image, the first part having a first convolutional neural network and a recurrent neural network;
a second part for determining whether a given text appears in a given image, the second part having a second convolutional neural network and a word embedding module.
13. The apparatus of claim 12, wherein the initial model training module is specifically configured to:
input the first text image data into the first convolutional neural network and the recurrent neural network of the first part to obtain a character encoding matrix;
input the first text image data into the second convolutional neural network of the second part to obtain a first feature map;
and input the character encoding matrix and the first feature map into the word embedding module of the second part, and train the initial model according to the first loss function to obtain a converged initial model.
14. The apparatus of claim 12, wherein the parameters of the second convolutional neural network are obtained based on the parameters of the first convolutional neural network.
15. The apparatus of claim 12, wherein the initial model building module comprises:
a first part training module, configured to train a pre-constructed first model according to the second loss function, based on second text image data and the text annotation information of the second text image data, to obtain a recognition model capable of recognizing the text content in an image, which serves as the first part;
wherein the pre-constructed first model is constructed based on the first convolutional neural network and the recurrent neural network, and the output of the first convolutional neural network is the input of the recurrent neural network.
16. The apparatus of claim 12, wherein the initial model building module comprises:
a second part training module, configured to train a pre-constructed second model according to the first loss function, based on third text image data and word vectors corresponding to the text annotation information of the third text image data, to obtain a model capable of determining whether a given text appears in a given image, which serves as the second part;
wherein the pre-constructed second model is constructed based on the second convolutional neural network and the word embedding module, and the initial parameters of the second convolutional neural network are the parameters of the first convolutional neural network.
17. The apparatus of claim 16,
wherein training the pre-constructed second model according to the first loss function, based on the third text image data and the word vectors corresponding to the text annotation information of the third text image data, comprises:
inputting the third text image data into the second convolutional neural network to obtain a second feature map;
taking the word vectors as the character-string input of the word embedding module and the second feature map as the other input of the word embedding module, and training the pre-constructed second model based on the first loss function; wherein the first loss function is a binary cross-entropy loss function.
18. The apparatus of claim 12, wherein the word embedding module comprises an attention-based encoder-decoder.
19. The apparatus of claim 12, wherein the first convolutional neural network comprises four blocks arranged in sequence;
the outputs of the four blocks are each down-sampled or up-sampled so that all four outputs have the same first size;
and the output of the first convolutional neural network is formed by point-by-point (element-wise) addition of the four blocks' outputs at corresponding positions.
20. The apparatus of claim 12, wherein the second convolutional neural network comprises four blocks arranged in sequence;
the outputs of the four blocks are each down-sampled or up-sampled so that all four outputs have the same second size;
and the output of the second convolutional neural network is formed by point-by-point (element-wise) addition of the four blocks' outputs at corresponding positions.
21. The apparatus of claim 12, wherein the text recognition model obtaining module is specifically configured to:
build the text recognition model based on the first part of the converged initial model; or, alternatively,
take the converged initial model as the text recognition model.
22. A text recognition apparatus, comprising:
a text recognition module, configured to input an image to be recognized into a pre-obtained text recognition model for text recognition, and output a text recognition result of the image to be recognized;
wherein the pre-obtained text recognition model comprises a text recognition model obtained by the method of any one of claims 1 to 10.
23. A readable storage medium having stored therein a computer program which, when executed by a processor, implements the method of any one of claims 1 to 11.
24. An electronic device, comprising: a processor and a memory, the memory storing instructions that are loaded and executed by the processor to implement the method of any one of claims 1 to 11.
CN202110258666.7A 2021-03-10 2021-03-10 Training method of text recognition model, text recognition method, device and equipment Active CN112633422B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110258666.7A CN112633422B (en) 2021-03-10 2021-03-10 Training method of text recognition model, text recognition method, device and equipment

Publications (2)

Publication Number Publication Date
CN112633422A (en) 2021-04-09
CN112633422B (en) 2021-06-22

Family

ID=75297825

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113269189B (en) * 2021-07-20 2021-10-08 北京世纪好未来教育科技有限公司 Construction method of text recognition model, text recognition method, device and equipment
CN113963358B (en) * 2021-12-20 2022-03-04 北京易真学思教育科技有限公司 Text recognition model training method, text recognition device and electronic equipment
CN115050014A (en) * 2022-06-15 2022-09-13 河北农业大学 Small sample tomato disease identification system and method based on image text learning

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948604A (en) * 2019-02-01 2019-06-28 北京捷通华声科技股份有限公司 Recognition methods, device, electronic equipment and the storage medium of irregular alignment text
CN111405360A (en) * 2020-03-25 2020-07-10 腾讯科技(深圳)有限公司 Video processing method and device, electronic equipment and storage medium
WO2020146119A1 (en) * 2019-01-11 2020-07-16 Microsoft Technology Licensing, Llc Compositional model for text recognition
CN111461105A (en) * 2019-01-18 2020-07-28 顺丰科技有限公司 Text recognition method and device
CN111723789A (en) * 2020-02-19 2020-09-29 王春宝 Image text coordinate positioning method based on deep learning
CN111860389A (en) * 2020-07-27 2020-10-30 北京易真学思教育科技有限公司 Data processing method, electronic device and computer readable medium
CN112016315A (en) * 2020-10-19 2020-12-01 北京易真学思教育科技有限公司 Model training method, text recognition method, model training device, text recognition device, electronic equipment and storage medium
CN112381079A (en) * 2019-07-29 2021-02-19 富士通株式会社 Image processing method and information processing apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant