CN112633423B - Training method of text recognition model, text recognition method, device and equipment - Google Patents


Info

Publication number
CN112633423B
CN112633423B (application number CN202110258981.XA)
Authority
CN
China
Prior art keywords
text
recognition model
text recognition
loss value
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110258981.XA
Other languages
Chinese (zh)
Other versions
CN112633423A (en
Inventor
李自荐
秦勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yizhen Xuesi Education Technology Co Ltd
Original Assignee
Beijing Yizhen Xuesi Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yizhen Xuesi Education Technology Co Ltd filed Critical Beijing Yizhen Xuesi Education Technology Co Ltd
Priority to CN202110258981.XA priority Critical patent/CN112633423B/en
Publication of CN112633423A publication Critical patent/CN112633423A/en
Application granted granted Critical
Publication of CN112633423B publication Critical patent/CN112633423B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 18/214 Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/24 Pattern recognition; classification techniques
    • G06F 40/289 Handling natural language data; phrasal analysis, e.g. finite state techniques or chunking
    • G06N 3/044 Neural networks; recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Neural networks; combinations of networks
    • G06V 10/50 Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; projection analysis
    • G06V 30/10 Character recognition

Abstract

The invention provides a training method for a text recognition model, a text recognition method, a device and equipment. The training method of the text recognition model comprises the following steps: inputting a first text image into a convolutional neural network of an initial text recognition model to obtain a first output result, and obtaining a recognition model Loss value according to the first output result and the annotation information corresponding to the first text image; inputting the first output result into a language model to obtain a second output result, and obtaining a language model Loss value according to the second output result and the annotation information corresponding to the first text image; and updating the parameters of the initial text recognition model based on the recognition model Loss value and the language model Loss value until a converged text recognition model is obtained. The text recognition model trained by the method improves recognition speed while achieving better recognition accuracy.

Description

Training method of text recognition model, text recognition method, device and equipment
Technical Field
The invention relates to text recognition technology, and in particular to a training method for a text recognition model, a text recognition method, a device and equipment.
Background
Text detection and recognition have a wide range of applications and are pre-steps of many computer vision tasks, such as image search, identity authentication and visual navigation. The main purpose of text detection is to locate text lines or characters in an image, while text recognition transcribes an image containing a text line into a character string (i.e. recognizes the content of the string). Accurately locating and accurately recognizing text are both important and challenging: compared with general object detection and recognition, characters exhibit multiple orientations, irregular shapes, extreme aspect ratios, and varied fonts, colors and backgrounds, so algorithms that succeed in general object detection and recognition cannot be directly transferred to character detection.
The recognition effect of existing text recognition models and methods is influenced by many factors; it is difficult to achieve both recognition speed and recognition accuracy, and the requirements of rapidly developing computer vision tasks cannot be met.
Disclosure of Invention
In order to solve at least one of the above technical problems, the present invention provides a training method for a text recognition model, a text recognition method, a device and equipment.
The technical scheme of the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides a method for training a text recognition model, including:
inputting a first text image into a convolutional neural network of an initial text recognition model to obtain a first output result, and obtaining a recognition model Loss value according to the first output result and annotation information corresponding to the first text image;
inputting the first output result into a language model to obtain a second output result, and acquiring a Loss value of the language model according to the second output result and the annotation information corresponding to the first text image;
and updating the parameters of the initial text recognition model based on the recognition model Loss value and the language model Loss value until a convergent text recognition model is obtained.
In one embodiment, before inputting the first output result to the language model and obtaining the second output result, the method further includes: updating the language model as follows:
inputting a second text image into the convolutional neural network of the initial text recognition model to obtain a third output result;
inputting the third output result into an initial language model to obtain a fourth output result, obtaining a language loss value according to the fourth output result and the labeling information corresponding to the second text image, and adjusting the parameters of the initial language model based on the language loss value to obtain an updated language model.
In one embodiment, updating the parameters of the initial text recognition model based on the recognition model Loss value and the language model Loss value includes:
adding the recognition model Loss value and the language model Loss value to obtain a total Loss value;
updating parameters of the initial text recognition model based on the total Loss value.
In one embodiment, the initial text recognition model is obtained by training as follows:
training a pre-constructed text recognition model based on a third text image and annotation information corresponding to the third text image to obtain the initial text recognition model;
the pre-constructed text recognition model is a text recognition model constructed based on a convolutional neural network and a Connectionist Temporal Classification (CTC) function module.
In one embodiment, the convolutional neural network comprises four Blocks arranged in sequence;
the inputting of the first text image into the convolutional neural network of the initial text recognition model to obtain a first output result includes:
inputting the first text image into the convolutional neural network, and performing feature extraction through each Block in the convolutional neural network to obtain a feature map;
respectively down-sampling the feature maps of the first three of the four Blocks to obtain down-sampled feature maps, wherein the down-sampled feature maps are the same size as the feature map output by the fourth Block;
and adding the down-sampled feature maps and the feature map output by the fourth Block point by point at corresponding position elements to obtain the first output result of the convolutional neural network.
In one embodiment, the initial language model is obtained by training as follows:
acquiring a word vector sequence of the annotated text sentence corresponding to a fourth text image;
and training a preset language model based on the word vector sequence and the annotated text sentence corresponding to the fourth text image to obtain the initial language model.
In one embodiment, the acquiring of the word vector sequence of the annotated text sentence corresponding to the fourth text image includes:
performing word embedding on the annotated text sentence corresponding to the fourth text image to obtain the word vector sequence.
In a second aspect, an embodiment of the present invention provides a text recognition method, including:
inputting a text image to be recognized into a pre-trained text recognition model for text recognition, and outputting a text recognition result of the text image to be recognized;
wherein the pre-trained text recognition model comprises: training the obtained text recognition model based on the method of any one of the embodiments of the first aspect.
In a third aspect, an embodiment of the present invention provides a training apparatus for a text recognition model, including:
the recognition model Loss value acquisition module is used for inputting the first text image into a convolutional neural network of an initial text recognition model to obtain a first output result, and acquiring a recognition model Loss value according to the first output result and the annotation information corresponding to the first text image;
the language model Loss value acquisition module is used for inputting the first output result into a language model, acquiring a second output result and acquiring a language model Loss value according to the second output result and the annotation information corresponding to the first text image;
and the text recognition model updating module is used for updating the parameters of the initial text recognition model based on the recognition model Loss value and the language model Loss value until a convergent text recognition model is obtained.
In one embodiment, the apparatus further comprises:
the language model updating module is used for inputting the first output result into the language model and updating the language model in the following mode before obtaining a second output result:
inputting a second text image into the convolutional neural network of the initial text recognition model to obtain a third output result;
inputting the third output result into an initial language model to obtain a fourth output result, obtaining a language loss value according to the fourth output result and the labeling information corresponding to the second text image, and adjusting the parameters of the initial language model based on the language loss value to obtain an updated language model.
In an embodiment, the text recognition model updating module is specifically configured to:
adding the recognition model Loss value and the language model Loss value to obtain a total Loss value;
updating parameters of the initial text recognition model based on the total Loss value.
In one embodiment, the initial text recognition model is obtained by training as follows:
training a pre-constructed text recognition model based on a third text image and annotation information corresponding to the third text image to obtain the initial text recognition model;
the pre-constructed text recognition model is a text recognition model constructed based on a convolutional neural network and a Connectionist Temporal Classification (CTC) function module.
In one embodiment, the convolutional neural network comprises four Blocks arranged in sequence;
the inputting of the first text image into the convolutional neural network of the initial text recognition model to obtain a first output result includes:
inputting the first text image into the convolutional neural network, and performing feature extraction through each Block in the convolutional neural network to obtain a feature map;
respectively down-sampling the feature maps of the first three of the four Blocks to obtain down-sampled feature maps, wherein the down-sampled feature maps are the same size as the feature map output by the fourth Block;
and adding the down-sampled feature maps and the feature map output by the fourth Block point by point at corresponding position elements to obtain the first output result of the convolutional neural network.
In one embodiment, the initial language model is obtained by training as follows:
acquiring a word vector sequence of the annotated text sentence corresponding to a fourth text image;
and training a preset language model based on the word vector sequence and the annotated text sentence corresponding to the fourth text image to obtain the initial language model.
In one embodiment, the acquiring of the word vector sequence of the annotated text sentence corresponding to the fourth text image includes:
performing word embedding on the annotated text sentence corresponding to the fourth text image to obtain the word vector sequence.
In a fourth aspect, an embodiment of the present invention provides a text recognition apparatus, including:
the text recognition module is used for inputting the text image to be recognized into a pre-trained text recognition model for text recognition and outputting a text recognition result of the text image to be recognized;
wherein the pre-trained text recognition model comprises: a text recognition model obtained by training based on the method of any one of the embodiments of the first aspect.
In a fifth aspect, an embodiment of the present invention provides a readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the method in any one of the above-mentioned aspects.
In a sixth aspect, an embodiment of the present invention provides an electronic device, including: a processor and a memory, the memory having stored therein instructions that are loaded and executed by the processor to implement the method of any of the above aspects.
The advantages or beneficial effects in the above technical solution at least include:
the technical scheme of the invention includes that a first text image is input into a convolutional neural network of a text recognition model to obtain a first output result, and a Loss value of the recognition model is obtained according to the first output result and label information corresponding to the first text image; the first output result is input into the language model to obtain a second output result, the language model Loss value is obtained according to the second output result and the labeling information corresponding to the first text image, and the parameters of the text recognition model are updated according to the recognition model Loss value and the language model Loss value, so that the combined training and the mutual game of the text recognition model and the language model are realized, the text recognition model has the capability of modeling the relationship between characters indirectly, the recognition speed is improved, and meanwhile, the better recognition precision is realized.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the invention and together with the description serve to explain the principles of the invention.
FIG. 1 is a schematic flow chart of a method of training a text recognition model of the present invention;
FIG. 2 is a logic diagram of the training apparatus of the text recognition model of the present invention;
FIG. 3 is a flow chart illustrating a text recognition method of the present invention;
fig. 4 is a logic diagram of the text recognition apparatus of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and embodiments. It is to be understood that the specific embodiments described herein are for purposes of illustration only and are not to be construed as limitations of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
In addition, the embodiments of the present invention and the features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
It should be noted that, the step numbers in the text are only for convenience of explanation of the specific embodiments, and do not serve to limit the execution sequence of the steps. The method provided by the embodiment can be executed by a related server, and the following description takes an electronic device such as a server or a computer as an example of an execution subject.
Example one
Referring to fig. 1, the present embodiment provides a training method of a text recognition model, including:
inputting a first text image into a convolutional neural network of an initial text recognition model to obtain a first output result, and obtaining a Loss value (Loss value, value of a Loss function) of the recognition model according to the first output result and label information corresponding to the first text image;
inputting the first output result into a language model to obtain a second output result, and acquiring a Loss value of the language model according to the second output result and the annotation information corresponding to the first text image;
and updating the parameters of the initial text recognition model based on the recognition model Loss value and the language model Loss value until a convergent text recognition model is obtained.
In the embodiment of the invention, a text image is an image with text information; the text information comprises a sentence with a certain meaning or vocabulary information, and the annotation information is used to express the sentence content corresponding to the text image. The annotation information may be obtained by manually annotating the text in the image, and it constitutes the real data for training the text recognition model and the language model.
In this embodiment, a text recognition model and a language model are constructed. Based on the game-training idea of generative adversarial networks, the text recognition model is regarded as the generator, the recognition result output by the text recognition model is regarded as generated data, the annotation information corresponding to the text in the image is regarded as real data, and the language model is regarded as the discriminator.
In this embodiment, a first text image is input into the convolutional neural network of the text recognition model to obtain a first output result, and a recognition model Loss value is obtained according to the first output result and the annotation information corresponding to the first text image; the first output result is input into the language model to obtain a second output result, and a language model Loss value is obtained according to the second output result and the annotation information corresponding to the first text image. The parameters of the text recognition model are then updated according to the recognition model Loss value and the language model Loss value, realizing joint training and a mutual game between the text recognition model and the language model. The text recognition model thereby indirectly gains the ability to model the relationships between characters, improving recognition speed while achieving better recognition accuracy.
As a preferred implementation of this embodiment, updating the parameters of the initial text recognition model based on the recognition model Loss value and the language model Loss value includes: adding the recognition model Loss value and the language model Loss value to obtain a total Loss value, and updating the parameters of the initial text recognition model based on the total Loss value.
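The parameter update described above can be sketched in PyTorch as follows. This is an illustrative sketch, not the patented implementation: the function name, the optimizer, and the assumption that both Loss values are scalar tensors computed on the same first text image are all assumptions for the example.

```python
import torch

def update_recognizer(recognition_loss, language_loss, optimizer):
    """Add the two Loss values and update the recognizer's parameters.

    recognition_loss: scalar tensor from the CTC head (recognition model Loss value)
    language_loss:    scalar tensor from the language-model discriminator
    optimizer:        optimizer over the text recognition model's parameters
    """
    total_loss = recognition_loss + language_loss  # total Loss value
    optimizer.zero_grad()
    total_loss.backward()   # gradients flow through both Loss terms
    optimizer.step()        # update the initial text recognition model
    return total_loss.item()
```

In practice the two terms could also be weighted before summation; the patent only specifies a plain addition.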
As a preferred implementation of this embodiment, the initial text recognition model is a text recognition model obtained by training in the following manner:
training a pre-constructed text recognition model based on a third text image and label information corresponding to the third text image to obtain the initial text recognition model;
the pre-constructed text recognition model is constructed based on the convolutional neural network and a Connectionist Temporal Classification (CTC) function module; the convolutional neural network extracts features of the input image, and the CTC function module is used for acquiring the recognition model Loss value.
The text recognition model of this embodiment is similar in principle to Rosetta: it uses only the convolutional neural network and the CTC function module, which is equivalent to a CRNN (Convolutional Recurrent Neural Network) with the bidirectional recurrent neural network removed, so the text recognition model has a faster recognition speed.
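The recognition model Loss value of such a CNN-plus-CTC recognizer can be computed with a standard CTC loss. The following PyTorch sketch is illustrative only; the shapes (T time steps, B batch, N dictionary size including the blank) and the choice of blank index are assumptions, not details fixed by the patent.

```python
import torch

# CTC function module: blank symbol at index 0 is an assumption for this sketch.
ctc_loss = torch.nn.CTCLoss(blank=0, zero_infinity=True)

def recognition_loss(log_probs, targets, input_lengths, target_lengths):
    """Recognition model Loss value from the convolutional backbone's output.

    log_probs: (T, B, N) per-time-step log-softmax scores
    targets:   (B, S) annotation label indices (no blanks)
    """
    return ctc_loss(log_probs, targets, input_lengths, target_lengths)
```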
As a preferred implementation of this embodiment, the convolutional neural network of the text recognition model uses Resnet18, where the Resnet comprises four Blocks arranged in sequence, for example four Blocks connected in series, and features of the image are extracted through these Blocks. In a conventional Resnet, each Block includes several convolution layers, and the four Blocks output feature maps at 1/4, 1/8, 1/16 and 1/32 of the original input image size, respectively.
In the embodiment of the invention, unlike a conventional Resnet, the first text image is input into the convolutional neural network and feature extraction is performed by each Block. After the feature maps are obtained, the feature maps of the first three of the four Blocks are respectively down-sampled to obtain down-sampled feature maps, so that they are the same size as the feature map output by the fourth Block, i.e. 1/32 of the original input image; the outputs of the four Blocks are then added point by point at corresponding position elements to serve as the output of the convolutional neural network, giving the first output result.
According to the embodiment of the invention, the feature maps of the first three of the four Blocks are down-sampled so that the feature maps output by all four Blocks have the same size and their corresponding position elements can be added point by point. The output of the convolutional neural network thus forms an L x N matrix, so that the first output result can be used as the input of the language model for calculating the language model Loss value, where L represents the number of characters of the longest recognizable string and N represents the size of the dictionary.
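The down-sampling and point-by-point fusion described above can be sketched as follows. This is an illustrative sketch: the patent does not specify the down-sampling operator, so average pooling is assumed here, and the channel-first layout is an assumption of the example.

```python
import numpy as np

def avg_pool(x, k):
    """Down-sample a (C, H, W) feature map by a factor k via average pooling."""
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))

def fuse_blocks(f4, f8, f16, f32):
    """Fuse the four Block outputs (at 1/4, 1/8, 1/16, 1/32 resolution).

    The first three maps are down-sampled to the 1/32 resolution of the
    fourth Block, then all four are added point by point at corresponding
    position elements.
    """
    pooled = [avg_pool(f4, 8), avg_pool(f8, 4), avg_pool(f16, 2), f32]
    return sum(pooled)
```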
In an optional implementation of this embodiment, the language model is obtained by extensive pre-training on a specific scene in advance and can accurately judge whether its input is a reasonable sentence. In this case, a language model already trained for the specific scene can be used directly, reducing the training of the language model and improving the training efficiency of the text recognition model.
In an optional implementation manner of this embodiment, the language model and the text recognition model are synchronously trained in a combined manner, specifically, before the first output result is input to the language model and the second output result is obtained, the language model is updated as follows:
inputting the second text image into a convolutional neural network of the initial text recognition model to obtain a third output result;
and inputting the third output result into the initial language model to obtain a fourth output result, obtaining a language Loss value (Loss value) according to the fourth output result and the labeling information corresponding to the second text image, and adjusting parameters of the initial language model based on the language Loss value to obtain the updated language model.
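The language-model update step above can be sketched as follows. This is a hedged sketch: the discriminator interface, the binary cross-entropy loss against the annotation-derived target, and all names are illustrative assumptions, not the patent's fixed implementation.

```python
import torch

def update_language_model(lm, lm_optimizer, recognizer_output, target):
    """Adjust only the language model's parameters based on the language Loss value.

    lm:                language model returning P(input is a reasonable sentence)
    recognizer_output: third output result from the recognizer's CNN
    target:            1.0 for real (annotated) sentences, 0.0 for generated ones
    """
    score = lm(recognizer_output)
    loss = torch.nn.functional.binary_cross_entropy(score, target)
    lm_optimizer.zero_grad()
    loss.backward()
    lm_optimizer.step()  # only the language model's parameters are updated
    return loss.item()
```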
In this embodiment, through the joint training of the text recognition model and the language model, the training method can be applied to any scene; because the text recognition model and the language model are jointly trained on the same scene and improve synchronously, the obtained text recognition model can achieve higher recognition accuracy for that scene.
As a preferred embodiment, the initial language model is obtained by training according to the following way:
acquiring a word vector sequence of the annotated text sentence corresponding to a fourth text image; and training a preset language model based on the word vector sequence and the annotated text sentence corresponding to the fourth text image to obtain the initial language model.
The acquiring of the word vector sequence of the annotated text sentence corresponding to the fourth text image includes: performing word embedding on the annotated text sentence corresponding to the fourth text image to obtain the word vector sequence.
The language model is a preset language model constructed based on a two-layer bidirectional LSTM (Long Short-Term Memory) model; by preliminarily training the preset language model, the constructed language model gains the ability to preliminarily judge whether its input is a reasonable sentence.
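A minimal sketch of such a preset language model follows. The hidden size and the final projection layer are illustrative assumptions; the patent only fixes the two-layer bidirectional LSTM structure and the N-dimensional per-node input and output.

```python
import torch

class PresetLanguageModel(torch.nn.Module):
    """Two-layer bidirectional LSTM over N-dimensional vectors (sketch)."""

    def __init__(self, dict_size, hidden=32):
        super().__init__()
        self.lstm = torch.nn.LSTM(dict_size, hidden, num_layers=2,
                                  bidirectional=True, batch_first=True)
        # Project the concatenated forward/backward states back to N dims,
        # so each output node corresponds to an N-dimensional vector.
        self.proj = torch.nn.Linear(2 * hidden, dict_size)

    def forward(self, x):          # x: (batch, L, N)
        out, _ = self.lstm(x)
        return self.proj(out)      # (batch, L, N)
```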
In this embodiment, a large number of fourth text images with text information may be annotated to obtain annotation information, and the word vector of each word of each piece of annotation information is obtained by applying Word2vec (Word to vector) or another word embedding method to the annotation information.
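Turning an annotated sentence into a word vector sequence can be sketched as follows. The tiny vocabulary and the use of a learned `torch.nn.Embedding` table (into which pre-trained Word2vec vectors could be loaded) are illustrative assumptions for the example.

```python
import torch

# Hypothetical toy vocabulary; in practice it is built from the annotation data.
vocab = {"<unk>": 0, "hello": 1, "world": 2}
embedding = torch.nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

def sentence_to_vectors(words):
    """Map an annotated sentence (list of words) to its word vector sequence."""
    ids = torch.tensor([vocab.get(w, vocab["<unk>"]) for w in words])
    return embedding(ids)  # (sequence length, embedding dim)
```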
The input of the language model of this embodiment is an encoded character string, and the output is the probability that the character string is a reasonable sentence. The input of each node of the language model is an N-dimensional vector, the number of output nodes is the same as that of the input nodes, and each output node corresponds to an N-dimensional vector, where N is the size of the dictionary.
As an optional implementation, the preliminary training of the preset language model includes: using the word vectors corresponding to the annotation information of the fourth text images as N-dimensional input vectors of the language model, and using the annotation information as real data. The sentence corresponding to the maximum probability value is obtained through beam search (Beam Search) or greedy search, and training is performed with a binary cross-entropy loss function, finally obtaining a language model capable of preliminarily judging whether its input is a reasonable sentence.
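The greedy-search step mentioned above, picking the highest-probability dictionary entry at each of the L output positions, can be sketched as follows (the dictionary and probabilities are made-up examples):

```python
import numpy as np

def greedy_decode(probs, dictionary):
    """Greedy search: at each position, take the most probable dictionary entry.

    probs: (L, N) matrix of per-position probabilities over the dictionary
    """
    return [dictionary[i] for i in probs.argmax(axis=1)]
```

Beam search would instead keep the top-k partial sentences at each position; greedy search is the k = 1 special case.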
Thus, in this embodiment, the language model and the text recognition model are first obtained and trained separately, and are then jointly trained; the language model Loss value and the recognition model Loss value are obtained as the basis for updating the parameters of the text recognition model, realizing joint training and a mutual game between the two models and simultaneously improving the recognition accuracy of the text recognition model and the judgment accuracy of the language model. When the text recognition model and the language model reach game balance, the training of the text recognition model is complete. The trained model has the ability to indirectly model the relationships between characters, so when text recognition is performed with it, no additional modeling of the relationships between characters is needed, achieving high recognition accuracy while maintaining recognition speed.
Example two
Referring to fig. 2, the present embodiment provides a training apparatus for a text recognition model, including:
the recognition model Loss value acquisition module is used for inputting the first text image into a convolutional neural network of an initial text recognition model to obtain a first output result, and acquiring a recognition model Loss value according to the first output result and the annotation information corresponding to the first text image;
the language model Loss value acquisition module is used for inputting the first output result into a language model, acquiring a second output result and acquiring a language model Loss value according to the second output result and the annotation information corresponding to the first text image;
and the text recognition model updating module is used for updating the parameters of the initial text recognition model based on the recognition model Loss value and the language model Loss value until a convergent text recognition model is obtained.
As an optional implementation, the apparatus further comprises:
the language model updating module is used for updating the language model in the following manner before the first output result is input into the language model to obtain the second output result:
inputting a second text image into the convolutional neural network of the initial text recognition model to obtain a third output result;
and inputting the third output result into an initial language model to obtain a fourth output result, obtaining a language loss value according to the fourth output result and the labeling information corresponding to the second text image, and adjusting parameters of the initial language model based on the language loss value to obtain an updated language model.
As an optional implementation manner, the text recognition model updating module is specifically configured to:
adding the recognition model Loss value and the language model Loss value to obtain a total Loss value; and updating the parameters of the initial text recognition model based on the total Loss value.
As an alternative embodiment, the initial text recognition model is obtained by training according to the following manner:
training a pre-constructed text recognition model based on a third text image and label information corresponding to the third text image to obtain the initial text recognition model;
the pre-constructed text recognition model is a text recognition model constructed based on a convolutional neural network and a connectionist temporal classification (CTC) function module.
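A CTC function module scores every blank-padded alignment of the label sequence against the per-timestep character probabilities; the Loss is the negative log of the summed path probability. The numpy sketch of the standard CTC forward recursion below is illustrative only — the patent does not disclose its CTC implementation, and production systems use library versions of this computation.

```python
import numpy as np

def ctc_forward_prob(probs, labels, blank=0):
    """Probability that per-timestep distributions `probs` (T x V)
    emit `labels` after CTC collapsing (standard forward algorithm)."""
    ext = [blank]
    for c in labels:                     # interleave labels with blanks
        ext += [c, blank]
    T, S = probs.shape[0], len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]                            # stay in state
            if s > 0:
                a += alpha[t - 1, s - 1]                   # advance one state
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]                   # skip over a blank
            alpha[t, s] = a * probs[t, ext[s]]
    return alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)

# With two timesteps, vocabulary {blank, 'a'} and uniform probabilities,
# three of the four length-2 paths collapse to "a": (a,a), (-,a), (a,-),
# so the path probability is 3 * 0.25 = 0.75.
p = ctc_forward_prob(np.array([[0.5, 0.5], [0.5, 0.5]]), [1])
loss = -np.log(p)                        # CTC Loss for this toy example
```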
As an optional embodiment, the convolutional neural network comprises four Blocks arranged in sequence;
the inputting the first text image into the convolutional neural network of the initial text recognition model to obtain a first output result includes:
inputting the first text image into the convolutional neural network, and performing feature extraction through each Block in the convolutional neural network to obtain a feature map;
respectively down-sampling the feature maps of the first three of the four Blocks to obtain down-sampled feature maps, wherein each down-sampled feature map has the same size as the feature map output by the fourth Block;
and adding the down-sampled feature maps and the feature map output by the fourth Block element-wise at corresponding positions to obtain the first output result of the convolutional neural network.
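The fusion of the four Blocks' feature maps can be sketched as follows. The stride-2 halving of spatial size per Block and the use of average pooling as the down-sampling operator are assumptions — the embodiment only requires that the sizes match before the point-by-point addition.

```python
import numpy as np

def avg_pool2d(x, k):
    """Average-pool a (C, H, W) feature map with kernel and stride k."""
    C, H, W = x.shape
    return x.reshape(C, H // k, k, W // k, k).mean(axis=(2, 4))

def fuse_block_features(f1, f2, f3, f4):
    """Down-sample the first three Blocks' feature maps to the fourth
    Block's spatial size and add all four element-wise at corresponding
    positions, as described in the embodiment."""
    out = f4.copy()
    for f, k in ((f1, 8), (f2, 4), (f3, 2)):  # assumed stride-2 per Block
        out += avg_pool2d(f, k)
    return out

# Toy maps: one channel each, spatial size halving from Block to Block.
f1, f2, f3, f4 = (np.ones((1, s, s)) for s in (32, 16, 8, 4))
fused = fuse_block_features(f1, f2, f3, f4)
```

With all-ones inputs each pooled map is still all ones, so the fused map is 4.0 everywhere and keeps the fourth Block's spatial size.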
As an alternative embodiment, the initial language model is obtained by training as follows: acquiring a word vector sequence of the labeled text sentence corresponding to a fourth text image; and training a preset language model based on the word vector sequence and the labeled text sentence corresponding to the fourth text image to obtain the initial language model.
As an optional implementation manner, acquiring the word vector sequence of the labeled text sentence corresponding to the fourth text image includes: performing word embedding on the labeled text sentence corresponding to the fourth text image to obtain the word vector sequence.
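Word embedding of a labeled sentence amounts to looking up one vector per token in a trainable table. The character-level tokenization, the toy vocabulary, the embedding dimension, and the random initialization below are all illustrative assumptions; the patent does not specify them.

```python
import numpy as np

def embed_sentence(sentence, vocab, table):
    """Map each character of the labeled text sentence to its row of the
    embedding table, yielding the word vector sequence."""
    return table[[vocab[ch] for ch in sentence]]

rng = np.random.default_rng(0)
vocab = {ch: i for i, ch in enumerate("abcdef")}   # toy character vocabulary
table = rng.standard_normal((len(vocab), 8))       # 8-dim embeddings (assumed)
seq = embed_sentence("cafe", vocab, table)         # (4, 8) word vector sequence
```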
The principle and function of each module in the device of the present embodiment are the same as those in the first embodiment, and the description of the present embodiment is not repeated.
EXAMPLE III
Referring to fig. 3, the present embodiment provides a text recognition method, including:
inputting a text image to be recognized into a pre-trained text recognition model for text recognition, and outputting a text recognition result of the text image to be recognized;
wherein the pre-trained text recognition model is a text recognition model obtained by training based on the above training method of the text recognition model.
In this embodiment, the text recognition model is trained based on the training method in any implementation manner of the first embodiment, yielding a text recognition model that indirectly models the relationships between characters. This model can accurately recognize text information in an image: the text image to be recognized is used as input, and the final recognition result is obtained directly from the text recognition model, so that recognition speed and recognition precision are improved at the same time.
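At inference time a CTC-trained recognizer can emit the final text from a single forward pass followed by a cheap greedy decode: take the best class per timestep, collapse repeats, and drop blanks. The sketch below assumes blank index 0 and best-path (non-beam) decoding; the patent does not specify the decoding strategy.

```python
import numpy as np

def ctc_greedy_decode(probs, charset, blank=0):
    """Best-path CTC decode: argmax per timestep, collapse consecutive
    repeats, then remove blanks. `charset[i]` is the character for class i."""
    out, prev = [], blank
    for idx in probs.argmax(axis=1):
        if idx != blank and idx != prev:
            out.append(charset[idx])
        prev = idx
    return "".join(out)

# Per-timestep classes a, a, blank, b, b over the charset "-ab"
# (index 0 is the blank); the blank separates nothing here, so the
# repeats collapse and the decode is "ab".
probs = np.eye(3)[[1, 1, 0, 2, 2]]
text = ctc_greedy_decode(probs, "-ab")
```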
Example four
Referring to fig. 4, the present embodiment provides a text recognition apparatus, including:
the text recognition module is used for inputting the text image to be recognized into a pre-trained text recognition model for text recognition and outputting a text recognition result of the text image to be recognized;
wherein the pre-trained text recognition model is a text recognition model obtained by training based on the above training method of the text recognition model.
EXAMPLE five
The present embodiment provides a readable storage medium in which a computer program is stored; the computer program, when executed by a processor, implements the method in any one of the above embodiments. The storage medium may include various media capable of storing program code, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
EXAMPLE six
The present embodiment provides an electronic device, including: a processor and a memory, the memory storing instructions therein, the instructions being loaded and executed by the processor to implement the method of any of the above embodiments.
It should be understood that the processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general-purpose processor may be a microprocessor or any conventional processor. It is noted that the processor may be a processor supporting the Advanced RISC Machine (ARM) architecture.
The memory may include read-only memory and random access memory, and may also include non-volatile random access memory. The memory may be volatile memory or non-volatile memory, or may include both. Non-volatile memory may include Read-Only Memory (ROM), Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), or flash memory. Volatile memory may include Random Access Memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. All or part of the steps of the methods of the above embodiments may be implemented by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, includes one of or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, each unit may exist alone physically, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If implemented in the form of a software functional module and sold or used as a separate product, the integrated module may also be stored in a computer-readable storage medium. The storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present invention includes additional implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved.
The logic and/or steps represented in the flowcharts or otherwise described herein, for example, an ordered list of executable instructions that can be considered to implement logical functions, may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them.
Furthermore, the terms "first", "second", "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
It will be understood by those skilled in the art that the foregoing embodiments are merely for clarity of description and are not intended to limit the scope of the invention. Other variations or modifications will occur to those skilled in the art based on the foregoing disclosure and are within the scope of the invention.

Claims (16)

1. A training method of a text recognition model is characterized by comprising the following steps:
inputting a first text image into a convolutional neural network of an initial text recognition model to obtain a first output result, and obtaining a recognition model Loss value according to the first output result and the annotation information corresponding to the first text image;
inputting a second text image into the convolutional neural network of the initial text recognition model to obtain a third output result;
inputting the third output result into an initial language model to obtain a fourth output result, obtaining a language loss value according to the fourth output result and the annotation information corresponding to the second text image, and adjusting parameters of the initial language model based on the language loss value to obtain an updated language model;
inputting the first output result into the updated language model to obtain a second output result, and obtaining a language model Loss value according to the second output result and the annotation information corresponding to the first text image;
and updating the parameters of the initial text recognition model based on the recognition model Loss value and the language model Loss value until a converged text recognition model is obtained.
2. The method of claim 1,
updating parameters of the initial text recognition model based on the recognition model Loss value and the language model Loss value, including:
adding the recognition model Loss value and the language model Loss value to obtain a total Loss value;
updating parameters of the initial text recognition model based on the total Loss value.
3. The method of claim 1,
the initial text recognition model is obtained by training according to the following mode:
training a pre-constructed text recognition model based on a third text image and label information corresponding to the third text image to obtain the initial text recognition model;
the pre-constructed text recognition model is a text recognition model constructed based on a convolutional neural network and a connectionist temporal classification (CTC) function module.
4. The method of claim 1,
the convolutional neural network comprises four Blocks arranged in sequence;
the inputting the first text image into the convolutional neural network of the initial text recognition model to obtain a first output result comprises:
inputting the first text image into the convolutional neural network, and performing feature extraction through each Block in the convolutional neural network to obtain a feature map;
respectively down-sampling the feature maps of the first three of the four Blocks to obtain down-sampled feature maps, wherein each down-sampled feature map has the same size as the feature map output by the fourth Block;
and adding the down-sampled feature maps and the feature map output by the fourth Block element-wise at corresponding positions to obtain the first output result of the convolutional neural network.
5. The method of claim 1, wherein the initial language model is obtained by training as follows:
acquiring a word vector sequence of the labeled text sentence corresponding to a fourth text image;
and training a preset language model based on the word vector sequence and the labeled text sentence corresponding to the fourth text image to obtain the initial language model.
6. The method of claim 5, wherein obtaining the word vector sequence of the labeled text sentence corresponding to the fourth text image comprises:
performing word embedding on the labeled text sentence corresponding to the fourth text image to obtain the word vector sequence.
7. A text recognition method, comprising:
inputting a text image to be recognized into a pre-trained text recognition model for text recognition, and outputting a text recognition result of the text image to be recognized;
wherein the pre-trained text recognition model is a text recognition model obtained by training based on the method of any one of claims 1 to 6.
8. An apparatus for training a text recognition model, comprising:
the recognition model Loss value acquisition module is used for inputting a first text image into a convolutional neural network of an initial text recognition model to obtain a first output result, and acquiring a recognition model Loss value according to the first output result and the annotation information corresponding to the first text image;
the language model Loss value acquisition module is used for inputting the first output result into a language model to obtain a second output result, and acquiring a language model Loss value according to the second output result and the annotation information corresponding to the first text image;
the text recognition model updating module is used for updating the parameters of the initial text recognition model based on the recognition model Loss value and the language model Loss value until a convergent text recognition model is obtained;
the language model updating module is used for updating the language model in the following manner before the first output result is input into the language model to obtain the second output result:
inputting a second text image into the convolutional neural network of the initial text recognition model to obtain a third output result;
inputting the third output result into an initial language model to obtain a fourth output result, obtaining a language loss value according to the fourth output result and the labeling information corresponding to the second text image, and adjusting the parameters of the initial language model based on the language loss value to obtain an updated language model.
9. The apparatus of claim 8, wherein the text recognition model update module is specifically configured to:
adding the recognition model Loss value and the language model Loss value to obtain a total Loss value;
updating parameters of the initial text recognition model based on the total Loss value.
10. The apparatus of claim 8,
the initial text recognition model is obtained by training according to the following mode:
training a pre-constructed text recognition model based on a third text image and label information corresponding to the third text image to obtain the initial text recognition model;
the pre-constructed text recognition model is a text recognition model constructed based on a convolutional neural network and a connectionist temporal classification (CTC) function module.
11. The apparatus of claim 8,
the convolutional neural network comprises four Blocks arranged in sequence;
the inputting the first text image into the convolutional neural network of the initial text recognition model to obtain a first output result comprises:
inputting the first text image into the convolutional neural network, and performing feature extraction through each Block in the convolutional neural network to obtain a feature map;
respectively down-sampling the feature maps of the first three of the four Blocks to obtain down-sampled feature maps, wherein each down-sampled feature map has the same size as the feature map output by the fourth Block;
and adding the down-sampled feature maps and the feature map output by the fourth Block element-wise at corresponding positions to obtain the first output result of the convolutional neural network.
12. The apparatus of claim 8, wherein the initial language model is obtained by training as follows:
acquiring a word vector sequence of the labeled text sentence corresponding to a fourth text image;
and training a preset language model based on the word vector sequence and the labeled text sentence corresponding to the fourth text image to obtain the initial language model.
13. The apparatus of claim 12, wherein said obtaining a word vector sequence of the labeled text sentence corresponding to the fourth text image comprises:
performing word embedding on the labeled text sentence corresponding to the fourth text image to obtain the word vector sequence.
14. A text recognition apparatus, comprising:
the text recognition module is used for inputting the text image to be recognized into a pre-trained text recognition model for text recognition and outputting a text recognition result of the text image to be recognized;
wherein the pre-trained text recognition model is a text recognition model obtained by training based on the method of any one of claims 1 to 6.
15. A readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
16. An electronic device, comprising: a processor and a memory, the memory having stored therein instructions that are loaded and executed by the processor to implement the method of any of claims 1 to 7.
CN202110258981.XA 2021-03-10 2021-03-10 Training method of text recognition model, text recognition method, device and equipment Active CN112633423B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110258981.XA CN112633423B (en) 2021-03-10 2021-03-10 Training method of text recognition model, text recognition method, device and equipment

Publications (2)

Publication Number Publication Date
CN112633423A CN112633423A (en) 2021-04-09
CN112633423B true CN112633423B (en) 2021-06-22

Family

ID=75297827

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110258981.XA Active CN112633423B (en) 2021-03-10 2021-03-10 Training method of text recognition model, text recognition method, device and equipment

Country Status (1)

Country Link
CN (1) CN112633423B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230004741A1 (en) * 2021-06-30 2023-01-05 Konica Minolta Business Solutions U.S.A., Inc. Handwriting recognition method and apparatus employing content aware and style aware data augmentation
CN113963359B (en) * 2021-12-20 2022-03-18 北京易真学思教育科技有限公司 Text recognition model training method, text recognition device and electronic equipment
CN113963358B (en) * 2021-12-20 2022-03-04 北京易真学思教育科技有限公司 Text recognition model training method, text recognition device and electronic equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018125926A1 (en) * 2016-12-27 2018-07-05 Datalogic Usa, Inc Robust string text detection for industrial optical character recognition
EP3660733B1 (en) * 2018-11-30 2023-06-28 Tata Consultancy Services Limited Method and system for information extraction from document images using conversational interface and database querying
CN111259768A (en) * 2020-01-13 2020-06-09 清华大学 Image target positioning method based on attention mechanism and combined with natural language
CN111401375B (en) * 2020-03-09 2022-12-30 苏宁云计算有限公司 Text recognition model training method, text recognition device and text recognition equipment
CN111738251B (en) * 2020-08-26 2020-12-04 北京智源人工智能研究院 Optical character recognition method and device fused with language model and electronic equipment

Also Published As

Publication number Publication date
CN112633423A (en) 2021-04-09

Similar Documents

Publication Publication Date Title
CN112633423B (en) Training method of text recognition model, text recognition method, device and equipment
CN110795543B (en) Unstructured data extraction method, device and storage medium based on deep learning
KR20190085098A (en) Keyword extraction method, computer device, and storage medium
CN110188223A (en) Image processing method, device and computer equipment
Wilkinson et al. Neural Ctrl-F: segmentation-free query-by-string word spotting in handwritten manuscript collections
CN110363049B (en) Method and device for detecting, identifying and determining categories of graphic elements
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
WO2022116436A1 (en) Text semantic matching method and apparatus for long and short sentences, computer device and storage medium
CN112380837B (en) Similar sentence matching method, device, equipment and medium based on translation model
CN112633422B (en) Training method of text recognition model, text recognition method, device and equipment
CN113297366B (en) Emotion recognition model training method, device, equipment and medium for multi-round dialogue
CN109522740B (en) Health data privacy removal processing method and system
CN113094478B (en) Expression reply method, device, equipment and storage medium
CN113656547B (en) Text matching method, device, equipment and storage medium
CN112966685B (en) Attack network training method and device for scene text recognition and related equipment
CN111738270B (en) Model generation method, device, equipment and readable storage medium
US11507744B2 (en) Information processing apparatus, information processing method, and computer-readable recording medium
US11687712B2 (en) Lexical analysis training of convolutional neural network by windows of different lengths with matrix of semantic vectors
CN113449489B (en) Punctuation mark labeling method, punctuation mark labeling device, computer equipment and storage medium
CN115393625A (en) Semi-supervised training of image segmentation from coarse markers
CN111357015B (en) Text conversion method, apparatus, computer device, and computer-readable storage medium
CN113239967A (en) Character recognition model training method, recognition method, related equipment and storage medium
US20230130662A1 (en) Method and apparatus for analyzing multimodal data
CN111680132A (en) Noise filtering and automatic classifying method for internet text information
CN108829896B (en) Reply information feedback method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant