CN112633423B - Training method of text recognition model, text recognition method, device and equipment - Google Patents


Info

Publication number
CN112633423B
CN112633423B (application number CN202110258981.XA)
Authority
CN
China
Prior art keywords
text
recognition model
text recognition
loss value
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110258981.XA
Other languages
Chinese (zh)
Other versions
CN112633423A (en
Inventor
李自荐
秦勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yizhen Xuesi Education Technology Co Ltd
Original Assignee
Beijing Yizhen Xuesi Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yizhen Xuesi Education Technology Co Ltd filed Critical Beijing Yizhen Xuesi Education Technology Co Ltd
Priority to CN202110258981.XA priority Critical patent/CN112633423B/en
Publication of CN112633423A publication Critical patent/CN112633423A/en
Application granted granted Critical
Publication of CN112633423B publication Critical patent/CN112633423B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 18/214 Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/24 Pattern recognition; classification techniques
    • G06F 40/289 Handling natural language data; phrasal analysis, e.g. finite state techniques or chunking
    • G06N 3/044 Neural networks; recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Neural networks; combinations of networks
    • G06V 10/50 Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; projection analysis
    • G06V 30/10 Character recognition

Abstract

The invention provides a training method for a text recognition model, a text recognition method, a device and equipment. The training method of the text recognition model comprises the following steps: inputting a first text image into a convolutional neural network of an initial text recognition model to obtain a first output result, and obtaining a recognition model Loss value according to the first output result and the annotation information corresponding to the first text image; inputting the first output result into a language model to obtain a second output result, and obtaining a language model Loss value according to the second output result and the annotation information corresponding to the first text image; and updating the parameters of the initial text recognition model based on the recognition model Loss value and the language model Loss value until a converged text recognition model is obtained. The text recognition model trained by the method improves recognition speed while achieving better recognition accuracy.

Description

Training method of text recognition model, text recognition method, device and equipment
Technical Field
The invention relates to text recognition technology, and in particular to a training method for a text recognition model, a text recognition method, a device and equipment.
Background
Text detection and recognition have a wide range of applications and are pre-steps of many computer vision tasks, such as image search, identity authentication and visual navigation. The main purpose of text detection is to locate text lines or characters in an image, while text recognition transcribes an image containing a text line into a character string (i.e. recognizes the content of the string). Accurately locating and accurately recognizing text are both important and challenging: compared with general object detection and recognition, characters exhibit multiple orientations, irregular shapes, extreme aspect ratios, and varied fonts, colors and backgrounds, so algorithms that succeed in general object detection and recognition cannot be directly transferred to character detection.
The recognition effect of existing text recognition models and methods is influenced by many factors; it is difficult to achieve both recognition speed and recognition accuracy, and the requirements of rapidly developing computer vision tasks cannot be met.
Disclosure of Invention
In order to solve at least one of the above technical problems, the present invention provides a training method for a text recognition model, a text recognition method, a device and equipment.
The technical scheme of the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides a method for training a text recognition model, including:
inputting a first text image into a convolutional neural network of an initial text recognition model to obtain a first output result, and obtaining a recognition model Loss value according to the first output result and annotation information corresponding to the first text image;
inputting the first output result into a language model to obtain a second output result, and acquiring a Loss value of the language model according to the second output result and the annotation information corresponding to the first text image;
and updating the parameters of the initial text recognition model based on the recognition model Loss value and the language model Loss value until a convergent text recognition model is obtained.
In one embodiment, before inputting the first output result to the language model and obtaining the second output result, the method further includes: updating the language model as follows:
inputting a second text image into the convolutional neural network of the initial text recognition model to obtain a third output result;
inputting the third output result into an initial language model to obtain a fourth output result, obtaining a language loss value according to the fourth output result and the labeling information corresponding to the second text image, and adjusting the parameters of the initial language model based on the language loss value to obtain an updated language model.
In one embodiment, updating the parameters of the initial text recognition model based on the recognition model Loss value and the language model Loss value includes:
adding the recognition model Loss value and the language model Loss value to obtain a total Loss value;
updating parameters of the initial text recognition model based on the total Loss value.
In one embodiment, the initial text recognition model is obtained by training as follows:
training a pre-constructed text recognition model based on a third text image and annotation information corresponding to the third text image to obtain the initial text recognition model;
the pre-constructed text recognition model is a text recognition model constructed based on a convolutional neural network and a Connectionist Temporal Classification (CTC) function module.
In one embodiment, the convolutional neural network comprises four Blocks arranged in sequence;
the inputting of the first text image into the convolutional neural network of the initial text recognition model to obtain a first output result includes:
inputting the first text image into the convolutional neural network, and performing feature extraction through each Block in the convolutional neural network to obtain a feature map;
respectively down-sampling the feature maps of the first three of the four Blocks to obtain down-sampled feature maps, wherein the down-sampled feature maps are the same size as the feature map output by the fourth Block;
and adding the down-sampled feature maps and the feature map output by the fourth Block point by point at corresponding position elements to obtain the first output result of the convolutional neural network.
In one embodiment, the initial language model is obtained by training as follows:
acquiring a word vector sequence of the annotated text sentence corresponding to a fourth text image;
and training a preset language model based on the word vector sequence and the annotated text sentence corresponding to the fourth text image to obtain the initial language model.
In one embodiment, the acquiring of the word vector sequence of the annotated text sentence corresponding to the fourth text image includes:
performing word embedding on the annotated text sentence corresponding to the fourth text image to obtain the word vector sequence.
In a second aspect, an embodiment of the present invention provides a text recognition method, including:
inputting a text image to be recognized into a pre-trained text recognition model for text recognition, and outputting a text recognition result of the text image to be recognized;
wherein the pre-trained text recognition model comprises: training the obtained text recognition model based on the method of any one of the embodiments of the first aspect.
In a third aspect, an embodiment of the present invention provides a training apparatus for a text recognition model, including:
the recognition model Loss value acquisition module is used for inputting the first text image into a convolutional neural network of an initial text recognition model to obtain a first output result, and acquiring a recognition model Loss value according to the first output result and the annotation information corresponding to the first text image;
the language model Loss value acquisition module is used for inputting the first output result into a language model, acquiring a second output result and acquiring a language model Loss value according to the second output result and the annotation information corresponding to the first text image;
and the text recognition model updating module is used for updating the parameters of the initial text recognition model based on the recognition model Loss value and the language model Loss value until a convergent text recognition model is obtained.
In one embodiment, the apparatus further comprises:
the language model updating module is used for inputting the first output result into the language model and updating the language model in the following mode before obtaining a second output result:
inputting a second text image into the convolutional neural network of the initial text recognition model to obtain a third output result;
inputting the third output result into an initial language model to obtain a fourth output result, obtaining a language loss value according to the fourth output result and the labeling information corresponding to the second text image, and adjusting the parameters of the initial language model based on the language loss value to obtain an updated language model.
In an embodiment, the text recognition model updating module is specifically configured to:
adding the recognition model Loss value and the language model Loss value to obtain a total Loss value;
updating parameters of the initial text recognition model based on the total Loss value.
In one embodiment, the initial text recognition model is obtained by training as follows:
training a pre-constructed text recognition model based on a third text image and annotation information corresponding to the third text image to obtain the initial text recognition model;
the pre-constructed text recognition model is a text recognition model constructed based on a convolutional neural network and a Connectionist Temporal Classification (CTC) function module.
In one embodiment, the convolutional neural network comprises four Blocks arranged in sequence;
the inputting of the first text image into the convolutional neural network of the initial text recognition model to obtain a first output result includes:
inputting the first text image into the convolutional neural network, and performing feature extraction through each Block in the convolutional neural network to obtain a feature map;
respectively down-sampling the feature maps of the first three of the four Blocks to obtain down-sampled feature maps, wherein the down-sampled feature maps are the same size as the feature map output by the fourth Block;
and adding the down-sampled feature maps and the feature map output by the fourth Block point by point at corresponding position elements to obtain the first output result of the convolutional neural network.
In one embodiment, the initial language model is obtained by training as follows:
acquiring a word vector sequence of the annotated text sentence corresponding to a fourth text image;
and training a preset language model based on the word vector sequence and the annotated text sentence corresponding to the fourth text image to obtain the initial language model.
In one embodiment, the acquiring of the word vector sequence of the annotated text sentence corresponding to the fourth text image includes:
performing word embedding on the annotated text sentence corresponding to the fourth text image to obtain the word vector sequence.
In a fourth aspect, an embodiment of the present invention provides a text recognition apparatus, including:
the text recognition module is used for inputting the text image to be recognized into a pre-trained text recognition model for text recognition and outputting a text recognition result of the text image to be recognized;
wherein the pre-trained text recognition model comprises: a text recognition model obtained by training based on the method of any one of the embodiments of the first aspect.
In a fifth aspect, an embodiment of the present invention provides a readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the method in any one of the above-mentioned aspects.
In a sixth aspect, an embodiment of the present invention provides an electronic device, including: a processor and a memory, the memory having stored therein instructions that are loaded and executed by the processor to implement the method of any of the above aspects.
The advantages or beneficial effects in the above technical solution at least include:
the technical scheme of the invention includes that a first text image is input into a convolutional neural network of a text recognition model to obtain a first output result, and a Loss value of the recognition model is obtained according to the first output result and label information corresponding to the first text image; the first output result is input into the language model to obtain a second output result, the language model Loss value is obtained according to the second output result and the labeling information corresponding to the first text image, and the parameters of the text recognition model are updated according to the recognition model Loss value and the language model Loss value, so that the combined training and the mutual game of the text recognition model and the language model are realized, the text recognition model has the capability of modeling the relationship between characters indirectly, the recognition speed is improved, and meanwhile, the better recognition precision is realized.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the invention and together with the description serve to explain the principles of the invention.
FIG. 1 is a schematic flow chart of a method of training a text recognition model of the present invention;
FIG. 2 is a logic diagram of the training apparatus of the text recognition model of the present invention;
FIG. 3 is a flow chart illustrating a text recognition method of the present invention;
fig. 4 is a logic diagram of the text recognition apparatus of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and embodiments. It is to be understood that the specific embodiments described herein are for purposes of illustration only and are not to be construed as limitations of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
In addition, the embodiments of the present invention and the features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
It should be noted that, the step numbers in the text are only for convenience of explanation of the specific embodiments, and do not serve to limit the execution sequence of the steps. The method provided by the embodiment can be executed by a related server, and the following description takes an electronic device such as a server or a computer as an example of an execution subject.
Example one
Referring to fig. 1, the present embodiment provides a training method of a text recognition model, including:
inputting a first text image into a convolutional neural network of an initial text recognition model to obtain a first output result, and obtaining a Loss value (Loss value, value of a Loss function) of the recognition model according to the first output result and label information corresponding to the first text image;
inputting the first output result into a language model to obtain a second output result, and acquiring a Loss value of the language model according to the second output result and the annotation information corresponding to the first text image;
and updating the parameters of the initial text recognition model based on the recognition model Loss value and the language model Loss value until a convergent text recognition model is obtained.
In the embodiment of the invention, a text image is an image with text information; the text information comprises a sentence with a certain meaning or vocabulary information, and the annotation information is used to express the sentence content corresponding to the text image. The annotation information may be obtained by manually annotating the text in the image, and it constitutes the real data for training the text recognition model and the language model.
In this embodiment, a text recognition model and a language model are constructed. Based on the game-training idea of generative adversarial networks, the text recognition model is regarded as the generator, the recognition result output by the text recognition model is regarded as generated data, the annotation information corresponding to the text in the image is regarded as real data, and the language model is regarded as the discriminator.
In this embodiment, a first text image is input into the convolutional neural network of the text recognition model to obtain a first output result, and a recognition model Loss value is obtained according to the first output result and the annotation information corresponding to the first text image; the first output result is input into the language model to obtain a second output result, and a language model Loss value is obtained according to the second output result and the annotation information corresponding to the first text image. The parameters of the text recognition model are then updated according to the recognition model Loss value and the language model Loss value, realizing joint training and a mutual game between the text recognition model and the language model. The text recognition model thereby indirectly gains the ability to model the relationships between characters, improving recognition speed while achieving better recognition accuracy.
As a preferred implementation of this embodiment, updating the parameters of the initial text recognition model based on the recognition model Loss value and the language model Loss value includes: adding the recognition model Loss value and the language model Loss value to obtain a total Loss value, and updating the parameters of the initial text recognition model based on the total Loss value.
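The parameter update described above can be sketched in PyTorch as follows. This is an illustrative sketch, not the patented implementation: the function name, the optimizer, and the assumption that both Loss values are scalar tensors computed on the same first text image are all assumptions for the example.

```python
import torch

def update_recognizer(recognition_loss, language_loss, optimizer):
    """Add the two Loss values and update the recognizer's parameters.

    recognition_loss: scalar tensor from the CTC head (recognition model Loss value)
    language_loss:    scalar tensor from the language-model discriminator
    optimizer:        optimizer over the text recognition model's parameters
    """
    total_loss = recognition_loss + language_loss  # total Loss value
    optimizer.zero_grad()
    total_loss.backward()   # gradients flow through both Loss terms
    optimizer.step()        # update the initial text recognition model
    return total_loss.item()
```

In practice the two terms could also be weighted before summation; the patent only specifies a plain addition.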
As a preferred implementation of this embodiment, the initial text recognition model is a text recognition model obtained by training in the following manner:
training a pre-constructed text recognition model based on a third text image and label information corresponding to the third text image to obtain the initial text recognition model;
the pre-constructed text recognition model is constructed based on the convolutional neural network and a Connectionist Temporal Classification (CTC) function module; the convolutional neural network extracts features of the input image, and the CTC function module is used for acquiring the recognition model Loss value.
The text recognition model of this embodiment is similar in principle to Rosetta: it uses only the convolutional neural network and the CTC function module, which is equivalent to a CRNN (Convolutional Recurrent Neural Network) with the bidirectional recurrent neural network removed, so the text recognition model has a faster recognition speed.
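The recognition model Loss value of such a CNN-plus-CTC recognizer can be computed with a standard CTC loss. The following PyTorch sketch is illustrative only; the shapes (T time steps, B batch, N dictionary size including the blank) and the choice of blank index are assumptions, not details fixed by the patent.

```python
import torch

# CTC function module: blank symbol at index 0 is an assumption for this sketch.
ctc_loss = torch.nn.CTCLoss(blank=0, zero_infinity=True)

def recognition_loss(log_probs, targets, input_lengths, target_lengths):
    """Recognition model Loss value from the convolutional backbone's output.

    log_probs: (T, B, N) per-time-step log-softmax scores
    targets:   (B, S) annotation label indices (no blanks)
    """
    return ctc_loss(log_probs, targets, input_lengths, target_lengths)
```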
As a preferred implementation of this embodiment, the convolutional neural network of the text recognition model uses Resnet18, where the Resnet comprises four Blocks arranged in sequence, for example four Blocks connected in series, and features of the image are extracted through these Blocks. In a conventional Resnet, each Block includes several convolution layers, and the four Blocks output feature maps at 1/4, 1/8, 1/16 and 1/32 of the original input image size, respectively.
In the embodiment of the invention, unlike a conventional Resnet, the first text image is input into the convolutional neural network and feature extraction is performed by each Block. After the feature maps are obtained, the feature maps of the first three of the four Blocks are respectively down-sampled to obtain down-sampled feature maps, so that they are the same size as the feature map output by the fourth Block, i.e. 1/32 of the original input image; the outputs of the four Blocks are then added point by point at corresponding position elements to serve as the output of the convolutional neural network, giving the first output result.
According to the embodiment of the invention, the feature maps of the first three of the four Blocks are down-sampled so that the feature maps output by all four Blocks have the same size and their corresponding position elements can be added point by point. The output of the convolutional neural network thus forms an L x N matrix, so that the first output result can be used as the input of the language model for calculating the language model Loss value, where L represents the number of characters of the longest recognizable string and N represents the size of the dictionary.
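The down-sampling and point-by-point fusion described above can be sketched as follows. This is an illustrative sketch: the patent does not specify the down-sampling operator, so average pooling is assumed here, and the channel-first layout is an assumption of the example.

```python
import numpy as np

def avg_pool(x, k):
    """Down-sample a (C, H, W) feature map by a factor k via average pooling."""
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))

def fuse_blocks(f4, f8, f16, f32):
    """Fuse the four Block outputs (at 1/4, 1/8, 1/16, 1/32 resolution).

    The first three maps are down-sampled to the 1/32 resolution of the
    fourth Block, then all four are added point by point at corresponding
    position elements.
    """
    pooled = [avg_pool(f4, 8), avg_pool(f8, 4), avg_pool(f16, 2), f32]
    return sum(pooled)
```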
In an optional implementation of this embodiment, the language model is obtained by extensive pre-training on a specific scene in advance and can accurately judge whether its input is a reasonable sentence. In this case, a language model already trained for the specific scene can be used directly, reducing the training of the language model and improving the training efficiency of the text recognition model.
In an optional implementation manner of this embodiment, the language model and the text recognition model are synchronously trained in a combined manner, specifically, before the first output result is input to the language model and the second output result is obtained, the language model is updated as follows:
inputting the second text image into a convolutional neural network of the initial text recognition model to obtain a third output result;
and inputting the third output result into the initial language model to obtain a fourth output result, obtaining a language Loss value (Loss value) according to the fourth output result and the labeling information corresponding to the second text image, and adjusting parameters of the initial language model based on the language Loss value to obtain the updated language model.
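The language-model update step above can be sketched as follows. This is a hedged sketch: the discriminator interface, the binary cross-entropy loss against the annotation-derived target, and all names are illustrative assumptions, not the patent's fixed implementation.

```python
import torch

def update_language_model(lm, lm_optimizer, recognizer_output, target):
    """Adjust only the language model's parameters based on the language Loss value.

    lm:                language model returning P(input is a reasonable sentence)
    recognizer_output: third output result from the recognizer's CNN
    target:            1.0 for real (annotated) sentences, 0.0 for generated ones
    """
    score = lm(recognizer_output)
    loss = torch.nn.functional.binary_cross_entropy(score, target)
    lm_optimizer.zero_grad()
    loss.backward()
    lm_optimizer.step()  # only the language model's parameters are updated
    return loss.item()
```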
In this embodiment, through the joint training of the text recognition model and the language model, the training method can be applied to any scene; because the text recognition model and the language model are jointly trained on the same scene and improve synchronously, the obtained text recognition model can achieve higher recognition accuracy for that scene.
As a preferred embodiment, the initial language model is obtained by training according to the following way:
acquiring a word vector sequence of the annotated text sentence corresponding to a fourth text image; and training a preset language model based on the word vector sequence and the annotated text sentence corresponding to the fourth text image to obtain the initial language model.
The acquiring of the word vector sequence of the annotated text sentence corresponding to the fourth text image includes: performing word embedding on the annotated text sentence corresponding to the fourth text image to obtain the word vector sequence.
The language model is a preset language model constructed based on a two-layer bidirectional LSTM (Long Short-Term Memory) model; by preliminarily training the preset language model, the constructed language model gains the ability to preliminarily judge whether its input is a reasonable sentence.
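A minimal sketch of such a preset language model follows. The hidden size and the final projection layer are illustrative assumptions; the patent only fixes the two-layer bidirectional LSTM structure and the N-dimensional per-node input and output.

```python
import torch

class PresetLanguageModel(torch.nn.Module):
    """Two-layer bidirectional LSTM over N-dimensional vectors (sketch)."""

    def __init__(self, dict_size, hidden=32):
        super().__init__()
        self.lstm = torch.nn.LSTM(dict_size, hidden, num_layers=2,
                                  bidirectional=True, batch_first=True)
        # Project the concatenated forward/backward states back to N dims,
        # so each output node corresponds to an N-dimensional vector.
        self.proj = torch.nn.Linear(2 * hidden, dict_size)

    def forward(self, x):          # x: (batch, L, N)
        out, _ = self.lstm(x)
        return self.proj(out)      # (batch, L, N)
```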
In this embodiment, a large number of fourth text images with text information may be annotated to obtain annotation information, and the word vector of each word of each piece of annotation information is obtained by applying Word2vec (Word to vector) or another word embedding method to the annotation information.
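Turning an annotated sentence into a word vector sequence can be sketched as follows. The tiny vocabulary and the use of a learned `torch.nn.Embedding` table (into which pre-trained Word2vec vectors could be loaded) are illustrative assumptions for the example.

```python
import torch

# Hypothetical toy vocabulary; in practice it is built from the annotation data.
vocab = {"<unk>": 0, "hello": 1, "world": 2}
embedding = torch.nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

def sentence_to_vectors(words):
    """Map an annotated sentence (list of words) to its word vector sequence."""
    ids = torch.tensor([vocab.get(w, vocab["<unk>"]) for w in words])
    return embedding(ids)  # (sequence length, embedding dim)
```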
The input of the language model of this embodiment is an encoded character string, and the output is the probability that the character string is a reasonable sentence. The input of each node of the language model is an N-dimensional vector, the number of output nodes is the same as that of the input nodes, and each output node corresponds to an N-dimensional vector, where N is the size of the dictionary.
As an optional implementation, the preliminary training of the preset language model includes: using the word vectors corresponding to the annotation information of the fourth text images as N-dimensional input vectors of the language model, and using the annotation information as real data. The sentence corresponding to the maximum probability value is obtained through beam search (Beam Search) or greedy search, and training is performed with a binary cross-entropy loss function, finally obtaining a language model capable of preliminarily judging whether its input is a reasonable sentence.
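The greedy-search step mentioned above, picking the highest-probability dictionary entry at each of the L output positions, can be sketched as follows (the dictionary and probabilities are made-up examples):

```python
import numpy as np

def greedy_decode(probs, dictionary):
    """Greedy search: at each position, take the most probable dictionary entry.

    probs: (L, N) matrix of per-position probabilities over the dictionary
    """
    return [dictionary[i] for i in probs.argmax(axis=1)]
```

Beam search would instead keep the top-k partial sentences at each position; greedy search is the k = 1 special case.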
Thus, in this embodiment, the language model and the text recognition model are first obtained and trained separately, and are then jointly trained; the language model Loss value and the recognition model Loss value are obtained as the basis for updating the parameters of the text recognition model, realizing joint training and a mutual game between the two models and simultaneously improving the recognition accuracy of the text recognition model and the judgment accuracy of the language model. When the text recognition model and the language model reach game balance, the training of the text recognition model is complete. The trained model has the ability to indirectly model the relationships between characters, so when text recognition is performed with it, no additional modeling of the relationships between characters is needed, achieving high recognition accuracy while maintaining recognition speed.
Example two
Referring to fig. 2, the present embodiment provides a training apparatus for a text recognition model, including:
the recognition model Loss value acquisition module is used for inputting the first text image into a convolutional neural network of an initial text recognition model to obtain a first output result, and acquiring a recognition model Loss value according to the first output result and the annotation information corresponding to the first text image;
the language model Loss value acquisition module is used for inputting the first output result into a language model, acquiring a second output result and acquiring a language model Loss value according to the second output result and the annotation information corresponding to the first text image;
and the text recognition model updating module is used for updating the parameters of the initial text recognition model based on the recognition model Loss value and the language model Loss value until a convergent text recognition model is obtained.
As an optional implementation, the apparatus further comprises:
the language model updating module is used for updating the language model in the following manner before the first output result is input into the language model to obtain the second output result:
inputting a second text image into the convolutional neural network of the initial text recognition model to obtain a third output result;
and inputting the third output result into an initial language model to obtain a fourth output result, obtaining a language loss value according to the fourth output result and the labeling information corresponding to the second text image, and adjusting parameters of the initial language model based on the language loss value to obtain an updated language model.
As an optional implementation manner, the text recognition model updating module is specifically configured to:
adding the recognition model Loss value and the language model Loss value to obtain a total Loss value; and updating the parameters of the initial text recognition model based on the total Loss value.
As an alternative embodiment, the initial text recognition model is obtained by training according to the following manner:
training a pre-constructed text recognition model based on a third text image and label information corresponding to the third text image to obtain the initial text recognition model;
the pre-constructed text recognition model is a text recognition model constructed based on a convolutional neural network and a connectionist temporal classification (CTC) function module.
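A CTC function module scores every blank-padded alignment of the label sequence against the per-timestep character probabilities; the Loss is the negative log of the summed path probability. The numpy sketch of the standard CTC forward recursion below is illustrative only — the patent does not disclose its CTC implementation, and production systems use library versions of this computation.

```python
import numpy as np

def ctc_forward_prob(probs, labels, blank=0):
    """Probability that per-timestep distributions `probs` (T x V)
    emit `labels` after CTC collapsing (standard forward algorithm)."""
    ext = [blank]
    for c in labels:                     # interleave labels with blanks
        ext += [c, blank]
    T, S = probs.shape[0], len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]                            # stay in state
            if s > 0:
                a += alpha[t - 1, s - 1]                   # advance one state
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]                   # skip over a blank
            alpha[t, s] = a * probs[t, ext[s]]
    return alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)

# With two timesteps, vocabulary {blank, 'a'} and uniform probabilities,
# three of the four length-2 paths collapse to "a": (a,a), (-,a), (a,-),
# so the path probability is 3 * 0.25 = 0.75.
p = ctc_forward_prob(np.array([[0.5, 0.5], [0.5, 0.5]]), [1])
loss = -np.log(p)                        # CTC Loss for this toy example
```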
As an optional embodiment, the convolutional neural network comprises four Blocks arranged in sequence;
the inputting the first text image into the convolutional neural network of the initial text recognition model to obtain a first output result includes:
inputting the first text image into the convolutional neural network, and performing feature extraction through each Block in the convolutional neural network to obtain a feature map;
respectively down-sampling the feature maps of the first three of the four Blocks to obtain down-sampled feature maps, wherein each down-sampled feature map has the same size as the feature map output by the fourth Block;
and adding the down-sampled feature maps and the feature map output by the fourth Block element-wise at corresponding positions to obtain the first output result of the convolutional neural network.
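The fusion of the four Blocks' feature maps can be sketched as follows. The stride-2 halving of spatial size per Block and the use of average pooling as the down-sampling operator are assumptions — the embodiment only requires that the sizes match before the point-by-point addition.

```python
import numpy as np

def avg_pool2d(x, k):
    """Average-pool a (C, H, W) feature map with kernel and stride k."""
    C, H, W = x.shape
    return x.reshape(C, H // k, k, W // k, k).mean(axis=(2, 4))

def fuse_block_features(f1, f2, f3, f4):
    """Down-sample the first three Blocks' feature maps to the fourth
    Block's spatial size and add all four element-wise at corresponding
    positions, as described in the embodiment."""
    out = f4.copy()
    for f, k in ((f1, 8), (f2, 4), (f3, 2)):  # assumed stride-2 per Block
        out += avg_pool2d(f, k)
    return out

# Toy maps: one channel each, spatial size halving from Block to Block.
f1, f2, f3, f4 = (np.ones((1, s, s)) for s in (32, 16, 8, 4))
fused = fuse_block_features(f1, f2, f3, f4)
```

With all-ones inputs each pooled map is still all ones, so the fused map is 4.0 everywhere and keeps the fourth Block's spatial size.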
As an alternative embodiment, the initial language model is obtained by training as follows: acquiring a word vector sequence of the labeled text sentence corresponding to a fourth text image; and training a preset language model based on the word vector sequence and the labeled text sentence corresponding to the fourth text image to obtain the initial language model.
As an optional implementation manner, acquiring the word vector sequence of the labeled text sentence corresponding to the fourth text image includes: performing word embedding on the labeled text sentence corresponding to the fourth text image to obtain the word vector sequence.
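Word embedding of a labeled sentence amounts to looking up one vector per token in a trainable table. The character-level tokenization, the toy vocabulary, the embedding dimension, and the random initialization below are all illustrative assumptions; the patent does not specify them.

```python
import numpy as np

def embed_sentence(sentence, vocab, table):
    """Map each character of the labeled text sentence to its row of the
    embedding table, yielding the word vector sequence."""
    return table[[vocab[ch] for ch in sentence]]

rng = np.random.default_rng(0)
vocab = {ch: i for i, ch in enumerate("abcdef")}   # toy character vocabulary
table = rng.standard_normal((len(vocab), 8))       # 8-dim embeddings (assumed)
seq = embed_sentence("cafe", vocab, table)         # (4, 8) word vector sequence
```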
The principle and function of each module in the device of the present embodiment are the same as those in the first embodiment, and the description of the present embodiment is not repeated.
EXAMPLE III
Referring to fig. 3, the present embodiment provides a text recognition method, including:
inputting a text image to be recognized into a pre-trained text recognition model for text recognition, and outputting a text recognition result of the text image to be recognized;
wherein the pre-trained text recognition model is a text recognition model obtained by training based on the above training method of the text recognition model.
In this embodiment, the text recognition model is trained based on the training method in any implementation manner of the first embodiment, yielding a text recognition model that indirectly models the relationships between characters. This model can accurately recognize text information in an image: the text image to be recognized is used as input, and the final recognition result is obtained directly from the text recognition model, so that recognition speed and recognition precision are improved at the same time.
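At inference time a CTC-trained recognizer can emit the final text from a single forward pass followed by a cheap greedy decode: take the best class per timestep, collapse repeats, and drop blanks. The sketch below assumes blank index 0 and best-path (non-beam) decoding; the patent does not specify the decoding strategy.

```python
import numpy as np

def ctc_greedy_decode(probs, charset, blank=0):
    """Best-path CTC decode: argmax per timestep, collapse consecutive
    repeats, then remove blanks. `charset[i]` is the character for class i."""
    out, prev = [], blank
    for idx in probs.argmax(axis=1):
        if idx != blank and idx != prev:
            out.append(charset[idx])
        prev = idx
    return "".join(out)

# Per-timestep classes a, a, blank, b, b over the charset "-ab"
# (index 0 is the blank); the blank separates nothing here, so the
# repeats collapse and the decode is "ab".
probs = np.eye(3)[[1, 1, 0, 2, 2]]
text = ctc_greedy_decode(probs, "-ab")
```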
Example four
Referring to fig. 4, the present embodiment provides a text recognition apparatus, including:
the text recognition module is used for inputting the text image to be recognized into a pre-trained text recognition model for text recognition and outputting a text recognition result of the text image to be recognized;
wherein the pre-trained text recognition model is a text recognition model obtained by training based on the above training method of the text recognition model.
EXAMPLE five
The present embodiment provides a readable storage medium in which a computer program is stored; the computer program, when executed by a processor, implements the method in any one of the above embodiments. The storage medium may include various media capable of storing program code, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
EXAMPLE six
The present embodiment provides an electronic device, including: a processor and a memory, the memory storing instructions therein, the instructions being loaded and executed by the processor to implement the method of any of the above embodiments.
It should be understood that the processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general-purpose processor may be a microprocessor or any conventional processor. It is noted that the processor may be a processor supporting the Advanced RISC Machine (ARM) architecture.
The memory may include read-only memory and random access memory, and may also include non-volatile random access memory. The memory may be volatile memory or non-volatile memory, or may include both. Non-volatile memory may include Read-Only Memory (ROM), Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), or flash memory. Volatile memory may include Random Access Memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. All or part of the steps of the methods of the above embodiments may be implemented by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, includes one of or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, each unit may exist alone physically, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If implemented in the form of a software functional module and sold or used as a separate product, the integrated module may also be stored in a computer-readable storage medium. The storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present invention includes additional implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved.
The logic and/or steps represented in the flowcharts or otherwise described herein, for example, an ordered list of executable instructions that can be considered to implement logical functions, may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them.
Furthermore, the terms "first", "second", "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
It will be understood by those skilled in the art that the foregoing embodiments are merely for clarity of description and are not intended to limit the scope of the invention. Other variations or modifications will occur to those skilled in the art based on the foregoing disclosure and are within the scope of the invention.

Claims (16)

1. A training method of a text recognition model is characterized by comprising the following steps:
inputting a first text image into a convolutional neural network of an initial text recognition model to obtain a first output result, and obtaining a recognition model Loss value according to the first output result and the annotation information corresponding to the first text image;
inputting a second text image into the convolutional neural network of the initial text recognition model to obtain a third output result;
inputting the third output result into an initial language model to obtain a fourth output result, obtaining a language loss value according to the fourth output result and the annotation information corresponding to the second text image, and adjusting parameters of the initial language model based on the language loss value to obtain an updated language model;
inputting the first output result into the updated language model to obtain a second output result, and obtaining a language model Loss value according to the second output result and the annotation information corresponding to the first text image;
and updating the parameters of the initial text recognition model based on the recognition model Loss value and the language model Loss value until a converged text recognition model is obtained.
2. The method of claim 1,
updating parameters of the initial text recognition model based on the recognition model Loss value and the language model Loss value, including:
adding the recognition model Loss value and the language model Loss value to obtain a total Loss value;
updating parameters of the initial text recognition model based on the total Loss value.
3. The method of claim 1,
the initial text recognition model is obtained by training according to the following mode:
training a pre-constructed text recognition model based on a third text image and label information corresponding to the third text image to obtain the initial text recognition model;
the pre-constructed text recognition model is a text recognition model constructed based on a convolutional neural network and a connectionist temporal classification (CTC) function module.
4. The method of claim 1,
the convolutional neural network comprises four Blocks arranged in sequence;
the inputting the first text image into the convolutional neural network of the initial text recognition model to obtain a first output result comprises:
inputting the first text image into the convolutional neural network, and performing feature extraction through each Block in the convolutional neural network to obtain a feature map;
respectively down-sampling the feature maps of the first three of the four Blocks to obtain down-sampled feature maps, wherein each down-sampled feature map has the same size as the feature map output by the fourth Block;
and adding the down-sampled feature maps and the feature map output by the fourth Block element-wise at corresponding positions to obtain the first output result of the convolutional neural network.
5. The method of claim 1, wherein the initial language model is obtained by training as follows:
acquiring a word vector sequence of the labeled text sentence corresponding to a fourth text image;
and training a preset language model based on the word vector sequence and the labeled text sentence corresponding to the fourth text image to obtain the initial language model.
6. The method of claim 5, wherein obtaining the word vector sequence of the labeled text sentence corresponding to the fourth text image comprises:
performing word embedding on the labeled text sentence corresponding to the fourth text image to obtain the word vector sequence.
7. A text recognition method, comprising:
inputting a text image to be recognized into a pre-trained text recognition model for text recognition, and outputting a text recognition result of the text image to be recognized;
wherein the pre-trained text recognition model is a text recognition model obtained by training based on the method of any one of claims 1 to 6.
8. An apparatus for training a text recognition model, comprising:
the recognition model Loss value acquisition module is used for inputting a first text image into a convolutional neural network of an initial text recognition model to obtain a first output result, and acquiring a recognition model Loss value according to the first output result and the annotation information corresponding to the first text image;
the language model Loss value acquisition module is used for inputting the first output result into a language model to obtain a second output result, and acquiring a language model Loss value according to the second output result and the annotation information corresponding to the first text image;
the text recognition model updating module is used for updating the parameters of the initial text recognition model based on the recognition model Loss value and the language model Loss value until a convergent text recognition model is obtained;
the language model updating module is used for updating the language model in the following manner before the first output result is input into the language model to obtain the second output result:
inputting a second text image into the convolutional neural network of the initial text recognition model to obtain a third output result;
inputting the third output result into an initial language model to obtain a fourth output result, obtaining a language loss value according to the fourth output result and the labeling information corresponding to the second text image, and adjusting the parameters of the initial language model based on the language loss value to obtain an updated language model.
9. The apparatus of claim 8, wherein the text recognition model update module is specifically configured to:
adding the recognition model Loss value and the language model Loss value to obtain a total Loss value;
updating parameters of the initial text recognition model based on the total Loss value.
10. The apparatus of claim 8,
the initial text recognition model is obtained by training according to the following mode:
training a pre-constructed text recognition model based on a third text image and label information corresponding to the third text image to obtain the initial text recognition model;
the pre-constructed text recognition model is a text recognition model constructed based on a convolutional neural network and a connectionist temporal classification (CTC) function module.
11. The apparatus of claim 8,
the convolutional neural network comprises four Blocks arranged in sequence;
the inputting the first text image into the convolutional neural network of the initial text recognition model to obtain a first output result comprises:
inputting the first text image into the convolutional neural network, and performing feature extraction through each Block in the convolutional neural network to obtain a feature map;
respectively down-sampling the feature maps of the first three of the four Blocks to obtain down-sampled feature maps, wherein each down-sampled feature map has the same size as the feature map output by the fourth Block;
and adding the down-sampled feature maps and the feature map output by the fourth Block element-wise at corresponding positions to obtain the first output result of the convolutional neural network.
12. The apparatus of claim 8, wherein the initial language model is obtained by training as follows:
acquiring a word vector sequence of the labeled text sentence corresponding to a fourth text image;
and training a preset language model based on the word vector sequence and the labeled text sentence corresponding to the fourth text image to obtain the initial language model.
13. The apparatus of claim 12, wherein said obtaining a word vector sequence of the labeled text sentence corresponding to the fourth text image comprises:
performing word embedding on the labeled text sentence corresponding to the fourth text image to obtain the word vector sequence.
14. A text recognition apparatus, comprising:
the text recognition module is used for inputting the text image to be recognized into a pre-trained text recognition model for text recognition and outputting a text recognition result of the text image to be recognized;
wherein the pre-trained text recognition model is a text recognition model obtained by training based on the method of any one of claims 1 to 6.
15. A readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
16. An electronic device, comprising: a processor and a memory, the memory having stored therein instructions that are loaded and executed by the processor to implement the method of any of claims 1 to 7.
CN202110258981.XA 2021-03-10 2021-03-10 Training method of text recognition model, text recognition method, device and equipment Active CN112633423B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110258981.XA CN112633423B (en) 2021-03-10 2021-03-10 Training method of text recognition model, text recognition method, device and equipment

Publications (2)

Publication Number Publication Date
CN112633423A CN112633423A (en) 2021-04-09
CN112633423B true CN112633423B (en) 2021-06-22

Family

ID=75297827

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110258981.XA Active CN112633423B (en) 2021-03-10 2021-03-10 Training method of text recognition model, text recognition method, device and equipment

Country Status (1)

Country Link
CN (1) CN112633423B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230004741A1 (en) * 2021-06-30 2023-01-05 Konica Minolta Business Solutions U.S.A., Inc. Handwriting recognition method and apparatus employing content aware and style aware data augmentation
CN113963359B (en) * 2021-12-20 2022-03-18 北京易真学思教育科技有限公司 Text recognition model training method, text recognition device and electronic equipment
CN113963358B (en) * 2021-12-20 2022-03-04 北京易真学思教育科技有限公司 Text recognition model training method, text recognition device and electronic equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018125926A1 (en) * 2016-12-27 2018-07-05 Datalogic Usa, Inc Robust string text detection for industrial optical character recognition
EP3660733B1 (en) * 2018-11-30 2023-06-28 Tata Consultancy Services Limited Method and system for information extraction from document images using conversational interface and database querying
CN111259768A (en) * 2020-01-13 2020-06-09 清华大学 Image target positioning method based on attention mechanism and combined with natural language
CN111401375B (en) * 2020-03-09 2022-12-30 苏宁云计算有限公司 Text recognition model training method, text recognition device and text recognition equipment
CN111738251B (en) * 2020-08-26 2020-12-04 北京智源人工智能研究院 Optical character recognition method and device fused with language model and electronic equipment

Also Published As

Publication number Publication date
CN112633423A (en) 2021-04-09

Similar Documents

Publication Publication Date Title
CN112633423B (en) Training method of text recognition model, text recognition method, device and equipment
CN110795543B (en) Unstructured data extraction method, device and storage medium based on deep learning
KR20190085098A (en) Keyword extraction method, computer device, and storage medium
CN110188223A (en) Image processing method, device and computer equipment
Wilkinson et al. Neural Ctrl-F: segmentation-free query-by-string word spotting in handwritten manuscript collections
CN110363049B (en) Method and device for detecting, identifying and determining categories of graphic elements
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
WO2022116436A1 (en) Text semantic matching method and apparatus for long and short sentences, computer device and storage medium
CN112380837B (en) Similar sentence matching method, device, equipment and medium based on translation model
CN112633422B (en) Training method of text recognition model, text recognition method, device and equipment
CN113297366B (en) Emotion recognition model training method, device, equipment and medium for multi-round dialogue
CN109522740B (en) Health data privacy removal processing method and system
CN113094478B (en) Expression reply method, device, equipment and storage medium
CN113656547B (en) Text matching method, device, equipment and storage medium
CN112966685B (en) Attack network training method and device for scene text recognition and related equipment
CN111738270B (en) Model generation method, device, equipment and readable storage medium
US11507744B2 (en) Information processing apparatus, information processing method, and computer-readable recording medium
US11687712B2 (en) Lexical analysis training of convolutional neural network by windows of different lengths with matrix of semantic vectors
CN113449489B (en) Punctuation mark labeling method, punctuation mark labeling device, computer equipment and storage medium
CN115393625A (en) Semi-supervised training of image segmentation from coarse markers
CN111357015B (en) Text conversion method, apparatus, computer device, and computer-readable storage medium
CN113239967A (en) Character recognition model training method, recognition method, related equipment and storage medium
US20230130662A1 (en) Method and apparatus for analyzing multimodal data
CN111680132A (en) Noise filtering and automatic classifying method for internet text information
CN108829896B (en) Reply information feedback method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant