CN112270316B - Character recognition, training method and device of character recognition model and electronic equipment - Google Patents


Info

Publication number
CN112270316B
CN112270316B (application CN202011012497.0A, grant CN 112270316 B)
Authority
CN
China
Prior art keywords
neural network
target
recurrent neural
character recognition
recurrent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011012497.0A
Other languages
Chinese (zh)
Other versions
CN112270316A (en)
Inventor
张婕蕾
万昭祎
姚聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kuangshi Technology Co Ltd
Original Assignee
Beijing Kuangshi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kuangshi Technology Co Ltd filed Critical Beijing Kuangshi Technology Co Ltd
Priority to CN202011012497.0A priority Critical patent/CN112270316B/en
Publication of CN112270316A publication Critical patent/CN112270316A/en
Application granted granted Critical
Publication of CN112270316B publication Critical patent/CN112270316B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/22Character recognition characterised by the type of writing
    • G06V30/224Character recognition characterised by the type of writing of printed characters having additional code marks or containing code marks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Multimedia (AREA)
  • Character Discrimination (AREA)

Abstract

The invention provides a character recognition method, a training method and device for a character recognition model, and electronic equipment, relating to the technical field of image processing. The method comprises the following steps: processing the feature vector of the image to be recognized through an attention model to obtain an attention weight value for each recurrent neural network; determining target input parameters for each recurrent neural network, where the target input parameters include: the feature vector of the image to be recognized, or the feature vector of the image to be recognized together with the character recognition result output by the recurrent neural network preceding the current one; inputting the target input parameters and the attention weight values into each recurrent neural network for processing to obtain character recognition results; and determining the character recognition result output by the last recurrent neural network as the character recognition result of the image to be recognized.

Description

Character recognition, training method and device of character recognition model and electronic equipment
Technical Field
The present invention relates to the field of image processing technologies, and in particular to a character recognition method, a training method and apparatus for a character recognition model, and an electronic device.
Background
In recent years, scene character recognition has found increasingly wide application in the field of pattern recognition, including image retrieval, intelligent transportation, human-computer interaction and other fields.
Scene character recognition has been widely studied in recent decades; more and more methods are now used for it, and their accuracy keeps improving. However, existing scene text recognition methods exhibit vocabulary dependency; that is, the output of a scene text recognition model is often affected by the corpus of the training set. For example, as shown in fig. 1, the two left images are training-set corpus samples and the two right images are pictures to be recognized. As the right images show, the model recognizes "UNIVERSII" as "UNIVERSITY", which indicates that the model is influenced by the training-set corpus, leading to a recognition error.
Disclosure of Invention
Accordingly, the invention aims to provide a character recognition method, a training method and device for a character recognition model, and electronic equipment, so as to alleviate the technical problem that existing scene character recognition models are easily influenced by the training-set corpus, resulting in low recognition accuracy.
In a first aspect, an embodiment of the present invention provides a character recognition method applied to a character recognition model, where the character recognition model includes: an attention model and a plurality of recurrent neural networks, the attention model is connected to each recurrent neural network, and the recurrent neural networks are connected in series, wherein the input data of some or all of the recurrent neural networks does not contain the output data of the preceding recurrent neural network connected to it. The method includes: processing the feature vector of the image to be recognized through the attention model to obtain the attention weight value of each recurrent neural network; determining target input parameters of each recurrent neural network, where the target input parameters include: the feature vector of the image to be recognized, or the feature vector of the image to be recognized and the character recognition result output by the recurrent neural network preceding the current one; and inputting the target input parameters and the attention weight values into each recurrent neural network for processing to obtain a character recognition result, and determining the character recognition result output by the last recurrent neural network as the character recognition result of the image to be recognized, where the character recognition result represents the probability that the character to be recognized belongs to each preset character.
Further, determining the target input parameters of each recurrent neural network includes: if it is determined that a corresponding target probability has been preset for each recurrent neural network, judging whether the target probability is greater than or equal to a preset probability threshold, where the target probability is used to determine whether the target input parameters contain the character recognition result output by the preceding recurrent neural network; and if the target probability is greater than or equal to the preset probability threshold, determining that the target input parameters of the recurrent neural network include the character recognition result output by the preceding recurrent neural network and the feature vector of the image to be recognized.
Further, determining the target probability corresponding to each recurrent neural network includes: randomly generating the target probability for each recurrent neural network by a probability generator; or randomly generating the target probability for each recurrent neural network through a target neural network, where the input parameters of the target neural network include: the position information of the recurrent neural network among the plurality of recurrent neural networks, the attention weight value of the recurrent neural network, and the feature vector of the image to be recognized.
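The two generation options above can be sketched in Python. This is a hypothetical illustration: the patent does not specify the target neural network, so the logistic combination below is our stand-in for it, and all names are ours.

```python
import math
import random

def probability_from_generator(rng):
    # option 1: a probability generator draws the target probability at random
    return rng.random()

def probability_from_network(position, attn_weight, feature_mean):
    # option 2 (hypothetical stand-in for the target neural network):
    # squash a combination of the network's position, its attention weight
    # and a summary of the feature vector into the interval (0, 1)
    z = 0.1 * position + attn_weight + feature_mean
    return 1.0 / (1.0 + math.exp(-z))

rng = random.Random(0)
p1 = probability_from_generator(rng)
p2 = probability_from_network(position=2, attn_weight=0.3, feature_mean=0.1)
assert 0.0 <= p1 < 1.0
assert 0.0 < p2 < 1.0
```

Either option yields a per-network probability that is then compared against the preset threshold described in the preceding paragraph.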
Further, if the input data of all the first recurrent neural networks does not include the output data of the preceding first recurrent neural network connected to it, the character recognition model further includes a target language model. The target language model includes a plurality of second recurrent neural networks connected in series; the input data of every second recurrent neural network includes the output data of the preceding second recurrent neural network connected to it, and the second recurrent neural networks are connected with the first recurrent neural networks in one-to-one correspondence.
In a second aspect, an embodiment of the present invention provides a training method for a character recognition model, where the character recognition model includes: an attention model and a plurality of first recurrent neural networks, the attention model is connected to each first recurrent neural network, and the first recurrent neural networks are connected in series, wherein the input data of some or all of the first recurrent neural networks does not contain the output data of the preceding first recurrent neural network connected to it. The method includes: processing the feature vectors of the training-set corpus through the attention model to obtain the attention weight value of each first recurrent neural network; determining target input parameters of each first recurrent neural network, where the target input parameters include: the feature vector of the training-set corpus, or the feature vector of the training-set corpus and the character recognition result output by the first recurrent neural network preceding the target one; and training the character recognition model using the target input parameters, the attention weight values and target label information to obtain the trained character recognition model, where the target label information is the actual character sequence contained in the training-set corpus.
Further, determining the target input parameters of each first recurrent neural network includes: if it is determined that a corresponding target probability has been preset for each first recurrent neural network, judging whether the target probability is greater than or equal to a preset probability threshold, where the target probability is used to determine whether the target input parameters contain the character recognition result output by the preceding first recurrent neural network; and if the target probability is greater than or equal to the preset probability threshold, determining that the target input parameters of the target first recurrent neural network include the character recognition result output by the preceding first recurrent neural network and the feature vector of the training-set corpus.
Further, the method further includes: randomly generating the target probability for each first recurrent neural network by a probability generator; or randomly generating the target probability for each first recurrent neural network through a target neural network, where the input parameters of the target neural network include: the position information of the target first recurrent neural network among the plurality of first recurrent neural networks, the attention weight value of the target first recurrent neural network, and the feature vector of the training-set corpus.
Further, if the input data of all the first recurrent neural networks among the plurality of first recurrent neural networks does not include the output data of the preceding first recurrent neural network connected to it, the character recognition model further includes a target language model. The target language model includes a plurality of second recurrent neural networks connected in series; the input data of every second recurrent neural network includes the output data of the preceding second recurrent neural network connected to it, and the second recurrent neural networks are connected with the first recurrent neural networks in one-to-one correspondence.
Further, the method further includes: acquiring the character recognition result output by the last first recurrent neural network among the plurality of first recurrent neural networks to obtain a first output result; acquiring the character recognition result output by the last second recurrent neural network among the plurality of second recurrent neural networks to obtain a second output result; calculating a target loss value using the first output result and the second output result; and training the character recognition model with the target loss value.
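The patent states only that both branch outputs enter the target loss value, without fixing its form. One plausible realisation, sketched here under that assumption (the per-branch cross-entropy terms and their summation are ours), is:

```python
import math

def cross_entropy(probs, label_index):
    # negative log-probability assigned to the correct character
    return -math.log(probs[label_index])

def target_loss(first_output, second_output, label_index):
    # hypothetical combination of the first and second output results:
    # one cross-entropy term per branch, summed into the target loss value
    return (cross_entropy(first_output, label_index)
            + cross_entropy(second_output, label_index))

# toy per-character probability distributions from the two branches
loss = target_loss([0.7, 0.2, 0.1], [0.6, 0.3, 0.1], 0)
assert loss > 0.0
```

Training with such a joint loss lets the feedback-free first branch and the language-model second branch be optimised together.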
Further, the recurrent neural network is a long short-term memory (LSTM) network.
In a third aspect, an embodiment of the present invention provides a character recognition device applied to a character recognition model, where the character recognition model includes: an attention model and a plurality of recurrent neural networks, the attention model is connected to each recurrent neural network, and the recurrent neural networks are connected in series, wherein the input data of some or all of the recurrent neural networks does not contain the output data of the preceding recurrent neural network connected to it. The device includes: a first processing unit, configured to process the feature vector of the image to be recognized through the attention model to obtain an attention weight value for each recurrent neural network; a first determining unit, configured to determine target input parameters of each recurrent neural network, where the target input parameters include: the feature vector of the image to be recognized, or the feature vector of the image to be recognized and the character recognition result output by the recurrent neural network preceding the current one; and a second processing unit, configured to input the target input parameters and the attention weight values into each recurrent neural network for processing to obtain character recognition results, and to determine the character recognition result output by the last recurrent neural network as the character recognition result of the image to be recognized, where the character recognition result represents the probability that the character to be recognized belongs to each preset character.
In a fourth aspect, an embodiment of the present invention provides a training device for a character recognition model, where the character recognition model includes: an attention model and a plurality of first recurrent neural networks, the attention model is connected to each first recurrent neural network, and the first recurrent neural networks are connected in series, wherein the input data of some or all of the first recurrent neural networks does not contain the output data of the preceding first recurrent neural network connected to it. The device includes: a third processing unit, configured to process the feature vectors of the training-set corpus through the attention model to obtain the attention weight value of each first recurrent neural network; a second determining unit, configured to determine target input parameters of each first recurrent neural network, where the target input parameters include: the feature vector of the training-set corpus, or the feature vector of the training-set corpus and the character recognition result output by the first recurrent neural network preceding the target one; and a training unit, configured to train the character recognition model using the target input parameters, the attention weight values and target label information to obtain the trained character recognition model, where the target label information is the actual character sequence contained in the training-set corpus.
In a fifth aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the method according to any one of the first or second aspects when executing the computer program.
In a sixth aspect, embodiments of the present invention provide a computer readable medium having non-volatile program code executable by a processor, the program code causing the processor to perform the steps of the method of any one of the first or second aspects above.
As can be seen from the above description, existing scene text recognition models are easily affected by the training-set corpus. The inventors found that, because the output data of each recognition step of a scene text recognition model becomes the input of the next step, the model performs a certain amount of sequence modeling; that is, it implicitly builds a language model. This language-model structure causes the model to show a strong vocabulary dependency. Based on this, the present application proposes a character recognition method.
In the character recognition method provided by the embodiments of the invention, a character recognition model performs character recognition on the image to be recognized, and the input data of some or all of the recurrent neural networks in the model no longer includes the output data of the preceding recurrent neural network connected to them. This reduces the vocabulary dependency of the recurrent neural networks during character recognition, thereby alleviating the technical problem that existing scene character recognition models are easily influenced by the training-set corpus, resulting in low recognition accuracy.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
In order to make the above objects, features and advantages of the present invention more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a text recognition result of the prior art;
fig. 2 is a schematic structural view of an electronic device according to an embodiment of the present invention;
FIG. 3 is a flow chart of a text recognition method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a first character recognition model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a second character recognition model according to an embodiment of the present invention;
FIG. 6 is a flow chart of a training method for a character recognition model according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a third character recognition model according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a character recognition device according to an embodiment of the present invention;
fig. 9 is a schematic diagram of a training device for a character recognition model according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1:
first, an electronic device 100 for implementing an embodiment of the present invention, which may be used to run the text recognition method or the training method of the text recognition model of the embodiments of the present invention, will be described with reference to fig. 2.
As shown in fig. 2, the electronic device 100 includes one or more processors 102, one or more memories 104. Optionally, the electronic device 100 may also include an input device 106, an output device 108, and an image capture device 110, which are interconnected by a bus system 112 and/or other forms of connection mechanisms (not shown). It should be noted that the components and structures of the electronic device 100 shown in fig. 2 are exemplary only and not limiting, and that the electronic device may also have some of the components shown in fig. 2 or other components and structures not shown in fig. 2, as desired.
The processor 102 may be implemented in hardware as at least one of a digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic array (PLA) and an application-specific integrated circuit (ASIC); the processor 102 may be a central processing unit (CPU) or another form of processing unit having data processing and/or instruction execution capabilities, and may control other components in the electronic device 100 to perform desired functions.
The memory 104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 102 to implement client functions and/or other desired functions in the embodiments of the present invention described below. Various applications and various data, such as data used and/or generated by the applications, may also be stored on the computer-readable storage medium.
The input device 106 may be a device used by a user to input instructions and may include one or more of a keyboard, mouse, microphone, touch screen, and the like.
The output device 108 may output various information (e.g., images or sounds) to the outside (e.g., a user), and may include one or more of a display, a speaker, and the like.
The image acquisition device 110 is configured to acquire an image to be identified, where the image acquired by the image acquisition device is subjected to the text recognition method to obtain a character recognition result.
Example 2:
in accordance with an embodiment of the present invention, there is provided an embodiment of a text recognition method, it being noted that the steps shown in the flowchart of the figures may be performed in a computer system such as a set of computer executable instructions, and, although a logical order is shown in the flowchart, in some cases, the steps shown or described may be performed in an order other than that shown or described herein.
It should be noted that, in the present application, the method may be applied to a character recognition model, where the character recognition model includes: an attention model and a plurality of recurrent neural networks, the attention model is connected to each recurrent neural network, and the recurrent neural networks are connected in series, wherein the input data of some or all of the recurrent neural networks does not contain the output data of the preceding recurrent neural network connected to it.
Fig. 4 is a structural diagram of a character recognition model. As can be seen from fig. 4, the character recognition model includes an attention model (attention) and a plurality of recurrent neural networks, each a long short-term memory network (LSTM), connected in series. In fig. 4, h_t is the feature vector of the image to be recognized, α_t is the attention weight value output by the attention model, s_{t-1} is the output result of the recurrent neural network LSTM, and g_t is the result of multiplying the feature vector of the image to be recognized by the attention weight value.
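The multiplication that produces g_t can be sketched as follows (a minimal illustration in which α_t is a scalar per step; the function name is ours, the variable names mirror fig. 4):

```python
def weighted_feature(h_t, alpha_t):
    # g_t: the image feature vector h_t scaled by the attention weight α_t
    return [alpha_t * h for h in h_t]

g_t = weighted_feature([1.0, 2.0, 4.0], 0.5)
assert g_t == [0.5, 1.0, 2.0]
```

In practice α_t is a distribution over feature positions rather than a single scalar, but the element-wise weighting is the same.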
The inventors found that, in the character recognition model shown in fig. 4, if the output of each recurrent neural network (except the last one) is used as the input of the next recurrent neural network, the plurality of recurrent neural networks perform a certain amount of sequence modeling; that is, a language model is built. This way of processing between the recurrent neural networks makes the character recognition model more vocabulary-dependent. Therefore, in the present application, the input data of some or all of the plurality of recurrent neural networks is set to contain no output data of the preceding recurrent neural network connected to it. That is, for some or all of the recurrent neural networks, the output result of the preceding recurrent neural network is discarded.
Fig. 3 is a flowchart of a text recognition method according to an embodiment of the present invention. As shown in fig. 3, the method comprises the steps of:
And step S302, processing the feature vector of the image to be identified through the attention model to obtain the attention weight value of each cyclic neural network.
As shown in fig. 4, for the recurrent neural network LSTM2, the attention weight value is generated for LSTM2 by the attention model; the generation process can be described as follows:
obtain the output result s_{t-1} of LSTM1, the recurrent neural network preceding LSTM2, and the feature vector h_t of the image to be recognized; then process s_{t-1} and h_t to obtain the attention weight value α_t of LSTM2.
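The patent does not fix how s_{t-1} and h_t are "processed" into α_t; one common realisation, sketched here purely as an assumption, is a dot-product score followed by a softmax:

```python
import math

def attention_weights(s_prev, h_feats):
    # score each feature vector against the previous output s_{t-1}
    # with a dot product, then normalise with a softmax to get α_t
    scores = [sum(a * b for a, b in zip(s_prev, h)) for h in h_feats]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

alphas = attention_weights([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]])
assert abs(sum(alphas) - 1.0) < 1e-9   # the weights form a distribution
assert alphas[0] > alphas[1]           # the aligned feature gets more weight
```

Any scoring function producing a normalised weight per feature position would serve the same role in the model of fig. 4.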
Step S304, determining a target input parameter of each recurrent neural network, where the target input parameters include: the feature vector of the image to be recognized, or the feature vector of the image to be recognized and the character recognition result output by the recurrent neural network preceding the current one.
In this application, some or all of the plurality of recurrent neural networks are set to no longer receive the output result of the preceding recurrent neural network. Thus, it is necessary to determine a target input parameter for each recurrent neural network. For example, the target input parameter of LSTM1 is determined to be the feature vector of the image to be recognized; the target input parameter of LSTM2 is the feature vector of the image to be recognized; and the target input parameters of LSTM3 are the feature vector of the image to be recognized together with the character recognition result output by LSTM2.
Step S306, inputting the target input parameters and the attention weight values into each recurrent neural network for processing to obtain a character recognition result, and determining the character recognition result output by the last recurrent neural network as the character recognition result of the image to be recognized, where the character recognition result represents the probability that the character to be recognized belongs to each preset character.
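Steps S302 to S306 can be summarised as the following loop. This is a hypothetical sketch: `step_fn` stands in for one recurrent (LSTM) step, and the per-step flags encode which networks keep the feedback connection; none of these names come from the patent.

```python
def decode(feature_vec, feed_back_flags, step_fn):
    # run one recurrent step per flag; a True flag means the step also
    # receives the previous step's character recognition result,
    # a False flag means that result is discarded (the patent's scheme)
    prev = None
    for keep_feedback in feed_back_flags:
        prev = step_fn(feature_vec, prev if keep_feedback else None)
    # the last network's output is the recognition result of the image
    return prev

# toy step function: extend the previous partial result by one marker
out = decode("h_t", [False, False, True], lambda f, p: (p or "") + "x")
assert out == "xx"
```

Only the third step above sees its predecessor's output, mirroring the example where LSTM3 receives the result of LSTM2 while LSTM2 does not receive the result of LSTM1.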
As can be seen from the above description, existing scene text recognition models are easily affected by the training-set corpus. The inventors found that, because the output data of each recognition step of a scene text recognition model becomes the input of the next step, the model performs a certain amount of sequence modeling; that is, it implicitly builds a language model. This language-model structure causes the model to show a strong vocabulary dependency. Based on this, the present application proposes a character recognition method.
In the character recognition method provided by the embodiments of the invention, a character recognition model performs character recognition on the image to be recognized, and the input data of some or all of the recurrent neural networks in the model no longer includes the output data of the preceding recurrent neural network connected to them. This reduces the vocabulary dependency of the recurrent neural networks during character recognition, thereby alleviating the technical problem that existing scene character recognition models are easily influenced by the training-set corpus, resulting in low recognition accuracy.
In an optional embodiment of the present application, step S304, determining the target input parameters of each recurrent neural network, includes the following process:
First, judging whether a corresponding target probability has been preset for each recurrent neural network;
If a corresponding target probability has been preset for each recurrent neural network, continuing to judge whether the target probability is greater than or equal to a preset probability threshold; the target probability is used to determine whether the target input parameters contain the character recognition result output by the previous recurrent neural network. If no corresponding target probability has been preset, it is determined that the target input parameters do not contain the character recognition result output by the previous recurrent neural network.
Second, if the target probability is greater than or equal to the preset probability threshold, determining that the target input parameters of the recurrent neural network comprise the character recognition result output by the previous recurrent neural network and the feature vector of the image to be recognized.
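The decision just described can be expressed as a small helper; the function name and the tuple return shape are illustrative assumptions of this sketch:

```python
def target_input_params(feature_vec, prev_result, target_prob, threshold=0.2):
    """Select the target input parameters of one recurrent network.

    If no target probability is preset (None), or if it falls below the
    preset probability threshold, the previous network's character
    recognition result is dropped and only the feature vector is used.
    """
    if target_prob is not None and target_prob >= threshold:
        return (feature_vec, prev_result)
    return (feature_vec,)
```

With the threshold of 0.2 used in the text, a target probability of 0.5 keeps the previous recognition result, while 0.1 (or no preset probability at all) drops it.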
In this application, the preset probability threshold may be set to 0.2, but another threshold may also be used; the application imposes no particular limitation here, and the user may choose according to actual needs.
In this application, as shown in fig. 5, a corresponding target sequence rand(1, 0) may be set at the input position of each recurrent neural network except the first one.
Assume that the preset probability threshold is 0.2 and that, in fig. 5, the target probability corresponding to LSTM2 is 0.5. Since the target probability is greater than the preset threshold, the output result of LSTM1 is multiplied by the 1 in the target sequence rand, and the product is transmitted to the input of LSTM2.
Now assume that the preset probability threshold is 0.2 and that, in fig. 5, the target probability corresponding to LSTM2 is 0.1. Since the target probability is smaller than the preset threshold, the output result of LSTM1 is multiplied by the 0 in the target sequence rand, so the output of LSTM1 is not transmitted to the input of LSTM2.
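The rand(1, 0) gating described for fig. 5 amounts to a multiplicative mask on the previous LSTM's output; this sketch assumes the output is a plain list of floats:

```python
def gate_prev_output(prev_output, target_prob, threshold=0.2):
    # Pick 1 from the target sequence rand(1, 0) when the target probability
    # reaches the threshold, otherwise pick 0; multiplying by 0 blocks the
    # previous LSTM's output from reaching the next LSTM's input.
    mask = 1.0 if target_prob >= threshold else 0.0
    return [mask * x for x in prev_output]
```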
As is apparent from the above description, in the present application a corresponding target probability is preset for each recurrent neural network, and the output result (for example, an output character) of each recurrent neural network is then passed on as input to the next recurrent neural network only with a certain probability. This amounts to discarding characters of the training-set corpus, and thus mitigates the dependence of the plurality of recurrent neural networks on that corpus.
In an alternative embodiment, the target probability corresponding to each recurrent neural network may be determined in several ways, specifically including:
Mode one:
The target probability is randomly generated for each recurrent neural network by a probability generator.
In this mode, a probability generator may be preset, which randomly generates a corresponding target probability for each recurrent neural network in advance.
Alternatively, in the training stage of the character recognition model, the probability generator may randomly generate a corresponding initial probability for each recurrent neural network in advance. During training, the value of the initial probability is then adjusted so that the accuracy of the character recognition model meets a preset requirement, and the initial probability that meets this requirement is determined as the target probability.
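A probability generator of the kind described in mode one could be as simple as the following; the seeding and the uniform distribution are assumptions of this sketch:

```python
import random

def make_probability_generator(seed=None):
    """Return a generator of one target probability per recurrent network."""
    rng = random.Random(seed)
    def generate(num_networks):
        # draw each network's target probability uniformly from [0, 1)
        return [rng.random() for _ in range(num_networks)]
    return generate

gen = make_probability_generator(seed=0)
probs = gen(3)  # one target probability per recurrent network
```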
Mode two:
The target probability is generated for each recurrent neural network through a target neural network, where the input parameters of the target neural network include: the position information of each recurrent neural network within the plurality of recurrent neural networks, the attention weight value of each recurrent neural network, and the feature vector of the image to be recognized.
In this alternative embodiment, a target neural network may be preset whose output data are the target probabilities of the plurality of recurrent neural networks, and whose input data may be one or more of the following: the position information of each recurrent neural network within the plurality of recurrent neural networks, the attention weight value of each recurrent neural network, and the feature vector of the image to be recognized.
As shown in fig. 4, the plurality of recurrent neural networks are connected in sequence, and the position information of each recurrent neural network can be understood as its position within this sequence. For example, the position information of LSTM1 is "1", that of LSTM2 is "2", and so on. Recurrent neural networks at different positions differ in how they process data and in their importance; therefore, the position information may serve as an input to the target neural network.
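A minimal stand-in for the target neural network of mode two might be a single linear layer followed by a sigmoid, so that the generated probability lies in (0, 1); the layer size and the weights here are illustrative assumptions, not values from the patent:

```python
import math

def target_probability(position, attn_weight, feature_vec, weights, bias):
    # concatenate the inputs named in the text: the network's position index,
    # its attention weight value, and the feature vector
    x = [float(position), attn_weight] + list(feature_vec)
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid keeps the output in (0, 1)

p = target_probability(position=2, attn_weight=0.6, feature_vec=[0.1, 0.4],
                       weights=[0.2, 0.5, 0.3, 0.3], bias=-0.1)
```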
In another optional embodiment of the present application, if the input data of every first recurrent neural network among the plurality of first recurrent neural networks does not include the output data of the preceding first recurrent neural network connected to it, the character recognition model further includes a target language model. As shown in fig. 7, the target language model includes a plurality of second recurrent neural networks connected in series; the input data of each second recurrent neural network includes the output data of the previous second recurrent neural network connected to it, and the second recurrent neural networks are connected to the first recurrent neural networks in one-to-one correspondence.
In this application, the output result of each first recurrent neural network is not sent to the next first recurrent neural network. Based on this, a second recurrent neural network is connected after each first recurrent neural network, so that the plurality of second recurrent neural networks assist in training the plurality of first recurrent neural networks; the specific training process is described in the following embodiments.
Example 3:
According to the embodiment of the invention, an embodiment of a training method for a character recognition model is provided.
In this application, the character recognition model includes: an attention model and a plurality of first recurrent neural networks connected in series, the attention model being connected to each first recurrent neural network, where the input data of some or all of the first recurrent neural networks does not contain the output data of the previous first recurrent neural network connected to them. In this application, the recurrent neural network may be a long short-term memory network (LSTM).
FIG. 6 is a flow chart of a training method for a word recognition model according to an embodiment of the present invention. As shown in fig. 6, the method includes the steps of:
Step S602, processing the feature vectors of the corpus of the training set through the attention model to obtain the attention weight value of each first recurrent neural network.
As shown in fig. 4 or fig. 7, for the recurrent neural network LSTM2, the attention weight value is generated for LSTM2 by the attention model, and the specific generation process may be described as follows:
obtaining the output result s_{t-1} of LSTM1, the recurrent neural network preceding LSTM2; then obtaining the feature vector h_t; and processing the output result s_{t-1} together with the feature vector h_t to obtain the attention weight value α_t of the recurrent neural network LSTM2.
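The computation of α_t from s_{t-1} and the feature vectors can be sketched as below; the dot-product scoring function and the softmax normalization are standard attention assumptions, not details the text specifies:

```python
import math

def attention_weights(s_prev, features):
    """Compute attention weights alpha_t from the previous state s_{t-1}.

    features: list of feature vectors h_t; each is scored against s_prev
    with a dot product (an assumed scoring function), then the scores are
    softmax-normalized so that the weights sum to 1.
    """
    scores = [sum(a * b for a, b in zip(s_prev, h_t)) for h_t in features]
    m = max(scores)
    exp_scores = [math.exp(s - m) for s in scores]  # numerically stable softmax
    total = sum(exp_scores)
    return [e / total for e in exp_scores]

alphas = attention_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
```

The feature vector most aligned with the previous state receives the largest weight.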
Step S604, determining a target input parameter of each first recurrent neural network, where the target input parameter includes: the feature vector of the corpus of the training set, or the feature vector of the corpus of the training set together with the character recognition result output by the first recurrent neural network preceding the target first recurrent neural network.
In this application, some or all of the plurality of first recurrent neural networks are configured to no longer receive the output result of the previous first recurrent neural network. It is therefore necessary to determine the target input parameters of each first recurrent neural network. For example, the target input parameter of LSTM1 may be determined to be the feature vector of the corpus of the training set; the target input parameter of LSTM2, the feature vector of the corpus of the training set; and the target input parameter of LSTM3, the feature vector of the corpus of the training set together with the character recognition result output by the recurrent neural network LSTM2.
Step S606, training the character recognition model by using the target input parameters, the attention weight values, and the target label information to obtain the trained character recognition model, where the target label information is the actual character sequence contained in the corpus of the training set.
In the invention, the character recognition model performs character recognition on the image to be recognized, and the input data of some or all of the recurrent neural networks in the model does not contain the output data of the previous recurrent neural network connected to them. This reduces the vocabulary dependency of the recurrent neural networks during recognition, and thereby solves the technical problem that existing scene text recognition models are easily influenced by the corpus of the training set, resulting in low recognition accuracy.
In an optional embodiment of the present application, step S604, determining the target input parameters of each first recurrent neural network, includes the following process:
In the present application, it may first be judged whether a corresponding target probability has been preset for each first recurrent neural network. If so, it is further judged whether the target probability is greater than or equal to a preset probability threshold; the target probability is used to determine whether the target input parameters contain the character recognition result output by the previous first recurrent neural network. If no corresponding target probability has been preset, it is determined that the target input parameters do not contain the character recognition result output by the previous first recurrent neural network.
If the target probability is greater than or equal to the preset probability threshold, it is determined that the target input parameters of the target first recurrent neural network comprise the character recognition result output by the previous first recurrent neural network and the feature vector of the corpus of the training set.
In this application, the preset probability threshold may be set to 0.2, but another threshold may also be used; the application imposes no particular limitation here, and the user may choose according to actual needs.
In this application, as shown in fig. 5, a corresponding target sequence rand(1, 0) may be set at the input position of each first recurrent neural network except the first one.
Assume that the preset probability threshold is 0.2 and that, in fig. 5, the target probability corresponding to LSTM2 is 0.5. Since the target probability is greater than the preset threshold, the output result of LSTM1 is multiplied by the 1 in the target sequence rand, and the product is transmitted to the input of LSTM2.
Now assume that the preset probability threshold is 0.2 and that, in fig. 5, the target probability corresponding to LSTM2 is 0.1. Since the target probability is smaller than the preset threshold, the output result of LSTM1 is multiplied by the 0 in the target sequence rand, so the output of LSTM1 is not transmitted to the input of LSTM2.
In the present application, after the target input parameters of each first recurrent neural network have been determined in the above manner, the character recognition model may be trained using the target input parameters, the attention weight values, and the target label information, to obtain the trained character recognition model. In this application, the target label information may be understood as the actual character sequence contained in the corpus of the training set.
As is apparent from the above description, in the present application a corresponding target probability is preset for each first recurrent neural network, and the output result (for example, an output character) of each first recurrent neural network is then passed on as input to the next first recurrent neural network only with a certain probability. This amounts to discarding characters of the training-set corpus, and thus mitigates the dependence of the plurality of first recurrent neural networks on that corpus.
In an alternative embodiment, the target probability corresponding to each first recurrent neural network may be determined in several ways, specifically including:
Mode one:
The target probability is randomly generated for each first recurrent neural network by a probability generator.
In this mode, a probability generator may be preset, which randomly generates a corresponding target probability for each first recurrent neural network in advance.
Mode two:
The target probability is generated for each first recurrent neural network through a target neural network, where the input parameters of the target neural network include: the position information of the target first recurrent neural network within the plurality of first recurrent neural networks, the attention weight value of the target first recurrent neural network, and the feature vector of the corpus of the training set.
In this alternative embodiment, a target neural network may be preset whose output data are the target probabilities of the plurality of first recurrent neural networks, and whose input data may be one or more of the following: the position information of each first recurrent neural network within the plurality of first recurrent neural networks, the attention weight value of each first recurrent neural network, and the feature vector of the corpus of the training set.
As shown in fig. 4, the plurality of first recurrent neural networks are connected in sequence, and the position information of each first recurrent neural network can be understood as its position within this sequence. For example, the position information of LSTM1 is "1", that of LSTM2 is "2", and so on. Recurrent neural networks at different positions differ in how they process data and in their importance; therefore, the position information may serve as an input to the target neural network.
In another optional embodiment of the present application, if the input data of every first recurrent neural network among the plurality of first recurrent neural networks does not include the output data of the preceding first recurrent neural network connected to it, the character recognition model further includes a target language model. As shown in fig. 7, the target language model includes a plurality of second recurrent neural networks connected in series; the input data of each second recurrent neural network includes the output data of the previous second recurrent neural network connected to it, and the second recurrent neural networks are connected to the first recurrent neural networks in one-to-one correspondence.
In the present application, the output result of each first recurrent neural network is not sent to the next first recurrent neural network. Based on this, a second recurrent neural network is connected to each first recurrent neural network, so that the plurality of second recurrent neural networks assist in training the plurality of first recurrent neural networks.
As shown in fig. 7, the processing procedure of the character recognition model is described as follows:
For each of the plurality of first recurrent neural networks, the process is as follows:
First, the attention model acquires the feature vectors of the corpus of the training set, then acquires the output result of the previous first recurrent neural network, and determines the output result together with the feature vectors of the corpus as the input data of the current first recurrent neural network. The current first recurrent neural network then processes the input data to obtain an output (e.g., a character recognition result), and inputs that output into the attention model and the corresponding second recurrent neural network for processing.
For each of the plurality of second recurrent neural networks, the process is as follows:
The current second recurrent neural network obtains the output result of the previous second recurrent neural network and the output result of the first recurrent neural network connected to it; the two output results are processed to obtain the output result of the current second recurrent neural network, which is simultaneously transmitted to the next second recurrent neural network for processing.
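The second-branch recursion just described can be sketched as follows; the elementwise averaging stands in for what the second LSTM actually learns and is purely an assumption of this sketch:

```python
def second_network_step(prev_second_out, first_out):
    # combine the previous second network's output with the output of the
    # first network connected to the current second network (toy combination)
    return [(a + b) / 2.0 for a, b in zip(prev_second_out, first_out)]

def run_language_model(first_outputs, initial_state):
    """Chain the second recurrent networks over the first branch's outputs."""
    outputs, prev = [], initial_state
    for first_out in first_outputs:
        prev = second_network_step(prev, first_out)
        outputs.append(prev)
    return outputs

second_outputs = run_language_model([[1.0], [3.0]], initial_state=[0.0])
```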
Following the above process, in the present application the character recognition result output by the final first recurrent neural network among the plurality of first recurrent neural networks may first be acquired, giving a first output result. Then, the character recognition result output by the final second recurrent neural network among the plurality of second recurrent neural networks is acquired, giving a second output result. Next, a target loss value is calculated using the first output result and the second output result; finally, the character recognition model is trained with the target loss value.
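The target loss value might, for instance, combine a cross-entropy term from each branch; the weighted-sum combination below is an assumption of this sketch, since the text does not specify how the two output results are combined:

```python
import math

def cross_entropy(probs, label_index):
    # negative log-probability the branch assigns to the true character
    return -math.log(probs[label_index])

def target_loss(first_output, second_output, label_index, aux_weight=0.5):
    """Combine losses of the recognition branch and the language-model branch."""
    return (cross_entropy(first_output, label_index)
            + aux_weight * cross_entropy(second_output, label_index))

loss = target_loss([0.7, 0.2, 0.1], [0.6, 0.3, 0.1], label_index=0)
```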
From the above description, it can be seen that in the present application a sequence model (i.e., the plurality of first recurrent neural networks) is built separately, and this separately built sequence model focuses on image features, while the target language model serves as its aid. This weakens the language modeling carried by the outputs of the former branch (i.e., the plurality of first recurrent neural networks) and effectively alleviates the model's dependency on vocabulary.
Example 4:
The embodiment of the invention also provides a character recognition device, which is mainly used for executing the character recognition method provided by the above embodiments of the invention. The character recognition device provided by the embodiment of the invention is specifically introduced below.
Fig. 8 is a schematic diagram of a character recognition device according to an embodiment of the present invention. The device is applied to a character recognition model, and the character recognition model includes: an attention model and a plurality of recurrent neural networks connected in series, the attention model being connected to each recurrent neural network, where the input data of some or all of the recurrent neural networks does not contain the output data of the previous recurrent neural network connected to them.
As shown in fig. 8, the character recognition device mainly includes a first processing unit 81, a first determining unit 82, and a second processing unit 83, where:
a first processing unit 81, configured to process the feature vectors of an image to be recognized through the attention model to obtain the attention weight value of each recurrent neural network;
a first determining unit 82, configured to determine the target input parameters of each recurrent neural network, where the target input parameters include: the feature vector of the image to be recognized, or the feature vector of the image to be recognized together with the character recognition result output by the recurrent neural network preceding the current recurrent neural network;
and a second processing unit 83, configured to input the target input parameters and the attention weight value into each recurrent neural network for processing to obtain a character recognition result, and to determine the character recognition result output by the final recurrent neural network as the character recognition result of the image to be recognized, where the character recognition result represents the probability that the character to be recognized belongs to each preset character.
In the invention, the character recognition model performs character recognition on the image to be recognized, and the input data of some or all of the recurrent neural networks in the model does not contain the output data of the previous recurrent neural network connected to them. This reduces the vocabulary dependency of the recurrent neural networks during recognition, and thereby solves the technical problem that existing scene text recognition models are easily influenced by the corpus of the training set, resulting in low recognition accuracy.
Optionally, the first determining unit is configured to: if it is determined that a corresponding target probability has been preset for each recurrent neural network, judge whether the target probability is greater than or equal to a preset probability threshold, where the target probability is used to determine whether the target input parameters contain the character recognition result output by the previous recurrent neural network; and, if the target probability is greater than or equal to the preset probability threshold, determine that the target input parameters of each recurrent neural network comprise the character recognition result output by the previous recurrent neural network and the feature vector of the image to be recognized.
Optionally, the first determining unit is further configured to: randomly generate the target probability for each recurrent neural network by a probability generator; or generate the target probability for each recurrent neural network through a target neural network, where the input parameters of the target neural network include: the position information of each recurrent neural network within the plurality of recurrent neural networks, the attention weight value of each recurrent neural network, and the feature vector of the image to be recognized.
Optionally, the device is further configured such that, if the input data of every first recurrent neural network does not contain the output data of the previous first recurrent neural network connected to it, the character recognition model further includes a target language model; the target language model includes a plurality of second recurrent neural networks connected in series, the input data of each second recurrent neural network includes the output data of the previous second recurrent neural network connected to it, and the second recurrent neural networks are connected to the first recurrent neural networks in one-to-one correspondence.
Example 5:
Fig. 9 is a schematic diagram of a training device for a character recognition model according to an embodiment of the present invention. The character recognition model includes: an attention model and a plurality of first recurrent neural networks connected in series, the attention model being connected to each first recurrent neural network, where the input data of some or all of the first recurrent neural networks does not contain the output data of the previous first recurrent neural network connected to them.
As shown in fig. 9, the training device of the character recognition model mainly includes a third processing unit 91, a second determining unit 92, and a training unit 93, where:
a third processing unit 91, configured to process the feature vectors of the corpus of the training set through the attention model to obtain the attention weight value of each first recurrent neural network;
a second determining unit 92, configured to determine the target input parameters of each first recurrent neural network, where the target input parameters include: the feature vector of the corpus of the training set, or the feature vector of the corpus of the training set together with the character recognition result output by the first recurrent neural network preceding the target first recurrent neural network;
and a training unit 93, configured to train the character recognition model by using the target input parameters, the attention weight values, and target label information to obtain the trained character recognition model, where the target label information is the actual character sequence contained in the corpus of the training set.
In the invention, the character recognition model performs character recognition on the image to be recognized, and the input data of some or all of the recurrent neural networks in the model does not contain the output data of the previous recurrent neural network connected to them. This reduces the vocabulary dependency of the recurrent neural networks during recognition, and thereby solves the technical problem that existing scene text recognition models are easily influenced by the corpus of the training set, resulting in low recognition accuracy.
Optionally, the second determining unit is configured to: if it is determined that a corresponding target probability has been preset for each first recurrent neural network, judge whether the target probability is greater than or equal to a preset probability threshold, where the target probability is used to determine whether the target input parameters contain the character recognition result output by the previous first recurrent neural network; and, if the target probability is greater than or equal to the preset probability threshold, determine that the target input parameters of the target first recurrent neural network comprise the character recognition result output by the previous first recurrent neural network and the feature vector of the corpus of the training set.
Optionally, the second determining unit is configured to: randomly generate the target probability for each first recurrent neural network by a probability generator; or generate the target probability for each first recurrent neural network through a target neural network, where the input parameters of the target neural network include: the position information of the target first recurrent neural network within the plurality of first recurrent neural networks, the attention weight value of the target first recurrent neural network, and the feature vector of the corpus of the training set.
Optionally, if the input data of every first recurrent neural network among the plurality of first recurrent neural networks does not contain the output data of the previous first recurrent neural network connected to it, the character recognition model further includes a target language model; the target language model includes a plurality of second recurrent neural networks connected in series, the input data of each second recurrent neural network includes the output data of the previous second recurrent neural network connected to it, and the second recurrent neural networks are connected to the first recurrent neural networks in one-to-one correspondence.
Optionally, the device is further configured to: acquire the character recognition result output by the final first recurrent neural network among the plurality of first recurrent neural networks to obtain a first output result; acquire the character recognition result output by the final second recurrent neural network among the plurality of second recurrent neural networks to obtain a second output result; calculate a target loss value using the first output result and the second output result; and train the character recognition model with the target loss value.
Optionally, the recurrent neural network is a long short-term memory network (LSTM).
The device provided by the embodiment of the present invention has the same implementation principle and technical effects as those of the foregoing method embodiment, and for the sake of brevity, reference may be made to the corresponding content in the foregoing method embodiment where the device embodiment is not mentioned.
In addition, in the description of embodiments of the present invention, unless explicitly stated and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the division into units is merely a logical function division, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some communication interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a processor-executable non-volatile computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate the technical solutions of the present invention rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that any person skilled in the art may still modify the technical solutions described in the foregoing embodiments, easily conceive of changes thereto, or make equivalent substitutions of some of the technical features thereof, within the technical scope disclosed by the present invention; such modifications, changes or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall be included within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (14)

1. A method of character recognition, characterized by being applied to a character recognition model, the character recognition model comprising: an attention model and a plurality of recurrent neural networks, the attention model being connected with each recurrent neural network, and the plurality of recurrent neural networks being connected in series, wherein the input data of some or all of the recurrent neural networks does not contain the output data of the previous recurrent neural network connected thereto; the method comprises the following steps:
processing the feature vector of the image to be identified through the attention model to obtain the attention weight value of each cyclic neural network;
determining a target input parameter of each of the recurrent neural networks based on a target probability of each of the recurrent neural networks, wherein the target input parameter comprises: the feature vector of the image to be recognized, or the feature vector of the image to be recognized together with the character recognition result output by the previous recurrent neural network of the current recurrent neural network, and wherein the target probability is used to determine whether the target input parameter contains the character recognition result output by the previous recurrent neural network;
and inputting the target input parameters and the attention weight values into each cyclic neural network for processing to obtain a character recognition result, and determining the character recognition result output by the last cyclic neural network as the character recognition result of the image to be recognized, wherein the character recognition result represents the probability that the character to be recognized belongs to each preset character.
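The per-step decoding recited in claim 1 — attention weights computed from the feature vector, then a chain of recurrent cells whose input may or may not include the previous cell's recognition result — can be illustrated with a minimal numpy sketch. All dimensions and weights below are toy stand-ins, not the patent's actual networks:

```python
import numpy as np

rng = np.random.default_rng(0)

D, V, T = 8, 5, 4            # feature dim, preset character set size, chained cells
feat = rng.normal(size=D)    # feature vector of the image to be recognized

# Stand-in "attention model": one score per recurrent cell, softmax-normalized
# into an attention weight value for each cell.
scores = rng.normal(size=(T, D)) @ feat
att_weights = np.exp(scores - scores.max())
att_weights /= att_weights.sum()

W_in = rng.normal(size=(V, D))    # toy cell parameters (shared across cells)
W_prev = rng.normal(size=(V, V))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Gate per cell: whether the target input parameter also contains the
# previous cell's recognition result (decided by the target probability).
use_prev = [False, True, True, False]

prev_out = np.zeros(V)
for t in range(T):
    z = att_weights[t] * (W_in @ feat)      # always: attention-weighted feature
    if use_prev[t]:
        z = z + W_prev @ prev_out           # optionally: previous cell's result
    prev_out = softmax(z)                   # distribution over preset characters

result = prev_out   # output of the last cell = recognition result of the image
```

The `use_prev` gate is the point of the claim: cells where it is `False` decode from the image feature alone, so an error in one cell's output is not necessarily propagated to the next.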
2. The method of claim 1, wherein determining the target input parameter for each of the recurrent neural networks based on the target probability for each of the recurrent neural networks comprises:
if it is determined that a corresponding target probability is preset for each recurrent neural network, determining whether the target probability is greater than or equal to a preset probability threshold;
and if the target probability is greater than or equal to the preset probability threshold, determining that the target input parameter of each recurrent neural network comprises the character recognition result output by the previous recurrent neural network and the feature vector of the image to be recognized.
3. The method of claim 2, wherein determining a target probability for each recurrent neural network comprises:
randomly generating the target probability for each recurrent neural network by a probability generator;
or,
randomly generating the target probability for each recurrent neural network through a target neural network, wherein the input parameters of the target neural network comprise: the position information of each recurrent neural network among the plurality of recurrent neural networks, the attention weight value of each recurrent neural network, and the feature vector of the image to be recognized.
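Claims 2–3 describe a threshold test on a per-cell target probability that is either drawn from a plain probability generator or produced by a target neural network from the cell's position, attention weight, and the feature vector. A minimal numpy sketch under toy assumptions — the sigmoid scorer and all weights here are illustrative stand-ins, not the claimed network:

```python
import numpy as np

rng = np.random.default_rng(1)

D, T = 8, 4                       # feature dim, number of recurrent cells
THRESHOLD = 0.5                   # preset probability threshold
feat = rng.normal(size=D)         # feature vector of the image
att_weights = np.full(T, 1.0 / T) # attention weight value per cell (toy)

# Option 1: a plain probability generator — uniform random per cell.
p_random = rng.random(T)

# Option 2: a stand-in "target neural network" mapping (position, attention
# weight, feature vector) to a probability through a sigmoid.
w = rng.normal(size=D + 2)
def target_net(pos, a, f):
    x = np.concatenate(([pos / T], [a], f))
    return 1.0 / (1.0 + np.exp(-(w @ x)))

p_learned = np.array([target_net(t, att_weights[t], feat) for t in range(T)])

# Gate per claim 2: include the previous cell's recognition result in the
# target input parameter only when the target probability clears the threshold.
use_prev = p_learned >= THRESHOLD
```

Either source of probabilities feeds the same threshold test, so the two options are interchangeable at the gating step.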
4. The method of claim 1, wherein, if the input data of none of the plurality of first recurrent neural networks includes the output data of the previous first recurrent neural network connected thereto, the character recognition model further comprises: a target language model; the target language model comprises: a plurality of second recurrent neural networks connected in series, wherein the input data of each of the second recurrent neural networks includes the output data of the previous second recurrent neural network connected thereto, and the second recurrent neural networks are connected with the first recurrent neural networks in one-to-one correspondence.
5. A method for training a character recognition model, wherein the character recognition model comprises: an attention model and a plurality of first recurrent neural networks, the attention model being connected with each first recurrent neural network, and the plurality of first recurrent neural networks being connected in series, wherein the input data of some or all of the first recurrent neural networks does not contain the output data of the previous first recurrent neural network connected thereto; the method comprises the following steps:
Processing the feature vectors of the corpus of the training set through the attention model to obtain the attention weight value of each first cyclic neural network;
determining a target input parameter of each first recurrent neural network based on the target probability of each first recurrent neural network, wherein the target input parameter comprises: the feature vector of the training set corpus, or the feature vector of the training set corpus together with the character recognition result output by the previous first recurrent neural network of the target first recurrent neural network, and wherein the target probability is used to determine whether the target input parameter contains the character recognition result output by the previous first recurrent neural network;
training the character recognition model by using the target input parameters, the attention weight value and target label information to obtain the trained character recognition model, wherein the target label information is an actual character sequence contained in the training set corpus.
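The probability-gated use of the previous cell's output during training (claim 5) parallels the scheduled-sampling idea from the sequence-learning literature. The sketch below shows only the supervision step, with a cross-entropy loss against the target label information as one common choice — the claim does not fix the exact loss form, and all values are toy stand-ins:

```python
import numpy as np

rng = np.random.default_rng(2)

V, T = 5, 4               # preset character set size, sequence length
labels = [0, 2, 1, 4]     # actual character sequence in the training set corpus

# Toy per-step character distributions produced by the chained first cells
# (random points on the simplex, standing in for real model outputs).
preds = rng.dirichlet(np.ones(V), size=T)

# Cross-entropy against the target label information, averaged over steps.
loss = -np.mean([np.log(preds[t, labels[t]]) for t in range(T)])
```

Gradients of such a loss would then update the attention model and the recurrent cells jointly; that optimization loop is omitted here.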
6. The method of claim 5, wherein determining the target input parameter for each of the first recurrent neural networks based on the target probability for each of the first recurrent neural networks comprises:
if it is determined that a corresponding target probability is preset for each first recurrent neural network, determining whether the target probability is greater than or equal to a preset probability threshold;
and if the target probability is greater than or equal to the preset probability threshold, determining that the target input parameter of the target first recurrent neural network comprises the character recognition result output by the previous first recurrent neural network and the feature vector of the training set corpus.
7. The method of claim 6, wherein the method further comprises:
randomly generating the target probability for each first recurrent neural network by a probability generator;
or,
randomly generating the target probability for each first cyclic neural network through a target neural network, wherein input parameters of the target neural network comprise: the position information of the target first cyclic neural network in the plurality of first cyclic neural networks, the attention weight value of the target first cyclic neural network and the feature vector of the training set corpus.
8. The method of claim 5, wherein, if the input data of none of the plurality of first recurrent neural networks includes the output data of the previous first recurrent neural network connected thereto, the character recognition model further comprises: a target language model; the target language model comprises: a plurality of second recurrent neural networks connected in series, wherein the input data of each of the second recurrent neural networks includes the output data of the previous second recurrent neural network connected thereto, and the second recurrent neural networks are connected with the first recurrent neural networks in one-to-one correspondence.
9. The method of claim 8, wherein the method further comprises:
acquiring the character recognition result output by the last first recurrent neural network among the plurality of first recurrent neural networks to obtain a first output result;
acquiring the character recognition result output by the last second recurrent neural network among the plurality of second recurrent neural networks to obtain a second output result;
calculating a target loss value by using the first output result and the second output result;
and training the character recognition model through the target loss value.
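Claim 9 computes a target loss from the last outputs of the recognition branch (first recurrent networks) and the language-model branch (second recurrent networks). The claim does not state the loss form; a KL divergence pulling the two output distributions together is one plausible sketch, with illustrative values:

```python
import numpy as np

# Last-cell character distributions of the two branches (toy, normalized).
first_out = np.array([0.1, 0.6, 0.1, 0.1, 0.1])    # first recurrent networks
second_out = np.array([0.2, 0.5, 0.1, 0.1, 0.1])   # second (language model) networks

# One plausible target loss: KL(first || second), non-negative and zero only
# when the two branches agree exactly.
target_loss = np.sum(first_out * np.log(first_out / second_out))
```

Minimizing such a loss would encourage the image-driven branch to produce character sequences the language model also finds likely.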
10. The method according to any one of claims 5 to 9, wherein the recurrent neural network is a long short-term memory (LSTM) network.
11. A character recognition device, characterized by being applied to a character recognition model, the character recognition model comprising: an attention model and a plurality of recurrent neural networks, the attention model being connected with each recurrent neural network, and the plurality of recurrent neural networks being connected in series, wherein the input data of some or all of the recurrent neural networks does not contain the output data of the previous recurrent neural network connected thereto; the device comprises:
The first processing unit is used for processing the feature vector of the image to be identified through the attention model to obtain an attention weight value of each cyclic neural network;
a first determining unit, configured to determine a target input parameter of each recurrent neural network based on a target probability of each recurrent neural network, wherein the target input parameter comprises: the feature vector of the image to be recognized, or the feature vector of the image to be recognized together with the character recognition result output by the previous recurrent neural network of the current recurrent neural network, and wherein the target probability is used to determine whether the target input parameter contains the character recognition result output by the previous recurrent neural network;
the second processing unit is used for inputting the target input parameters and the attention weight values into each cyclic neural network for processing to obtain character recognition results, and determining the character recognition result output by the last cyclic neural network as the character recognition result of the image to be recognized, wherein the character recognition result represents the probability that the character to be recognized belongs to each preset character.
12. A training device for a character recognition model, wherein the character recognition model comprises: an attention model and a plurality of first recurrent neural networks, the attention model being connected with each first recurrent neural network, and the plurality of first recurrent neural networks being connected in series, wherein the input data of some or all of the first recurrent neural networks does not contain the output data of the previous first recurrent neural network connected thereto; the training device comprises:
The third processing unit is used for processing the feature vectors of the corpus of the training set through the attention model to obtain the attention weight value of each first cyclic neural network;
a second determining unit, configured to determine a target input parameter of each first recurrent neural network based on a target probability of each first recurrent neural network, wherein the target input parameter comprises: the feature vector of the training set corpus, or the feature vector of the training set corpus together with the character recognition result output by the previous first recurrent neural network of the target first recurrent neural network, and wherein the target probability is used to determine whether the target input parameter contains the character recognition result output by the previous first recurrent neural network;
the training unit is used for training the character recognition model by utilizing the target input parameters, the attention weight values and the target label information to obtain the trained character recognition model, wherein the target label information is an actual character sequence contained in the corpus of the training set.
13. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the character recognition method according to any one of claims 1 to 4, or of the training method of the character recognition model according to any one of claims 5 to 10.
14. A computer-readable medium having non-volatile program code executable by a processor, the program code causing the processor to perform the steps of the character recognition method according to any one of claims 1 to 4, or of the training method of the character recognition model according to any one of claims 5 to 10.
CN202011012497.0A 2020-09-23 2020-09-23 Character recognition, training method and device of character recognition model and electronic equipment Active CN112270316B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011012497.0A CN112270316B (en) 2020-09-23 2020-09-23 Character recognition, training method and device of character recognition model and electronic equipment


Publications (2)

Publication Number Publication Date
CN112270316A CN112270316A (en) 2021-01-26
CN112270316B true CN112270316B (en) 2023-06-20

Family

ID=74349208

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011012497.0A Active CN112270316B (en) 2020-09-23 2020-09-23 Character recognition, training method and device of character recognition model and electronic equipment

Country Status (1)

Country Link
CN (1) CN112270316B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109389091A (en) * 2018-10-22 2019-02-26 重庆邮电大学 The character identification system and method combined based on neural network and attention mechanism
EP3493119A1 (en) * 2017-12-04 2019-06-05 Samsung Electronics Co., Ltd. Language processing method and apparatus
CN110210480A (en) * 2019-06-05 2019-09-06 北京旷视科技有限公司 Character recognition method, device, electronic equipment and computer readable storage medium
CN111667066A (en) * 2020-04-23 2020-09-15 北京旷视科技有限公司 Network model training and character recognition method and device and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10650230B2 (en) * 2018-06-13 2020-05-12 Sap Se Image data extraction using neural networks

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3493119A1 (en) * 2017-12-04 2019-06-05 Samsung Electronics Co., Ltd. Language processing method and apparatus
CN109389091A (en) * 2018-10-22 2019-02-26 重庆邮电大学 The character identification system and method combined based on neural network and attention mechanism
CN110210480A (en) * 2019-06-05 2019-09-06 北京旷视科技有限公司 Character recognition method, device, electronic equipment and computer readable storage medium
CN111667066A (en) * 2020-04-23 2020-09-15 北京旷视科技有限公司 Network model training and character recognition method and device and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Joint Line Segmentation and Transcription for End-to-End Handwritten Paragraph Recognition;Theodore Bluche;《arXiv:1604.08352v1》;20160428;全文 *
LSTM逐层多目标优化及多层概率融合的图像描述;汤鹏杰等;《自动化学报》;20171211(第07期);全文 *

Also Published As

Publication number Publication date
CN112270316A (en) 2021-01-26

Similar Documents

Publication Publication Date Title
CN111625635B (en) Question-answering processing method, device, equipment and storage medium
CN107767870B (en) Punctuation mark adding method and device and computer equipment
US10332507B2 (en) Method and device for waking up via speech based on artificial intelligence
US20180101770A1 (en) Method and system of generative model learning, and program product
CN111667066B (en) Training method and device of network model, character recognition method and device and electronic equipment
CN111414946B (en) Artificial intelligence-based medical image noise data identification method and related device
CN111916061B (en) Voice endpoint detection method and device, readable storage medium and electronic equipment
JP2021108115A (en) Method and device for training machine reading comprehension model, electronic apparatus, and storage medium
CN116884391B (en) Multimode fusion audio generation method and device based on diffusion model
CN110895656B (en) Text similarity calculation method and device, electronic equipment and storage medium
CN113987269A (en) Digital human video generation method and device, electronic equipment and storage medium
CN116363261A (en) Training method of image editing model, image editing method and device
CN108875502B (en) Face recognition method and device
CN113886644A (en) Digital human video generation method and device, electronic equipment and storage medium
CN113094478A (en) Expression reply method, device, equipment and storage medium
CN116611491A (en) Training method and device of target detection model, electronic equipment and storage medium
CN115101069A (en) Voice control method, device, equipment, storage medium and program product
CN114639096A (en) Text recognition method and device, electronic equipment and storage medium
JP7270114B2 (en) Face keypoint detection method, device and electronic device
CN112270316B (en) Character recognition, training method and device of character recognition model and electronic equipment
CN110288668B (en) Image generation method, device, computer equipment and storage medium
CN117236340A (en) Question answering method, device, equipment and medium
CN115565186B (en) Training method and device for character recognition model, electronic equipment and storage medium
CN108897872B (en) Dialogue processing method, device, computer equipment and storage medium
CN113643706B (en) Speech recognition method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant