CN110210480B - Character recognition method and device, electronic equipment and computer readable storage medium - Google Patents


Info

Publication number
CN110210480B
CN110210480B (application CN201910488332.1A)
Authority
CN
China
Prior art keywords
probability
path
distribution
image
feature
Prior art date
Legal status
Active
Application number
CN201910488332.1A
Other languages
Chinese (zh)
Other versions
CN110210480A (en)
Inventor
万昭祎
刘毅博
谢锋明
姚聪
杨沐
Current Assignee
Beijing Kuangshi Technology Co Ltd
Original Assignee
Beijing Kuangshi Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Kuangshi Technology Co Ltd
Priority to CN201910488332.1A
Publication of CN110210480A
Application granted
Publication of CN110210480B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/60: Type of objects
    • G06V20/62: Text, e.g. of license plates, overlay texts or captions on TV images
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/14: Image acquisition
    • G06V30/148: Segmentation of character regions
    • G06V30/153: Segmentation of character regions using recognition of characters or words


Abstract

The invention provides a character recognition method and device, an electronic device, and a computer-readable storage medium. The method comprises: acquiring an image to be detected, and extracting feature information of the image to be detected with a full convolutional neural network trained with a two-dimensional CTC model to obtain first feature information, where the first feature information includes at least one of: a first character distribution probability, representing the probability that each feature point in the first two-dimensional spatial feature distribution of the image to be detected belongs to the first character sequence; a first path transition probability, representing the path selection probability in the height dimension of the first two-dimensional spatial feature distribution; and a first initial path probability, representing the probability that each feature point of the first two-dimensional spatial feature distribution is the starting feature point of the first path; and determining the first character sequence in the image to be detected by using the first feature information of the image to be detected. The method and device alleviate the technical problem of low sequence prediction accuracy caused by attention drift in existing image sequence recognition methods.

Description

Character recognition method and device, electronic equipment and computer readable storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a method and an apparatus for character recognition, an electronic device, and a computer-readable storage medium.
Background
Recognition of characters in a natural scene, hereinafter referred to as scene character recognition, refers to the technology of recognizing the character content in natural scene pictures by computer algorithms, and is widely applied in fields such as automatic driving, visual impairment assistance, and identity authentication. Unlike character recognition in scanned documents, character recognition in natural scenes faces far greater challenges: complex natural backgrounds, uncertain character directions and arrangements, and large color variations all make both the required recognition accuracy and the implementation difficulty far higher than for scanned documents.
In the prior art, a widely used image-based sequence recognition method is the attention-based model. In these attention models, a recurrent neural network with an attention mechanism is typically used to generate the sequence prediction: at each time step, the attention mechanism focuses on a character region and generates one character prediction. Models based on this framework are essentially per-frame output algorithms in which the attention mechanism provides the alignment between the feature representation and the sequence prediction. However, such models often face a serious attention-drift problem: since the output and hidden state of the previous step directly participate in computing the next prediction, an erroneous prediction early in the sequence often shifts the subsequent attention regions and thus causes a run of consecutive recognition errors.
Disclosure of Invention
In view of the above, the present invention provides a character recognition method and device, an electronic device, and a computer-readable storage medium, so as to alleviate the technical problem of low sequence prediction accuracy caused by attention drift in existing image sequence recognition methods.
In a first aspect, an embodiment of the present invention provides a character recognition method, including: acquiring an image to be detected, and extracting feature information of the image to be detected through a full convolutional neural network trained with a two-dimensional CTC model to obtain first feature information; wherein the first feature information includes at least one of: a first character distribution probability, a first path transition probability and a first initial path probability; the first character distribution probability is the probability that each feature point in the first two-dimensional spatial feature distribution of the image to be detected belongs to the first character sequence, and the first path transition probability represents the path selection probability in the height dimension of the first two-dimensional spatial feature distribution; the first initial path probability represents the probability that each feature point of the first two-dimensional spatial feature distribution is the starting feature point of a first path, the first path being a path predicted in the first two-dimensional spatial feature distribution that can be aligned to the first character sequence; and determining the first character sequence in the image to be detected by using the first feature information of the image to be detected.
Further, the full convolution neural network includes: the device comprises a first convolution network, a pyramid pooling module and a second convolution network.
Further, the first convolution network is a residual convolutional neural network, the residual convolutional neural network includes a plurality of convolution modules, and some of the convolution modules include a dilated (hole) convolution layer.
Further, extracting the feature information of the image to be detected by adopting a full convolution neural network after the two-dimensional CTC model training to obtain first feature information comprises the following steps: performing feature extraction on the image to be detected by using the first convolution network to obtain first convolution feature information; performing pooling calculation on the first convolution characteristic information by using the pyramid pooling module to obtain pooling characteristics of different scales, and performing cascade processing on the pooling characteristics of different scales to obtain pooling characteristic information; and carrying out convolution calculation on the pooled feature information by utilizing the second convolution network to obtain first feature information of the image to be detected.
Further, the method further comprises: acquiring a training sample image; extracting feature information of the training sample image through an initial full convolutional neural network to obtain second feature information; the second feature information includes at least one of: a second character distribution probability, a second path transition probability and a second initial path probability, wherein the second character distribution probability is the probability that each feature point in the second two-dimensional spatial feature distribution of the training sample image belongs to a character in the second character sequence, and the second path transition probability represents the path selection probability in the height dimension of the second two-dimensional spatial feature distribution; the second initial path probability represents the probability that each feature point of the second two-dimensional spatial feature distribution is the starting feature point of a second path, the second path being an effective path predicted in the second two-dimensional spatial feature distribution that can be aligned to the second character sequence; processing the second feature information of the training sample image by using the two-dimensional CTC model to obtain a target loss function; and training the initial full convolutional neural network through the target loss function to obtain the full convolutional neural network.
Further, the processing the second feature information of the training sample image by using the two-dimensional CTC model to obtain the target loss function includes: processing the second characteristic information by using the two-dimensional CTC model to obtain the conditional probability of a second path; determining the objective loss function based on the conditional probability of the second path.
Further, the processing of the second feature information by using the two-dimensional CTC model to obtain the conditional probability of the second path includes: calculating the target conditional probability β_{s,h,w} by combining a dynamic programming algorithm with the information in the second feature information, wherein β_{s,h,w} represents the sum of the probabilities of all sub-paths reaching the character at the s-th position of the second character sequence from position (h, w) of the second two-dimensional spatial feature distribution, the second two-dimensional spatial feature distribution being the spatial feature distribution of the training sample image; and calculating the conditional probability of the second path using the target conditional probability β_{s,h,w}.
Further, the step of calculating the target conditional probability by combining a dynamic programming algorithm with the information in the second feature information includes: calculating the target conditional probability β_{s,h,w} using a target formula, the target formula being expressed as:

$$\beta_{s,h,w}=\tilde p_{h,w}(Y^*_s)\cdot\sum_{j=1}^{H}\Psi_{j,w-1,h}\cdot\begin{cases}\beta_{s,j,w-1}+\beta_{s-1,j,w-1}, & \text{if } Y^*_s=\epsilon \text{ or } Y^*_s=Y^*_{s-2}\\ \beta_{s,j,w-1}+\beta_{s-1,j,w-1}+\beta_{s-2,j,w-1}, & \text{otherwise}\end{cases}$$

wherein Ψ_{j,w−1,h} represents the second path transition probability, i.e. the transition probability from the feature point (j, w−1) in the second two-dimensional spatial feature distribution to the feature point (h, w) in the second two-dimensional spatial feature distribution; j represents a height coordinate in the second two-dimensional spatial feature distribution; Y* and X′ represent the extended labeled character sequence and the second two-dimensional spatial feature distribution respectively; s represents the serial number of a character in Y*; h represents another height coordinate in the second two-dimensional spatial feature distribution, and w represents a width coordinate in the second two-dimensional spatial feature distribution; h ∈ [1, 2, …, H], w ∈ [1, 2, …, W−1], where H represents the height and W the width of the second two-dimensional spatial feature distribution; p̃_{h,w}(Y*_s) belongs to the second character distribution probability and represents the probability that the feature point at position (h, w) belongs to a character in the second character sequence; and Ψ_{j,0,h} is calculated from the second initial path probability Ψ_{j,−1,h}.
Further, determining the target loss function based on the conditional probability of the second path comprises: determining the target loss function as the negative logarithm of the conditional probability of the second path, i.e. Loss = −ln P(Y|X′).
In a second aspect, an embodiment of the present invention further provides a character recognition apparatus, including: an acquisition unit for acquiring an image to be detected; an extraction unit for extracting the feature information of the image to be detected through a full convolutional neural network trained with a two-dimensional CTC model to obtain first feature information; wherein the first feature information includes at least one of: a first character distribution probability, a first path transition probability and a first initial path probability; the first character distribution probability is the probability that each feature point in the first two-dimensional spatial feature distribution of the image to be detected belongs to the first character sequence, and the first path transition probability represents the path selection probability in the height dimension of the first two-dimensional spatial feature distribution; the first initial path probability represents the probability that each feature point of the first two-dimensional spatial feature distribution is the starting feature point of a first path, the first path being a path predicted in the first two-dimensional spatial feature distribution that can be aligned to the first character sequence; and a determining unit for determining the first character sequence in the image to be detected by using the first feature information of the image to be detected.
In a third aspect, an embodiment of the present invention further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method in any one of the above first aspects when executing the computer program.
In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, performs the steps of the method in any one of the above first aspects.
In the embodiment of the invention, an image to be detected is first acquired, and the feature information of the image to be detected is extracted with a full convolutional neural network trained with a two-dimensional CTC model to obtain first feature information, where the first feature information includes at least one of: a first character distribution probability, a first path transition probability and a first initial path probability; finally, the first character sequence in the image to be detected is determined by using the first feature information of the image to be detected. As described above, in the prior art, sequences in images are recognized by attention models, but such models usually face a serious attention-drift problem in which the subsequent attention regions shift and cause consecutive recognition errors. In the present application, by contrast, the two-dimensional CTC model retains the first feature information of the image during the training of the full convolutional neural network, and the character sequence is predicted directly from the first feature information. Retaining the first feature information of the image and predicting the character sequence from it improves the recognition accuracy of the full convolutional network, thereby alleviating the technical problem of low sequence prediction accuracy caused by attention drift in existing image sequence recognition methods.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method of text recognition according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a two-dimensional feature distribution according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a sub-distribution map of a two-dimensional feature distribution according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a full convolution neural network structure according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of predicted sequences according to an embodiment of the present invention;
fig. 7 is a schematic diagram of a character recognition apparatus according to an embodiment of the invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1:
first, an example electronic device 100 for implementing a text recognition method of an embodiment of the present invention is described with reference to fig. 1.
As shown in fig. 1, electronic device 100 includes one or more processors 102 and one or more memory devices 104. Optionally, the electronic device may also include an input device 106, an output device 108, and a camera 110, which are interconnected via a bus system 112 and/or other form of connection mechanism (not shown). It should be noted that the components and structure of the electronic device 100 shown in fig. 1 are exemplary only, and not limiting, and the electronic device may have other components and structures as desired.
The processor 102 may be implemented in at least one hardware form of a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), and an Application-Specific Integrated Circuit (ASIC), and the processor 102 may be a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or another form of processing unit having data processing capability and/or instruction execution capability, and may control other components in the electronic device 100 to perform desired functions.
The memory 104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory. The non-volatile memory may include, for example, Read-Only Memory (ROM), hard disks, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 102 to implement the client-side functionality (implemented by the processor) and/or other desired functionality in the embodiments of the invention described below. Various applications and various data, such as data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
The input device 106 may be a device used by a user to input instructions and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like.
The output device 108 may output various information (e.g., images or sounds) to the outside (e.g., a user), and may include one or more of a display, a speaker, and the like.
The camera 110 is configured to capture an image to be detected, which is then processed by the character recognition method to obtain the character sequence in the image; for example, the camera may capture an image (e.g., a photo or a video frame) desired by a user, and that image is then processed by the character recognition method to obtain the character sequence in it. The camera may further store the captured image in the memory 104 for use by other components.
Exemplary electronic devices for implementing the text recognition method according to embodiments of the present invention may be implemented on mobile terminals such as smart phones, tablet computers, and the like.
Example 2:
in accordance with an embodiment of the present invention, there is provided an embodiment of a text recognition method, it should be noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer executable instructions, and that while a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than here.
Fig. 2 is a flowchart of a text recognition method according to an embodiment of the present invention, as shown in fig. 2, the method includes the following steps:
step S202, acquiring an image to be detected.
In this embodiment, the image to be detected may be an image captured by the camera 110 in the electronic device described in the first embodiment, or may be an image received from another electronic device.
And step S204, extracting the characteristic information of the image to be detected by adopting a full convolution neural network after the two-dimensional CTC model training to obtain first characteristic information.
Wherein the first feature information includes at least one of: a first character distribution probability, a first path transition probability and a first initial path probability; the first character distribution probability is the probability that each feature point in the first two-dimensional spatial feature distribution of the image to be detected belongs to the first character sequence, and the first path transition probability represents the path selection probability in the height dimension of the first two-dimensional spatial feature distribution; the first initial path probability represents the probability that each feature point of the first two-dimensional spatial feature distribution is the starting feature point of the first path, the first path being a path predicted in the first two-dimensional spatial feature distribution that can be aligned to the first character sequence.
As can be seen from the above description, in the prior art, a character sequence in an image is recognized by an attention model. Beyond this, the inventors also conceived that Connectionist Temporal Classification (CTC) could be applied to the character recognition method. However, the CTC model was originally designed for speech recognition; since the speech signal to be recognized is one-dimensional, the processing formulas of the traditional CTC model can only handle one-dimensional signals similar to a speech signal. For image-based character recognition, a contradiction therefore arises between the two-dimensional features of the image and the one-dimensional distribution required by the CTC model, so applying the CTC model directly to character recognition may lose important features and introduce additional noise.
Based on this, in the present application, the inventor expands the traditional CTC model, and proposes a new CTC model (i.e., a two-dimensional CTC model), where the two-dimensional CTC model can process two-dimensional features of an image, so that the two-dimensional features of the image can be retained, and a full convolution neural network can predict a more accurate text sequence, where the two-dimensional features of the image can be represented as a two-dimensional matrix, and each vector in the matrix is used to represent feature information of each pixel point in the image.
As can be seen from the above description, in the present application, the feature extraction may be performed on the image to be detected through the full convolution neural network, so as to obtain the first feature information. Wherein the first characteristic information includes: a first character distribution probability, a first path transition probability, and a first initial path probability.
It should be noted that, in this embodiment, the first two-dimensional spatial feature distribution is a feature distribution of the image to be detected, and the first two-dimensional spatial feature distribution may be a distribution structure as shown in fig. 3. That is, in the present application, the two-dimensional spatial feature distribution of the image to be detected may be a feature distribution structure with a height H and a width W.
In this embodiment, the first character distribution probability represents the probability that each feature point in the first two-dimensional spatial feature distribution corresponds to a character in the first character sequence (for example, the probability value is set to 1 if it does, and 0 otherwise). The first path transition probability represents the path selection probability in the height dimension of the first two-dimensional spatial feature distribution; it may also be understood as representing the probability that each feature point in the first two-dimensional spatial feature distribution lies on the first path, the first path being a predicted path that can be aligned to the first character sequence. The first initial path probability represents the probability that each feature point of the first two-dimensional spatial feature distribution is the starting feature point of the first path; it may also be understood as the value at the leftmost position of the per-feature-point character distribution probabilities in the first two-dimensional spatial feature distribution.
Step S206, determining the first character sequence in the image to be detected by using the first characteristic information of the image to be detected.
In this embodiment, after the first feature information is determined, a text sequence (i.e., a first text sequence) included in the image to be detected may be determined by combining the first feature information.
In the present embodiment, the conditional probability P(Y|X) may be calculated from the first feature information:

$$P(Y\mid X)=\sum_{\pi\in A_{X,Y}}\ \prod_{t=1}^{T}p_t(\pi_t\mid X)$$

where A_{X,Y} is the set of all possible paths that align to the labeled sequence Y under the predicted distribution X, and T is the length of X. A path with the maximum probability can then be found by greedy search or beam search, and the sequence determined from that path is the first character sequence. Greedy search approximates the most probable path by taking the locally most probable character at each step:

$$\pi^*=\underset{\pi}{\arg\max}\ \prod_{t=1}^{T}p_t(\pi_t\mid X)$$

It should be noted that in this embodiment $\prod_{t=1}^{T}p_t(\pi_t\mid X)$ represents the product of the probabilities of all characters on a single path π, while the conditional probability P(Y|X) is the sum of these probability products over the entire path set A_{X,Y}.
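As an illustration of this decoding step, the following is a minimal Python sketch of greedy search over the predicted two-dimensional distribution; the array names, shapes, and the use of class 0 as the blank ε are assumptions made for the example, not details fixed by the patent.

```python
import numpy as np

def greedy_decode_2d(char_probs, trans_probs, start_probs, blank=0):
    """Greedy search over a two-dimensional prediction distribution.

    char_probs : (C+1, H, W) probability of each class at each feature point
    trans_probs: (H, W-1, H) trans_probs[h, w, h2] = P((h, w) -> (h2, w+1))
    start_probs: (H,)        probability that the path starts at each height
    """
    _, H, W = char_probs.shape
    h = int(np.argmax(start_probs))                      # most likely start height
    raw = []
    for w in range(W):
        raw.append(int(np.argmax(char_probs[:, h, w])))  # best class at (h, w)
        if w < W - 1:
            h = int(np.argmax(trans_probs[h, w]))        # best next height
    # collapse the path: drop repeats, then drop blanks
    decoded, prev = [], None
    for c in raw:
        if c != prev and c != blank:
            decoded.append(c)
        prev = c
    return decoded
```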
As can be seen from the above description, in the prior art, the sequence in the image is identified by the attention model, but such a model usually faces the problem of more serious attention shift, which leads to the shift of the subsequent attention area and thus continuous false identification. However, in the present application, the two-dimensional CTC model is selected to retain the first feature information of the image during the training of the full-convolutional neural network, and directly predict the text sequence based on the first feature information. The two-dimensional CTC model reserves the first characteristic information of the image, and improves the identification precision of the full convolution network by using the mode of predicting the character sequence by the first characteristic information, thereby relieving the technical problem of low sequence prediction precision caused by attention deviation in the existing image sequence identification method.
Further, the inventors thought that it is possible to identify sequences in an image in conjunction with a CTC model, however, the processing formulas of conventional CTC models can only process one-dimensional signals. Based on this, in the application, the traditional CTC model is expanded, and the two-dimensional characteristics of the image are processed through the expanded two-dimensional CTC model, so that the two-dimensional characteristics of the image can be reserved, and a more accurate character sequence can be predicted by a full convolution neural network.
As can be seen from the above description, in the present application, the feature information of the image to be detected is extracted through the full convolution neural network.
In an alternative embodiment, the full convolution neural network includes: the device comprises a first convolution network, a pyramid pooling module and a second convolution network. In this embodiment, the full convolution neural network is a pyramid-like structure.
In the present application, the first convolution network may be a multi-layer residual convolutional neural network, for example a 50-layer residual convolutional neural network. The multi-layer residual convolutional neural network comprises a plurality of convolution modules, some of which include dilated convolution layers.
It should be noted that, in this embodiment, the multi-layer residual convolutional neural network includes convolution modules in multiple stages, and some of those stages include a dilated convolution layer. Optionally, the dilated convolution layers may be placed in the convolution modules of the last two stages. Alternatively, the dilated convolution layer may be placed in the convolution module of another stage, which is not specifically limited in this embodiment.
FIG. 5 is a schematic block diagram of an alternative full convolution neural network. In the full convolution neural network shown in fig. 5, an image to be detected sequentially passes through a first convolution network (i.e., the multilayer residual convolution neural network shown in the figure), a pyramid pooling module and a second convolution network, and finally obtains feature information of the image to be detected, i.e., first feature information.
As shown in fig. 5, in the present embodiment, the first convolution network is a multi-layer residual convolutional neural network (e.g., a 50-layer residual convolutional neural network) comprising convolution modules in 5 stages. In this embodiment, the convolution modules in the fourth and fifth stages may use dilated convolution to prevent the resolution of the feature representation of the image to be detected from decreasing too quickly. After the several stages of convolution modules, the feature representation of the image to be detected has obtained a sufficient receptive field. Like most segmentation models, the full convolutional neural network uses a pyramid-like structure: after the last convolution layer, the feature representation of the image to be detected is average-pooled to several different sizes, and the features of the different scales are then concatenated, so that a unified feature is obtained through a shared convolution operation. From the obtained feature, each of the three different outputs is produced by a 3x3 convolution layer followed by a 1x1 convolution layer.
It should be noted that, in this embodiment, the second convolutional network may include two convolution layers, whose kernels may be chosen as a 3x3 kernel and a 1x1 kernel respectively; kernels of other sizes may also be chosen, which is not specifically limited in this embodiment.
Based on this, in this embodiment, in step S204, extracting the feature information of the image to be detected by using the full convolution neural network after the two-dimensional CTC model training to obtain the first feature information includes the following steps:
step S2041, utilizing the first convolution network to perform feature extraction on the image to be detected to obtain first convolution feature information;
step S2042, performing pooling calculation on the first convolution feature information by using the pyramid pooling module to obtain pooling features of different scales, and performing cascade processing on the pooling features of different scales to obtain pooling feature information;
step S2043, carrying out convolution calculation on the pooled feature information by utilizing the second convolution network to obtain first feature information of the image to be detected.
Specifically, in this embodiment, the 50-layer residual convolutional neural network in the full convolutional neural network shown in fig. 5 may be used to perform feature extraction on the image to be detected to obtain the first convolution feature information. Since dilated convolutions are placed in the 4th and 5th stages of the 50-layer residual convolutional neural network, they prevent the resolution of the feature representation of the image to be detected from decreasing too quickly, so that the feature representation obtains a sufficient receptive field.
After the first convolution characteristic information is obtained by using the 50 layers of residual convolution neural networks, pooling calculation can be performed on the first convolution characteristic information by using the pyramid pooling module, and the obtained pooling characteristic is a multi-scale characteristic. After the multi-scale pooling features are obtained, cascading processing can be performed on the pooling features of each scale to obtain pooling feature information.
After the pooled feature information is obtained, the second convolution network can be used to perform convolution calculation on it to obtain the first feature information of the image to be detected. If the second convolutional network comprises two convolution layers (namely a 3x3 convolution layer and a 1x1 convolution layer), the pooled feature information can be sequentially convolved with the 3x3 convolution layer and the 1x1 convolution layer to obtain the first feature information of the image to be detected.
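To make the pipeline of steps S2041 to S2043 concrete, here is a sketch of such a network in PyTorch; the channel widths, pooling sizes, the five-height window of the transition head, and the use of torchvision's ResNet-50 with dilation in the last two stages are illustrative assumptions, not the patent's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class PyramidPooling(nn.Module):
    """Step S2042: average-pool the features to several scales and concatenate."""
    def __init__(self, in_ch, sizes=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(s),
                          nn.Conv2d(in_ch, in_ch // len(sizes), 1))
            for s in sizes)

    def forward(self, x):
        h, w = x.shape[2:]
        pooled = [F.interpolate(stage(x), size=(h, w), mode='bilinear',
                                align_corners=False) for stage in self.stages]
        return torch.cat([x] + pooled, dim=1)      # cascade of multi-scale features

class TwoDimCTCNet(nn.Module):
    def __init__(self, num_classes):               # num_classes includes the blank
        super().__init__()
        # step S2041: residual backbone, dilated in the last two stages
        backbone = resnet50(replace_stride_with_dilation=[False, True, True])
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.ppm = PyramidPooling(2048)             # 2048 + 4*512 = 4096 channels out

        def head(out_ch):                           # step S2043: 3x3 conv then 1x1 conv
            return nn.Sequential(nn.Conv2d(4096, 256, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(256, out_ch, 1))
        self.char_head = head(num_classes)          # character distribution probability
        self.trans_head = head(5)                   # transition scores to 5 nearby heights
        self.start_head = head(1)                   # initial path probability

    def forward(self, img):
        f = self.ppm(self.features(img))
        return (self.char_head(f).softmax(dim=1),   # (N, C+1, H, W)
                self.trans_head(f),                 # (N, 5, H, W) raw transition scores
                self.start_head(f))                 # (N, 1, H, W)
```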
In this embodiment, before extracting the feature information of the image to be detected by using the full convolution neural network after the two-dimensional CTC model training, the initial full convolution neural network may also be trained by using the two-dimensional CTC model to obtain the full convolution neural network described in step S204.
Before describing the training process of the initial full convolutional neural network, the traditional one-dimensional CTC is first described. The traditional one-dimensional CTC model introduces a blank symbol ε to describe blanks in the sequence, and aligns the predicted sequence with the annotated sequence by removing blanks and repetitions from the predicted sequence. The predicted sequence is the per-frame sequence predicted from the image that may collapse to the character sequence. In fig. 6, each row is a predicted sequence, and the symbol "□" stands for ε; this convention is kept in the subsequent embodiments. As shown in fig. 6, the predicted sequences in rows 1, 3 and 4 can be correctly aligned to the target sequence "FREE", while the predicted sequence in row 2 cannot. For a given position i in a predicted sequence, position i can be skipped if and only if the prediction at i is ε or is the same as the previous prediction. For example, for the first predicted sequence "F□RE□EEE" in fig. 6, if i is the 2nd character "□", then the 2nd character can be skipped during alignment because "□" stands for ε. For another example, if i is the 7th character "E" of the predicted sequence "F□RE□EEE", then the 7th character can be skipped during alignment because it is the same as the 6th character "E"; similarly, the 8th character is the same as the 7th and can also be skipped. The alignment result of the predicted sequence "F□RE□EEE" is therefore "FREE". When all skippable positions in the prediction are removed, the aligned predicted sequence is obtained.
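This skip-and-collapse rule can be written down directly; the following Python sketch (the function name and the use of 'ε' for the blank are assumptions) reproduces the "F□RE□EEE" to "FREE" example:

```python
def collapse(pred, blank='ε'):
    """Remove every skippable position: a position is skipped iff it holds the
    blank or repeats the previous prediction, which aligns the predicted
    sequence to the target sequence."""
    out, prev = [], None
    for c in pred:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return ''.join(out)

assert collapse('FεREεEEE') == 'FREE'   # the first row of fig. 6, with ε for □
```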
As described above, the CTC model measures the similarity between the annotated sequence and the predicted sequence by calculating the conditional probability of the annotation under the predicted distribution. By definition, this conditional probability is:

$$P(Y\mid X)=\sum_{A\in A_{X,Y}}\ \prod_{t=1}^{T}p_t(a_t\mid X)$$

Specifically, Y and X are the annotated sequence and the predicted distribution respectively, A_{X,Y} is the set of all possible paths that align to the annotated sequence Y under the predicted distribution X, and T is the length of X. Since the number of possible paths is extremely large, computing the probability of every path and summing them is very inefficient; embodiments of the present application therefore use dynamic programming to solve this type of problem recursively.
First, since inserting ε before and after each symbol does not change the sequence a path aligns to, the target sequence Y is extended as follows to make the description clearer: Y* = [ε, y_1, ε, y_2, ε, …, y_L, ε]. Here Y* is the extended target sequence, that is, an ε is inserted before and after each symbol, extending the target sequence Y of original length L to Y* of length 2L+1.
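A small helper illustrating this extension (the function name and the 'ε' symbol are assumptions for the example):

```python
def expand(y, blank='ε'):
    """Insert the blank before and after every symbol, so a target of
    length L becomes Y* of length 2L + 1."""
    out = [blank]
    for c in y:
        out += [c, blank]
    return out

assert ''.join(expand('FREE')) == 'εFεRεEεEε'   # L = 4 -> 2L + 1 = 9 symbols
```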
For a given s ∈ [1, 2, …, 2L+1], let Y*[1:s] be the first s characters of Y*, and define α_{s,t} as the probability of Y*[1:s] at time t, which represents the sum of the probabilities of all feasible sub-paths that reach the s-th position of the sequence Y* at time t.
Thus, for the case where the (s−1)-th symbol cannot be skipped, i.e. Y*_s = ε or Y*_s = Y*_{s−2}, α_{s,t} satisfies the following formula:

$$\alpha_{s,t}=\left(\alpha_{s-1,t-1}+\alpha_{s,t-1}\right)\cdot p_t(Y^*_s)$$

For the other case, where the (s−1)-th symbol can be skipped, i.e. Y*_s ≠ ε and Y*_s ≠ Y*_{s−2}, α_{s,t} can be calculated by the following formula:

$$\alpha_{s,t}=\left(\alpha_{s-2,t-1}+\alpha_{s-1,t-1}+\alpha_{s,t-1}\right)\cdot p_t(Y^*_s)$$

wherein Y*_s represents the s-th character in the extended target sequence, and Y*_{s−2} represents the (s−2)-th character in the extended target sequence.
To summarize, the dynamic programming state transition equation of the CTC model may be expressed as follows:

$$\alpha_{s,t}=\begin{cases}\left(\alpha_{s-1,t-1}+\alpha_{s,t-1}\right)\cdot p_t(Y^*_s), & \text{if } Y^*_s=\epsilon \text{ or } Y^*_s=Y^*_{s-2}\\ \left(\alpha_{s-2,t-1}+\alpha_{s-1,t-1}+\alpha_{s,t-1}\right)\cdot p_t(Y^*_s), & \text{otherwise}\end{cases}$$
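The recursion can be implemented directly; below is a NumPy sketch of the forward (alpha) computation under the stated transition equation (the array layout and 0-based indexing are choices made for the example):

```python
import numpy as np

def ctc_forward_1d(probs, labels, blank=0):
    """Forward (alpha) recursion of the one-dimensional CTC model.

    probs : (T, C) per-frame class probabilities
    labels: target label ids of length L, without blanks
    Returns P(Y|X), the sum of alpha over the last two states at t = T-1.
    """
    T = probs.shape[0]
    ys = [blank]
    for c in labels:
        ys += [c, blank]               # extended sequence Y*, length S = 2L + 1
    S = len(ys)
    alpha = np.zeros((S, T))
    alpha[0, 0] = probs[0, ys[0]]      # a path may start with the blank...
    alpha[1, 0] = probs[0, ys[1]]      # ...or with the first character
    for t in range(1, T):
        for s in range(S):
            a = alpha[s, t - 1]                        # stay on the same state
            if s >= 1:
                a += alpha[s - 1, t - 1]               # advance by one state
            if s >= 2 and ys[s] != blank and ys[s] != ys[s - 2]:
                a += alpha[s - 2, t - 1]               # skip over one blank
            alpha[s, t] = a * probs[t, ys[s]]
    return alpha[S - 1, T - 1] + alpha[S - 2, T - 1]
```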
based on a traditional one-dimensional CTC model, embodiments provided herein extend the one-dimensional CTC model in the height dimension. Similarly, for a given two-dimensional distribution X', whose height information and width information are H and W, respectively, a path transition probability ψ ∈ R is definedH×(W-1)×H. Probability psi of path transitionh,w,h'Represents the path transition probability from position (H, w) of the prediction distribution to position (H ', w +1), where H, H' ∈ [1,2, … H],w∈[1,2,…,W-1]。
The two-dimensional spatial feature distribution shown in fig. 3 is taken as an example. Fig. 3 shows a spatial feature distribution of size Q×H×W; any of its H×W sub-distribution maps has the form shown in fig. 4. Assuming that the coordinates (h, w) are the position marked "1" in fig. 4, the coordinates (h′, w+1) are the positions marked "2", "3", "4" and "5" in fig. 4.
It is thus easy to obtain:

$$\sum_{h'=1}^{H}\psi_{h,w,h'}=1$$

This formula states that the path transition probabilities from a position in the predicted distribution to all heights of the next column sum to 1. Accordingly, as can be seen from fig. 4, the path transition probabilities from the position marked "1" to the positions marked "2", "3", "4" and "5" sum to 1.
Similar to the one-dimensional CTC, the same extension of the target sequence yields the extended target sequence Y*. The state transition equation of the two-dimensional CTC model can then be derived using a similar derivation process:
$$\beta_{s,h,w}=\tilde p_{h,w}(Y^*_s)\cdot\sum_{j=1}^{H}\Psi_{j,w-1,h}\cdot\begin{cases}\beta_{s,j,w-1}+\beta_{s-1,j,w-1}, & \text{if } Y^*_s=\epsilon \text{ or } Y^*_s=Y^*_{s-2}\\ \beta_{s,j,w-1}+\beta_{s-1,j,w-1}+\beta_{s-2,j,w-1}, & \text{otherwise}\end{cases}$$

Specifically, Ψ_{j,w−1,h} represents the second path transition probability, i.e. the transition probability from the feature point (j, w−1) in the second two-dimensional spatial feature distribution to the feature point (h, w) in the second two-dimensional spatial feature distribution; j represents a height coordinate in the second two-dimensional spatial feature distribution; Y* and X′ represent the extended labeled character sequence and the second two-dimensional spatial feature distribution respectively; s represents the serial number of a character in Y*; h represents another height coordinate and w a width coordinate in the second two-dimensional spatial feature distribution; h ∈ [1, 2, …, H], w ∈ [1, 2, …, W−1], where H and W represent the height and width of the second two-dimensional spatial feature distribution; p̃_{h,w}(Y*_s) belongs to the second character distribution probability and represents the probability that the feature point at position (h, w) belongs to the character Y*_s in the second character sequence; and β_{s,h,w} represents the sum of the probabilities of all sub-paths reaching the character at the s-th position of the second character sequence at position (h, w) of the second two-dimensional spatial feature distribution.
Finally, since the two-dimensional CTC model has H possible starting points in the height dimension of the two-dimensional spatial feature distribution, the starting state of β can be defined as:

$$\beta_{0,h,0}=\Gamma_h,\qquad\beta_{s,h,0}=0\ \ (s\geq 1)$$

where Γ ∈ R^H and

$$\sum_{h=1}^{H}\Gamma_h=1$$

with R^H denoting an H-dimensional vector over the real number field.
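Putting the transition equation and the starting state together, the following NumPy sketch computes the beta recursion and the resulting sequence probability; the initialization over the first column and the final summation over heights follow the one-dimensional case and are assumptions where the original formulas are image placeholders.

```python
import numpy as np

def ctc_forward_2d(char_probs, psi, gamma, labels, blank=0):
    """Forward (beta) recursion of the two-dimensional CTC model.

    char_probs: (C+1, H, W) p~, per-position class probabilities
    psi       : (H, W-1, H) psi[j, w, h] = P((j, w) -> (h, w+1)), sums to 1 over h
    gamma     : (H,)        starting-height distribution, sums to 1
    labels    : target label ids of length L, without blanks
    """
    _, H, W = char_probs.shape
    ys = [blank]
    for c in labels:
        ys += [c, blank]                       # extended sequence Y*, S = 2L + 1
    S = len(ys)
    beta = np.zeros((S, H, W))
    # first column: a path may start at any of the H heights
    beta[0, :, 0] = gamma * char_probs[ys[0], :, 0]
    beta[1, :, 0] = gamma * char_probs[ys[1], :, 0]
    for w in range(1, W):
        for s in range(S):
            prev = beta[s, :, w - 1].copy()
            if s >= 1:
                prev += beta[s - 1, :, w - 1]
            if s >= 2 and ys[s] != blank and ys[s] != ys[s - 2]:
                prev += beta[s - 2, :, w - 1]
            for h in range(H):                 # sum over source heights j
                beta[s, h, w] = char_probs[ys[s], h, w] * prev.dot(psi[:, w - 1, h])
    p = beta[S - 1, :, W - 1].sum() + beta[S - 2, :, W - 1].sum()
    return p                                   # training uses Loss = -log(p)
```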
Through the above process, the two-dimensional CTC model allows an initial full convolutional neural network to be trained end-to-end from sequence labels. In the testing stage, the path with the maximum probability can be found in a manner similar to the one-dimensional CTC model, that is, by means of a greedy algorithm or a beam search, where the process of finding the path with the maximum probability is the process of finding the second character sequence.
Based on the above description, in the present embodiment, the process of training the initial full convolution neural network is described as follows:
step S301, acquiring a training sample image;
step S302, extracting the characteristic information of the training sample image through an initial full convolution neural network to obtain second characteristic information;
step S303, processing second characteristic information of the training sample image by using the two-dimensional CTC model to obtain a target loss function;
step S304, training the initial full convolution neural network through the target loss function to obtain the full convolution neural network.
Specifically, in this embodiment, when training the initial full convolution neural network, first a training sample image is obtained, and then the feature information of the training sample image is extracted through the initial full convolution neural network to obtain second feature information. The second characteristic information also includes: a second character distribution probability, a second path transition probability, and a second initial path probability.
The second character distribution probability is the probability that each feature point in the second two-dimensional spatial feature distribution of the training sample image belongs to a character in the second character sequence, and the second path transition probability represents the path selection probability in the height dimension of the second two-dimensional spatial feature distribution; the second initial path probability represents the probability that each feature point of the second two-dimensional spatial feature distribution is the starting feature point of the second path, the second path being an effective path predicted in the second two-dimensional spatial feature distribution that can be aligned to the second character sequence.
After the second feature information is obtained as described above, the second feature information of the training sample image may be calculated by using the two-dimensional CTC model to obtain the target loss function. The initial fully convolutional neural network may then be trained through the target loss function, resulting in the fully convolutional neural network described in step S204.
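For orientation, a hedged sketch of one training step built on the pieces above; TwoDimCTCNet is the earlier sketch and two_dim_ctc_loss is a hypothetical wrapper that evaluates -ln P(Y|X') via the beta recursion, neither being the patent's actual code.

```python
import torch

model = TwoDimCTCNet(num_classes=37)    # e.g. 26 letters + 10 digits + blank (assumed)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, labels):
    char, trans, start = model(images)
    # hypothetical helper: differentiable beta recursion returning -ln P(Y|X')
    loss = two_dim_ctc_loss(char, trans, start, labels)
    optimizer.zero_grad()
    loss.backward()                     # end-to-end training through the target loss
    optimizer.step()
    return loss.item()
```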
In an alternative embodiment, the processing of the second feature information of the training sample image by using the two-dimensional CTC model to obtain the target loss function may specifically include the following steps:
firstly, processing the second characteristic information by using the two-dimensional CTC model to obtain the conditional probability of a second path; the second path is an effective path which is predicted by the initial full convolution neural network in the two-dimensional space feature distribution of the training sample image and can be aligned to a second character sequence in the training sample image.
When calculating the conditional probability of the second path, the target conditional probability β_{s,h,w} may first be calculated by combining a dynamic programming algorithm with the information in the second feature information, where β_{s,h,w} is the sum of the probabilities of all sub-paths reaching the character at the s-th position of the second character sequence from position (h, w) of the second two-dimensional spatial feature distribution, and the second two-dimensional spatial feature distribution is the spatial feature distribution of the training sample image.
Specifically, the process of calculating the target conditional probability β_{s,h,w} can be described as: calculating the target conditional probability β_{s,h,w} using a target formula, the target formula being expressed as:

$$\beta_{s,h,w}=\tilde p_{h,w}(Y^*_s)\cdot\sum_{j=1}^{H}\Psi_{j,w-1,h}\cdot\begin{cases}\beta_{s,j,w-1}+\beta_{s-1,j,w-1}, & \text{if } Y^*_s=\epsilon \text{ or } Y^*_s=Y^*_{s-2}\\ \beta_{s,j,w-1}+\beta_{s-1,j,w-1}+\beta_{s-2,j,w-1}, & \text{otherwise}\end{cases}$$

Having obtained the target conditional probability β_{s,h,w}, the conditional probability of the second path can then be calculated from it. The symbols have the meanings given above: Ψ_{j,w−1,h} is the second path transition probability from the feature point (j, w−1) to the feature point (h, w) of the second two-dimensional spatial feature distribution, j is a height coordinate, Y* and X′ are the extended labeled character sequence and the second two-dimensional spatial feature distribution, s is the serial number of a character in Y*, h is another height coordinate and w a width coordinate, h ∈ [1, 2, …, H], w ∈ [1, 2, …, W−1], with H and W the height and width of the second two-dimensional spatial feature distribution; p̃_{h,w}(Y*_s) belongs to the second character distribution probability and represents the probability that the feature point at position (h, w) belongs to a character in the second character sequence; and Ψ_{j,0,h} is calculated from the second initial path probability Ψ_{j,−1,h}.
It should be noted that, in this embodiment, the conditional probability of the second path may be calculated according to the following formula:

$$P(Y\mid X')=\sum_{h=1}^{H}\left(\beta_{2L+1,h,W}+\beta_{2L,h,W}\right)$$

wherein L = |Y| is the length of the target sequence Y before extension, so that the extended sequence Y* has length 2L+1.
After the conditional probability P(Y|X′) of the second path is obtained in the above manner, the target loss function can be determined from it: the target Loss function is the negative log-likelihood of the conditional probability of the second path, i.e. Loss = −ln P(Y|X′).
As can be seen from the above description, this embodiment recognizes sequences in images with a CTC model and, to overcome the limitation of the traditional CTC model, the inventors extended the traditional CTC model into a new two-dimensional CTC model that computes the conditional probability of the target sequence directly from a two-dimensional probability distribution. More specifically, compared with the traditional CTC model, the method provided by the application adds a height dimension to the time dimension when searching paths, and the path search can move between different heights. Search paths at different heights can still point to the same target sequence and, as before, the sum of the conditional probabilities of all such paths is the conditional probability of the target sequence.
By extending the traditional one-dimensional CTC model to two dimensions, image-based sequence recognition can retain the two-dimensional features of the image and compute the similarity between the two-dimensional distribution and the label directly, which greatly improves recognition accuracy. In addition, because the two-dimensional information is preserved, this extension also provides the ability to handle curved, slanted and perspective-distorted text. The two-dimensional CTC model in this application brings a new angle to character recognition methods and handles image-based sequence recognition in a more natural way, making it possible to preserve the two-dimensional distribution of the image.
In addition, for the computation of the CTC probability, naively computing the probability of every path and then summing them is extremely expensive; the dynamic programming recursion described above avoids this cost by reusing the probabilities of shared sub-paths.
Example 3:
the embodiment of the present invention further provides a character recognition apparatus, which is mainly used for executing the character recognition method provided by the above-mentioned content of the embodiment of the present invention, and the following describes the character recognition apparatus provided by the embodiment of the present invention in detail.
Fig. 7 is a schematic diagram of a character recognition apparatus according to an embodiment of the present invention, as shown in fig. 7, the character recognition apparatus mainly includes an obtaining unit 10, an extracting unit 20, and a determining unit 30, where:
an acquisition unit 10 for acquiring an image to be detected;
the extraction unit 20 is configured to extract the feature information of the image to be detected through a full convolutional neural network trained with the two-dimensional CTC model, so as to obtain first feature information;
wherein the first feature information includes at least one of: a first character distribution probability, a first path transition probability and a first initial path probability; the first character distribution probability is the probability that each feature point in the first two-dimensional spatial feature distribution of the image to be detected belongs to the first character sequence, and the first path transition probability represents the path selection probability in the height dimension of the first two-dimensional spatial feature distribution; the first initial path probability represents the probability that each feature point of the first two-dimensional spatial feature distribution is the starting feature point of a first path, the first path being a path predicted in the first two-dimensional spatial feature distribution that can be aligned to the first character sequence;
a determining unit 30, configured to determine the first text sequence in the image to be detected by using the first feature information of the image to be detected.
In the embodiment of the invention, an image to be detected is first acquired, and feature information of the image to be detected is extracted by a full convolution neural network trained by the two-dimensional CTC model to obtain first feature information, where the first feature information includes at least one of the following: a character distribution probability, a path transition probability and an initial path probability; finally, the first character sequence in the image to be detected is determined using the first feature information of the image to be detected. As can be seen from this description, the full convolution neural network is trained by the two-dimensional CTC model, and the trained network is used to perform sequence recognition on the image to be detected; this improves the recognition precision of the full convolution neural network and solves the technical problem of low sequence prediction accuracy caused by attention drift in existing image sequence recognition methods.
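To illustrate how the determining unit can turn the three probability maps into a character sequence, here is a minimal greedy (best-path) decoding sketch. The function name, array layout, and the greedy strategy are assumptions for exposition; the patent itself does not prescribe this exact procedure, and an exact decoder would sum over paths as in the dynamic program described later.

```python
import numpy as np

def greedy_decode(char_prob, trans_prob, init_prob, blank=0):
    """Best-path decoding over a two-dimensional distribution (sketch).

    char_prob:  (H, W, C) probability that the feature point (h, w)
                belongs to character class c (character distribution)
    trans_prob: (W-1, H, H) trans_prob[w, j, h] = probability of moving from
                height j at column w to height h at column w+1 (path transition)
    init_prob:  (H,) probability of starting the path at each height
    """
    H, W, C = char_prob.shape
    h = int(np.argmax(init_prob))                 # most likely start height
    labels = [int(np.argmax(char_prob[h, 0]))]
    for w in range(W - 1):
        h = int(np.argmax(trans_prob[w, h]))      # most likely next height
        labels.append(int(np.argmax(char_prob[h, w + 1])))
    out, prev = [], None                          # standard CTC collapse
    for c in labels:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return out
```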
Optionally, the full convolution neural network comprises: the device comprises a first convolution network, a pyramid pooling module and a second convolution network.
Optionally, the first convolution network is a residual convolutional neural network, the residual convolutional neural network includes a plurality of convolution modules, and some of the convolution modules include dilated (hole) convolution layers.
Optionally, the extraction unit 20 is configured to: perform feature extraction on the image to be detected using the first convolution network to obtain first convolution feature information; perform pooling calculation on the first convolution feature information using the pyramid pooling module to obtain pooled features of different scales, and concatenate (cascade) the pooled features of different scales to obtain pooled feature information; and perform convolution calculation on the pooled feature information using the second convolution network to obtain the first feature information of the image to be detected.
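For concreteness, here is a compact PyTorch sketch of the described extraction pipeline: a first convolution network (including a dilated convolution), a pyramid pooling module that pools at several scales and concatenates (cascades) the results with the input, and a second convolution network producing an output map. Channel counts, layer choices, and the use of PyTorch are assumptions for exposition; the patent does not fix these details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Pool at several scales, project, upsample, and concatenate (cascade)."""
    def __init__(self, channels, scales=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(s),
                          nn.Conv2d(channels, channels // len(scales), 1))
            for s in scales)

    def forward(self, x):
        h, w = x.shape[2:]
        pooled = [F.interpolate(stage(x), size=(h, w), mode='bilinear',
                                align_corners=False) for stage in self.stages]
        return torch.cat([x] + pooled, dim=1)

class TwoDimCTCNet(nn.Module):
    """First conv network -> pyramid pooling -> second conv network."""
    def __init__(self, num_classes=40):
        super().__init__()
        self.first = nn.Sequential(                      # backbone sketch
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=2, dilation=2), nn.ReLU())
        self.ppm = PyramidPooling(64)                    # 64 + 4*16 = 128 ch
        self.char_head = nn.Conv2d(128, num_classes, 1)  # char distribution
        # in practice, further 1x1 heads over the same features would
        # produce the path transition and initial path probability maps

    def forward(self, x):
        return self.char_head(self.ppm(self.first(x))).softmax(dim=1)

x = torch.randn(1, 3, 32, 128)
print(TwoDimCTCNet()(x).shape)    # torch.Size([1, 40, 16, 64])
```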
Optionally, the apparatus is further configured to: acquire a training sample image; extract feature information of the training sample image through an initial full convolution neural network to obtain second feature information, where the second feature information includes at least one of: a second character distribution probability, a second path transition probability and a second initial path probability, the second character distribution probability being the probability that each feature point in a second two-dimensional spatial feature distribution of the training sample image belongs to a character in a second character sequence, the second path transition probability representing the path selection probability in the height dimension in the second two-dimensional spatial feature distribution, and the second initial path probability representing the probability that each feature point of the second two-dimensional spatial feature distribution is an initial feature point on a second path, the second path being an effective path predicted in the second two-dimensional spatial feature distribution that can be aligned to the second character sequence; process the second feature information of the training sample image using the two-dimensional CTC model to obtain a target loss function; and train the initial full convolution neural network with the target loss function to obtain the full convolution neural network.
Optionally, the apparatus is further configured to: process the second feature information using the two-dimensional CTC model to obtain the conditional probability of a second path; and determine the target loss function based on the conditional probability of the second path.
Optionally, the apparatus is further configured to: calculate a target conditional probability β_{s,h,w} by combining a dynamic programming algorithm with the second feature information, where β_{s,h,w} represents the sum of the probabilities of all sub-paths from the position (h, w) of a second two-dimensional spatial feature distribution to the character at the s-th position in the second text sequence, the second two-dimensional spatial feature distribution being the spatial feature distribution of the training sample image; and calculate the conditional probability of the second path using the target conditional probability β_{s,h,w}.
Optionally, the apparatus is further configured to: calculate the target conditional probability β_{s,h,w} using a target formula, which is expressed as:

    β_{s,h,w} = p_{h,w}(Y*_s) · Σ_{j=1}^{H} Ψ_{j,w-1,h} · β̄_{s,j,w-1}

where

    β̄_{s,j,w-1} = β_{s,j,w-1} + β_{s-1,j,w-1},                   if Y*_s = blank or Y*_s = Y*_{s-2},

    β̄_{s,j,w-1} = β_{s,j,w-1} + β_{s-1,j,w-1} + β_{s-2,j,w-1},   otherwise.

Ψ_{j,w-1,h} denotes a second path transition probability, i.e. the transition probability from the feature point (j, w-1) in the second two-dimensional spatial feature distribution to the feature point (h, w) in the second two-dimensional spatial feature distribution; j denotes a height index in the second two-dimensional spatial feature distribution; Y* and X' denote the expanded labeled character sequence and the second two-dimensional spatial feature distribution, respectively; s denotes the sequence number of a character in Y*; h denotes another height coordinate and w a width coordinate in the second two-dimensional spatial feature distribution, with h ∈ [1, 2, ..., H] and w ∈ [1, 2, ..., W-1], where H denotes the height information and W the width information of the second two-dimensional spatial feature distribution; p_{h,w}(Y*_s) belongs to the second character distribution probability and denotes the probability that the feature point at position (h, w) belongs to the character Y*_s in the second text sequence; Ψ_{j,0,h} is calculated based on the second initial path probability Ψ_{j,-1,h}.
Optionally, the apparatus is further configured to: determine the target loss function using the formula Loss = -ln P(Y|X'), where P(Y|X') is the conditional probability of the second path and Loss is the target loss function.
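Putting the recursion and the loss together, below is a NumPy sketch of the dynamic program: it expands the label with blanks, propagates β_{s,h,w} column by column using the transition probabilities Ψ, and returns Loss = -ln P(Y|X'). The array layout, the handling of the first column via the initial path probability, and all names are assumptions for exposition under the reconstruction above, not the patent's reference implementation.

```python
import numpy as np

def two_dim_ctc_loss(char_prob, trans, init, label, blank=0):
    """Dynamic-programming sketch for the 2D-CTC loss (hypothetical layout).

    char_prob: (H, W, C)   character distribution at each feature point
    trans:     (W-1, H, H) trans[w-1, j, h] = Psi_{j,w-1,h}, the transition
               probability from height j at column w-1 to height h at column w
    init:      (H,)        initial path probability over heights at column 0
    label:     target class indices Y (without blanks)
    """
    H, W, C = char_prob.shape
    ystar = [blank]                       # expanded label Y*: blanks
    for c in label:                       # interleaved with the characters
        ystar += [c, blank]
    S = len(ystar)

    beta = np.zeros((S, H))               # beta[s, h] for the current column
    beta[0] = init * char_prob[:, 0, blank]
    if S > 1:
        beta[1] = init * char_prob[:, 0, ystar[1]]

    for w in range(1, W):
        moved = beta @ trans[w - 1]       # sum_j Psi_{j,w-1,h} * beta[s, j]
        new = moved.copy()                # stay at the same s
        new[1:] += moved[:-1]             # advance from s-1
        for s in range(2, S):             # skip from s-2 where permitted
            if ystar[s] != blank and ystar[s] != ystar[s - 2]:
                new[s] += moved[s - 2]
        beta = new * char_prob[:, w, :][:, ystar].T   # emit Y*_s at (h, w)

    p = beta[S - 1].sum() + beta[S - 2].sum()  # end on final char or blank
    return -np.log(p + 1e-300)                 # Loss = -ln P(Y|X')

# Toy usage with random, properly normalized distributions:
rng = np.random.default_rng(0)
cp = rng.random((8, 32, 40)); cp /= cp.sum(-1, keepdims=True)
tr = rng.random((31, 8, 8)); tr /= tr.sum(-1, keepdims=True)
print(two_dim_ctc_loss(cp, tr, np.full(8, 1 / 8), [3, 7, 7, 5]))
```

Because every quantity in the sketch is differentiable in the network outputs, the returned loss can be minimized directly to train the full convolution neural network.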
The present application further provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of any of the above-mentioned method embodiments.
The device provided by the embodiment has the same implementation principle and technical effect as the foregoing embodiment, and for the sake of brief description, reference may be made to the corresponding contents in the foregoing method embodiment for the portion of the embodiment of the device that is not mentioned.
Furthermore, the present embodiment provides a processing device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the character recognition method provided by the above embodiments is implemented.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the system described above may refer to the corresponding process in the foregoing embodiments, and is not described herein again.
The character recognition method and apparatus, the electronic device, and the computer-readable storage medium according to the embodiments of the present invention comprise program code whose instructions may be used to execute the methods described in the foregoing method embodiments; for specific implementation, reference may be made to the method embodiments, which are not repeated here. The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above-mentioned embodiments are merely specific embodiments of the present invention, which are used to illustrate the technical solutions of the present invention and not to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed by the present invention, they can still modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent substitutions of some technical features therein; such modifications, changes, or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present invention, and shall all be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (12)

1. A method for recognizing a character, comprising:
acquiring an image to be detected, and extracting feature information of the image to be detected through a full convolution neural network trained by a two-dimensional CTC model, to obtain first feature information;
wherein the first feature information includes: a first character distribution probability, a first path transition probability and a first initial path probability; the first character distribution probability is the probability that each feature point in a first two-dimensional spatial feature distribution of the image to be detected belongs to a character in a first character sequence, and the first path transition probability represents the path selection probability in the height dimension in the first two-dimensional spatial feature distribution; the first initial path probability represents the probability that each feature point of the first two-dimensional spatial feature distribution is an initial feature point on the first path; the first path is a predicted path in the first two-dimensional spatial feature distribution that can be aligned to the first character sequence;
and determining the first character sequence in the image to be detected by utilizing the first characteristic information of the image to be detected.
2. The method of claim 1, wherein the fully convolutional neural network comprises: the device comprises a first convolution network, a pyramid pooling module and a second convolution network.
3. The method of claim 2, wherein the first convolution network is a residual convolutional neural network including a plurality of convolution modules, and wherein some of the convolution modules include dilated (hole) convolution layers.
4. The method of claim 2 or 3, wherein extracting the feature information of the image to be detected through the full convolution neural network trained by the two-dimensional CTC model to obtain the first feature information comprises:
performing feature extraction on the image to be detected by using the first convolution network to obtain first convolution feature information;
performing pooling calculation on the first convolution characteristic information by using the pyramid pooling module to obtain pooling characteristics of different scales, and performing cascade processing on the pooling characteristics of different scales to obtain pooling characteristic information;
and carrying out convolution calculation on the pooled feature information by utilizing the second convolution network to obtain first feature information of the image to be detected.
5. The method of claim 1, further comprising:
acquiring a training sample image;
extracting feature information of the training sample image through an initial full convolution neural network to obtain second feature information; the second feature information includes at least one of: a second character distribution probability, a second path transition probability and a second initial path probability, wherein the second character distribution probability is the probability that each feature point in a second two-dimensional spatial feature distribution of the training sample image belongs to a character in a second character sequence, and the second path transition probability represents the path selection probability in the height dimension in the second two-dimensional spatial feature distribution; the second initial path probability represents the probability that each feature point of the second two-dimensional spatial feature distribution is an initial feature point on a second path, and the second path is an effective path predicted in the second two-dimensional spatial feature distribution that can be aligned to the second character sequence;
processing second characteristic information of the training sample image by using the two-dimensional CTC model to obtain a target loss function;
and training the initial full convolution neural network through the target loss function to obtain the full convolution neural network.
6. The method of claim 5, wherein processing second feature information of the training sample images using the two-dimensional CTC model to obtain an objective loss function comprises:
processing the second characteristic information by using the two-dimensional CTC model to obtain the conditional probability of a second path;
determining the objective loss function based on the conditional probability of the second path.
7. The method of claim 6, wherein processing the second feature information using the two-dimensional CTC model to obtain a conditional probability of a second path comprises:
combining a dynamic programming algorithm and the second feature information to calculate a target conditional probability β_{s,h,w}, wherein β_{s,h,w} represents the sum of the probabilities of all sub-paths from the position (h, w) of a second two-dimensional spatial feature distribution to the character at the s-th position in the second text sequence, and the second two-dimensional spatial feature distribution is the spatial feature distribution of the training sample image;
calculating a conditional probability of the second path using the target conditional probability β_{s,h,w}.
8. The method of claim 7, wherein calculating the target conditional probability in combination with a dynamic programming algorithm and the second feature information comprises:
calculating the target conditional probability β_{s,h,w} using a target formula, the target formula being expressed as:

    β_{s,h,w} = p_{h,w}(Y*_s) · Σ_{j=1}^{H} Ψ_{j,w-1,h} · β̄_{s,j,w-1}

wherein

    β̄_{s,j,w-1} = β_{s,j,w-1} + β_{s-1,j,w-1},                   if Y*_s = blank or Y*_s = Y*_{s-2},

    β̄_{s,j,w-1} = β_{s,j,w-1} + β_{s-1,j,w-1} + β_{s-2,j,w-1},   otherwise;

Ψ_{j,w-1,h} represents the second path transition probability, i.e. the transition probability from the feature point (j, w-1) in the second two-dimensional spatial feature distribution to the feature point (h, w) in the second two-dimensional spatial feature distribution; j represents a height coordinate in the second two-dimensional spatial feature distribution; Y* and X' represent the expanded labeled character sequence and the second two-dimensional spatial feature distribution, respectively; s represents the sequence number of a character in Y*; h represents another height coordinate in the second two-dimensional spatial feature distribution, and w represents a width coordinate in the second two-dimensional spatial feature distribution; h ∈ [1, 2, ..., H], w ∈ [1, 2, ..., W-1], wherein H represents height information in the second two-dimensional spatial feature distribution and W represents width information in the second two-dimensional spatial feature distribution;
p_{h,w}(Y*_s) belongs to the second character distribution probability and represents the probability that the feature point at position (h, w) belongs to the character Y*_s in the second text sequence; Ψ_{j,0,h} is calculated based on the second initial path probability Ψ_{j,-1,h}.
9. The method of claim 6, wherein determining the objective loss function based on the conditional probability of the second path comprises:
determining the target loss function using the formula Loss = -ln P(Y|X'), wherein P(Y|X') is the conditional probability of the second path and Loss is the target loss function.
10. A character recognition apparatus, comprising:
the acquisition unit is used for acquiring an image to be detected;
the extraction unit is used for extracting feature information of the image to be detected through a full convolution neural network trained by the two-dimensional CTC model to obtain first feature information;
wherein the first feature information includes: a first character distribution probability, a first path transition probability and a first initial path probability; the first character distribution probability is the probability that each feature point in a first two-dimensional spatial feature distribution of the image to be detected belongs to a character in a first character sequence, and the first path transition probability represents the path selection probability in the height dimension in the first two-dimensional spatial feature distribution; the first initial path probability represents the probability that each feature point of the first two-dimensional spatial feature distribution is an initial feature point on a first path, and the first path is a path predicted in the first two-dimensional spatial feature distribution that can be aligned to the first character sequence;
and the determining unit is used for determining the first character sequence in the image to be detected by utilizing the first characteristic information of the image to be detected.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method of any of the preceding claims 1 to 9 are implemented when the computer program is executed by the processor.
12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of the preceding claims 1 to 9.
CN201910488332.1A 2019-06-05 2019-06-05 Character recognition method and device, electronic equipment and computer readable storage medium Active CN110210480B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910488332.1A CN110210480B (en) 2019-06-05 2019-06-05 Character recognition method and device, electronic equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN110210480A CN110210480A (en) 2019-09-06
CN110210480B true CN110210480B (en) 2021-08-10

Family

ID=67791143

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910488332.1A Active CN110210480B (en) 2019-06-05 2019-06-05 Character recognition method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN110210480B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259764A (en) * 2020-01-10 2020-06-09 中国科学技术大学 Text detection method and device, electronic equipment and storage device
CN111259773A (en) * 2020-01-13 2020-06-09 中国科学院重庆绿色智能技术研究院 Irregular text line identification method and system based on bidirectional decoding
CN112270316B (en) * 2020-09-23 2023-06-20 北京旷视科技有限公司 Character recognition, training method and device of character recognition model and electronic equipment
CN112418209B (en) * 2020-12-15 2022-09-13 润联软件系统(深圳)有限公司 Character recognition method and device, computer equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570456A (en) * 2016-10-13 2017-04-19 华南理工大学 Handwritten Chinese character recognition method based on full-convolution recursive network
CN107330379A (en) * 2017-06-13 2017-11-07 内蒙古大学 A kind of Mongol hand-written recognition method and device
CN108460453A (en) * 2017-02-21 2018-08-28 阿里巴巴集团控股有限公司 It is a kind of to be used for data processing method, the apparatus and system that CTC is trained
CN108509881A (en) * 2018-03-22 2018-09-07 五邑大学 A kind of the Off-line Handwritten Chinese text recognition method of no cutting
CN109002461A (en) * 2018-06-04 2018-12-14 平安科技(深圳)有限公司 Handwriting model training method, text recognition method, device, equipment and medium
CN109272988A (en) * 2018-09-30 2019-01-25 江南大学 Audio recognition method based on multichannel convolutional neural networks

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107293291B (en) * 2016-03-30 2021-03-16 中国科学院声学研究所 End-to-end voice recognition method based on self-adaptive learning rate
US10714076B2 (en) * 2017-07-10 2020-07-14 Sony Interactive Entertainment Inc. Initialization of CTC speech recognition with standard HMM

Also Published As

Publication number Publication date
CN110210480A (en) 2019-09-06

Similar Documents

Publication Publication Date Title
CN110738207B (en) Character detection method for fusing character area edge information in character image
CN110210480B (en) Character recognition method and device, electronic equipment and computer readable storage medium
CN109508681B (en) Method and device for generating human body key point detection model
WO2020238560A1 (en) Video target tracking method and apparatus, computer device and storage medium
US20200372660A1 (en) Image salient object segmentation method and apparatus based on reciprocal attention between foreground and background
CN107944450B (en) License plate recognition method and device
CN111681273B (en) Image segmentation method and device, electronic equipment and readable storage medium
CN111615702B (en) Method, device and equipment for extracting structured data from image
CN110175609B (en) Interface element detection method, device and equipment
CN109886223B (en) Face recognition method, bottom library input method and device and electronic equipment
CN112101386B (en) Text detection method, device, computer equipment and storage medium
CN112364873A (en) Character recognition method and device for curved text image and computer equipment
CN111523463B (en) Target tracking method and training method based on matching-regression network
CN113496208B (en) Video scene classification method and device, storage medium and terminal
CN111652181B (en) Target tracking method and device and electronic equipment
CN116994000A (en) Part edge feature extraction method and device, electronic equipment and storage medium
CN108875501B (en) Human body attribute identification method, device, system and storage medium
CN113610856B (en) Method and device for training image segmentation model and image segmentation
CN115810152A (en) Remote sensing image change detection method and device based on graph convolution and computer equipment
CN111612714B (en) Image restoration method and device and electronic equipment
KR20190093752A (en) Method and system for scene text detection using deep learning
Rodin et al. Document image quality assessment via explicit blur and text size estimation
CN112801960A (en) Image processing method and device, storage medium and electronic equipment
JP6609181B2 (en) Character attribute estimation apparatus and character attribute estimation program
CN117423116B (en) Training method of text detection model, text detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant