CN111666931B - Mixed convolution text image recognition method, device, equipment and storage medium - Google Patents

Mixed convolution text image recognition method, device, equipment and storage medium

Info

Publication number
CN111666931B
CN111666931B · CN202010436538.2A
Authority
CN
China
Prior art keywords
feature
convolution
image
character
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010436538.2A
Other languages
Chinese (zh)
Other versions
CN111666931A (en)
Inventor
何胜
喻宁
冯晶凌
柳阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010436538.2A priority Critical patent/CN111666931B/en
Publication of CN111666931A publication Critical patent/CN111666931A/en
Application granted granted Critical
Publication of CN111666931B publication Critical patent/CN111666931B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/63 Scene text, e.g. street names
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Character Discrimination (AREA)

Abstract

The invention relates to artificial intelligence and provides a hybrid convolution based text image recognition method, device, computer equipment and storage medium. An image to be recognized is obtained; a text image recognition model containing a mixed convolution kernel is obtained; the image to be recognized is input into the text image recognition model to obtain a convolution feature map corresponding to the image to be recognized; the convolution feature map is input into a recurrent neural network of the text image recognition model for feature extraction to obtain sequence features; the sequence features are input into a fully connected layer of the text image recognition model to obtain a character probability distribution result; and error loss calculation is performed on the character probability distribution result using a preset loss function to obtain a character recognition result of the image to be recognized. Because the text image recognition model uses a mixed convolution kernel for convolution feature extraction, adjacent character regions are fully learned, which improves both the accuracy and the coverage of character recognition.

Description

Mixed convolution text image recognition method, device, equipment and storage medium
Technical Field
The present invention relates to the field of text recognition technologies, and in particular, to a method, an apparatus, a device, and a storage medium for recognizing text images based on hybrid convolution.
Background
With the development of artificial intelligence, natural language processing fields such as text detection, text recognition, speech recognition and text translation have also developed rapidly. Text detection and text recognition are widely applied in scenarios such as road sign detection and recognition, scene text translation, and book text recognition and extraction.
In the prior art, text detection and text recognition are mostly implemented with convolutional neural networks. However, when a convolutional neural network recognizes special fonts, such as artistic fonts, handwritten fonts or irregular fonts, errors occur easily, and a general convolutional neural network can only focus on local characters within a small coverage area, so the recognition range is small and recognition efficiency suffers.
Disclosure of Invention
The embodiments of the invention provide a hybrid convolution based text image recognition method, device, equipment and storage medium to solve the problem of high character recognition error rates.
A hybrid convolution based text image recognition method comprises the following steps:
acquiring an image to be recognized, wherein the image to be recognized contains at least one character;
acquiring a text image recognition model comprising a mixed convolution kernel, wherein the mixed convolution kernel comprises 3×3 convolution kernels, 5×5 convolution kernels and 7×7 convolution kernels;
inputting the image to be recognized into the text image recognition model to obtain a convolution feature map corresponding to the image to be recognized;
inputting the convolution feature map into a recurrent neural network of the text image recognition model for feature extraction to obtain sequence features;
inputting the sequence features into a fully connected layer of the text image recognition model to obtain a character probability distribution result;
and performing error loss calculation on the character probability distribution result using a preset loss function to obtain a character recognition result of the image to be recognized.
A hybrid convolution based text image recognition device, comprising:
an image acquisition module, configured to acquire an image to be recognized, wherein the image to be recognized contains at least one character;
a model acquisition module, configured to acquire a text image recognition model comprising a mixed convolution kernel, wherein the mixed convolution kernel comprises 3×3 convolution kernels, 5×5 convolution kernels and 7×7 convolution kernels;
a convolution recognition module, configured to input the image to be recognized into the text image recognition model to obtain a convolution feature map corresponding to the image to be recognized;
a feature extraction module, configured to input the convolution feature map into a recurrent neural network of the text image recognition model for feature extraction to obtain sequence features;
a character distribution module, configured to input the sequence features into a fully connected layer of the text image recognition model to obtain a character probability distribution result;
and a loss calculation module, configured to perform error loss calculation on the character probability distribution result using a preset loss function to obtain a character recognition result of the image to be recognized.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the above hybrid convolution based text image recognition method when executing the computer program.
A computer readable storage medium storing a computer program which, when executed by a processor, implements the above hybrid convolution based text image recognition method.
According to the hybrid convolution based text image recognition method, device, equipment and storage medium, an image to be recognized containing at least one character is obtained; a text image recognition model containing a mixed convolution kernel is obtained, the mixed convolution kernel comprising 3×3, 5×5 and 7×7 convolution kernels; the image to be recognized is input into the text image recognition model to obtain a corresponding convolution feature map; the convolution feature map is input into a recurrent neural network of the text image recognition model for feature extraction to obtain sequence features; the sequence features are input into a fully connected layer of the text image recognition model to obtain a character probability distribution result; and error loss calculation is performed on the character probability distribution result using a preset loss function to obtain a character recognition result of the image to be recognized. Through these steps, the text image recognition model uses a mixed convolution kernel for convolution feature extraction. The traditional approach of extracting convolution features with only a single type of convolution kernel focuses only on local characters in a region of small coverage, and when the image to be recognized contains artistic or handwritten fonts this small coverage leads to low character recognition accuracy. By using a mixed convolution kernel, the large receptive field of the larger convolution kernels is exploited so that adjacent character regions are fully learned, improving both the accuracy and the coverage of character recognition.
Drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of an application environment of the hybrid convolution based text image recognition method according to an embodiment of the present invention;
FIG. 2 is a flowchart of the hybrid convolution based text image recognition method according to an embodiment of the present invention;
FIG. 3 is a flowchart of step S13 in the hybrid convolution based text image recognition method according to an embodiment of the present invention;
FIG. 4 is a flowchart of step S14 in the hybrid convolution based text image recognition method according to an embodiment of the present invention;
FIG. 5 is a schematic block diagram of the hybrid convolution based text image recognition device according to an embodiment of the present invention;
FIG. 6 is a schematic block diagram of the convolution recognition module in the hybrid convolution based text image recognition device according to an embodiment of the present invention;
FIG. 7 is a schematic block diagram of the feature extraction module in the hybrid convolution based text image recognition device according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a computer device according to an embodiment of the invention.
Detailed Description
The embodiment of the invention provides a hybrid convolution based text image recognition method which can be applied to the application environment shown in fig. 1. Specifically, the method is applied to a hybrid convolution based text image recognition system which, as shown in fig. 1, comprises a client and a server that communicate over a network. The client refers to the program, corresponding to the server, that provides local services for the user. The client may be installed on, but is not limited to, various personal computers, notebook computers, smartphones, tablet computers and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.
In one embodiment, as shown in fig. 2, a hybrid convolution based text image recognition method is provided. Taking the method as applied to the server in fig. 1 as an example, it includes the following steps:
s11: and acquiring an image to be recognized, wherein the image to be recognized contains at least one character.
S12: acquiring a character image recognition model containing a mixed convolution kernel; the hybrid convolution kernels include 3*3 convolution kernels, 5*5 convolution kernels, and 7*7 convolution kernels.
The image to be recognized is an image, to be recognized, that contains at least one character. The mixed convolution kernel includes a plurality of convolution kernels; specifically, it may include 3×3, 5×5, 7×7 or 11×11 convolution kernels, among others. The text image recognition model is a model for recognizing images containing characters, and a mixed convolution kernel is used as its convolution kernel.
Specifically, an original image is obtained and character detection is performed on it; if the original image contains at least one character, it is taken as the image to be recognized. After the image to be recognized is acquired, the text image recognition model is acquired; its convolution kernel is a mixed convolution kernel, comprising 3×3 convolution kernels, 5×5 convolution kernels, 7×7 convolution kernels, and so on.
S13: inputting the image to be recognized into the text image recognition model to obtain a convolution feature map corresponding to the image to be recognized.
The convolution feature map is a feature map containing the character features of the image to be recognized.
Specifically, after the image to be recognized and the text image recognition model are obtained, the image to be recognized is input into the text image recognition model, convolution feature extraction is performed on it separately with each of the mixed convolution kernels, the resulting feature maps are merged, and finally feature fusion is performed on the merged maps to obtain the convolution feature map corresponding to the image to be recognized.
Further, the mixed convolution kernel effectively exploits the large receptive field of the larger convolution kernels, so that adjacent character regions are fully learned rather than only a focused local area. For example, assume the original width of the image to be recognized is 800 and the width is reduced by a factor of 4, from 800 to 200, during convolution feature extraction. If feature extraction then continues with a conventional single 3×3 convolution kernel, its receptive field covers an area of 12×12 input pixels, and 12 pixels may not cover adjacent characters at the same time; the receptive field of a 7×7 convolution kernel covers 28×28 input pixels, which in most cases is large enough to perceive adjacent characters and gather more context.
The receptive field could also be enlarged with additional convolution layers or pooling layers with a stride greater than 1. However, using too many such layers makes the extracted convolution feature map too narrow, which hurts the recurrent neural network's feature extraction and prediction at each position in the subsequent steps.
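The receptive-field arithmetic above can be sketched as a one-line rule of thumb (an approximation assuming a single convolution applied after uniform downsampling; `effective_field` is an illustrative helper, not part of the patent):

```python
def effective_field(kernel_size: int, downsample_factor: int) -> int:
    """Rough receptive field, in input pixels, of one kernel_size x kernel_size
    convolution applied after downsample_factor x spatial reduction."""
    return kernel_size * downsample_factor

# After the 4x width reduction in the example (800 -> 200):
for k in (3, 5, 7):
    print(f"{k}x{k} kernel sees about {effective_field(k, 4)} input pixels per side")
```

On this rough measure the 7×7 kernel sees 28 input pixels per side versus 12 for the 3×3 kernel, matching the figures in the example.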
S14: inputting the convolution feature map into a recurrent neural network of the text image recognition model for feature extraction to obtain sequence features.
The recurrent neural network extracts the sequential (front-to-back) relations of the convolution feature map. The sequence features form a feature map carrying sequence context.
Specifically, after the image to be recognized has been input into the text image recognition model to obtain the corresponding convolution feature map, the convolution feature map is input into the recurrent neural network, where its sequential relations are extracted to obtain the sequence features.
Optionally, the recurrent neural network may have an RNN, LSTM, GRU or BiLSTM structure. Preferably, the recurrent neural network in the above embodiment adopts the BiLSTM structure. BiLSTM combines a forward LSTM and a backward LSTM, so it captures bidirectional semantic dependencies better and avoids semantic errors.
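The bidirectional wiring of BiLSTM can be illustrated with a minimal NumPy sketch. To keep it short, this uses a plain tanh RNN cell rather than gated LSTM cells, so only the forward/backward combination is faithful to the description above; all names and sizes are illustrative:

```python
import numpy as np

def rnn_pass(x, W_in, W_h):
    """Run a plain tanh RNN over a sequence x of shape (T, D); returns (T, H)."""
    T, _ = x.shape
    h = np.zeros(W_h.shape[0])
    outs = []
    for t in range(T):
        h = np.tanh(x[t] @ W_in + h @ W_h)
        outs.append(h)
    return np.stack(outs)

def bidirectional(x, W_in_f, W_h_f, W_in_b, W_h_b):
    """Concatenate a forward and a backward pass, BiLSTM-style: (T, 2H)."""
    fwd = rnn_pass(x, W_in_f, W_h_f)
    bwd = rnn_pass(x[::-1], W_in_b, W_h_b)[::-1]  # reverse input, then un-reverse output
    return np.concatenate([fwd, bwd], axis=1)

rng = np.random.default_rng(0)
T, D, H = 200, 64, 32  # sequence length, input dim, hidden size (illustrative)
x = rng.standard_normal((T, D))
out = bidirectional(x,
                    rng.standard_normal((D, H)) * 0.1, rng.standard_normal((H, H)) * 0.1,
                    rng.standard_normal((D, H)) * 0.1, rng.standard_normal((H, H)) * 0.1)
print(out.shape)  # (200, 64)
```

Each output position thus carries context from both directions, which is what lets the BiLSTM structure capture bidirectional semantic dependencies.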
S15: inputting the sequence features into a fully connected layer of the text image recognition model to obtain a character probability distribution result.
The character probability distribution result is the probability distribution of the character at each position.
After the convolution feature map has been passed through the recurrent neural network to obtain the sequence features, the sequence features are input into the fully connected layer of the text image recognition model to obtain, for each position in the image to be recognized, the distribution probability of each character, yielding the character probability distribution result.
S16: performing error loss calculation on the character probability distribution result using a preset loss function to obtain a character recognition result of the image to be recognized.
The preset loss function measures the degree of difference between the input data and the data actually output by the neural network. The character recognition result is the recognized text in the image to be recognized.
Specifically, after the sequence features have been input into the fully connected layer of the text image recognition model to obtain the character probability distribution result, error loss calculation is performed on the character probability distribution result with the preset loss function to obtain the character recognition result of the image to be recognized.
Optionally, the preset loss function may be a CTC function, an Attention function, or the like; preferably, the above embodiments use the CTC function. For example, if the characters in the image to be recognized are handwritten, the strokes of adjacent characters tend to be run together or spaced too far apart. When the word "to" is handwritten, a writer may join the two letters so that the raw prediction is "t-o", or a character may occupy a wide area so that the position-wise probability distribution yields several identical characters, e.g. "to" is recognized as "ttooo". Without loss-function processing, the obtained character recognition result would be "t-o" or "ttooo". The CTC function first encodes the obtained character probability distribution result, then computes the loss on the encoded result, and finally decodes the result after the loss calculation to produce the final character recognition result of the image to be recognized, namely "to". The error rate of the character recognition result is therefore not increased by irregular handwriting.
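The collapsing behaviour described above ("ttooo" → "to", with "-" standing for the CTC blank) can be shown with a minimal greedy CTC decoder. This is an illustrative sketch, not the patent's implementation; real CTC training also computes a differentiable loss over all alignments:

```python
def ctc_greedy_decode(indices, blank=0):
    """Standard CTC decoding rule: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for i in indices:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out

# Toy alphabet for the "to" example (index 0 is the CTC blank, shown as "-"):
alphabet = {1: "t", 2: "o"}
frames = [1, 1, 0, 2, 2, 2]  # per-time-step argmax of the probability result
print("".join(alphabet[i] for i in ctc_greedy_decode(frames)))  # -> to
```

A blank between two identical labels keeps them distinct, so the frame sequence [1, 0, 1] decodes to "tt" rather than collapsing to a single "t".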
In this embodiment, the text image recognition model uses a mixed convolution kernel for convolution feature extraction. Conventionally, only a single type of convolution kernel is used, which focuses only on local characters in a region of small coverage; when the image to be recognized contains artistic or handwritten fonts, this small coverage leads to low character recognition accuracy. By using a mixed convolution kernel, the large receptive field of the larger convolution kernels is exploited so that adjacent character regions are fully learned, improving both the accuracy and the coverage of character recognition.
In one embodiment, as shown in fig. 3, step S13, inputting the image to be recognized into the text image recognition model to obtain the corresponding convolution feature map, specifically includes the following steps:
S131: inputting the image to be recognized into the text image recognition model, and performing convolution feature extraction on it with the mixed convolution kernel to obtain a feature map corresponding to each type of convolution kernel in the mixed convolution kernel.
Specifically, after the image to be recognized and the text image recognition model are obtained, the image to be recognized is input into the text image recognition model. The model contains 3×3, 5×5 and 7×7 convolution kernels, each of which performs convolution feature extraction on the image to be recognized, yielding a feature map for the 3×3 kernels, a feature map for the 5×5 kernels and a feature map for the 7×7 kernels.
For example, assuming the text image recognition model contains 32 3×3 convolution kernels, 32 5×5 convolution kernels and 32 7×7 convolution kernels, these kernels each perform convolution feature extraction on the image to be recognized, yielding three 40×40×32 feature maps.
S132: merging the feature maps corresponding to each type of convolution kernel to obtain a merged feature map.
The merged feature map is a feature map generated by combining several feature maps, typically the convolution feature maps produced by several different convolution kernels.
Specifically, after convolution feature extraction with the mixed convolution kernel has produced a feature map for each type of convolution kernel, the feature maps of the 3×3, 5×5 and 7×7 kernels are merged to obtain the merged feature map.
For example, assuming 32 3×3, 32 5×5 and 32 7×7 convolution kernels have produced three 40×40×32 feature maps, these are merged into one 40×40×96 merged feature map.
S133: performing feature fusion on the merged feature map with 1×1 convolution kernels to obtain the convolution feature map.
The convolution feature map is obtained by extracting, merging and fusing the convolution features of the image.
Specifically, after the feature maps corresponding to each type of convolution kernel have been merged, feature fusion is performed on the merged feature map with 1×1 convolution kernels to obtain the convolution feature map.
Illustratively, after three 40×40×32 feature maps have been merged into a 40×40×96 merged feature map, 64 1×1 convolution kernels fuse the extracted features of the merged map, producing a 40×40×64 convolution feature map.
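The merge-then-fuse procedure of S132 and S133 can be reproduced at shape level in NumPy. A 1×1 convolution is just a per-pixel linear map over channels, so 64 1×1 kernels reduce to one matrix product; the channels-last layout and random weights here are illustrative, not the patent's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Three 40x40x32 feature maps, one per kernel size (3x3, 5x5, 7x7):
maps = [rng.standard_normal((40, 40, 32)) for _ in range(3)]

merged = np.concatenate(maps, axis=-1)   # S132: merge channels -> 40x40x96

w = rng.standard_normal((96, 64)) * 0.1  # 64 1x1 kernels == a (96, 64) weight matrix
fused = merged @ w                       # S133: feature fusion -> 40x40x64
print(merged.shape, fused.shape)
```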
In this embodiment, the mixed convolution kernel performs convolution feature extraction on the image to be recognized, so that when the characters in the image are distorted, artistic or handwritten, the 5×5 or 7×7 convolution kernels perceive the text information on both adjacent sides better, improving the accuracy of text recognition.
In one embodiment, as shown in fig. 4, step S14, inputting the convolution feature map into the recurrent neural network of the text image recognition model for feature extraction to obtain the sequence features, further includes the following steps:
S141: performing feature conversion on the convolution feature maps of heights 4, 2 and 1 to obtain a first transposed feature corresponding to the height-4 map, a second transposed feature corresponding to the height-2 map and a third transposed feature corresponding to the height-1 map.
The first, second and third transposed features are all obtained by feature conversion of the original convolution feature maps, and are essentially sequence-type features.
Specifically, while the text image recognition model performs convolution feature extraction on the image to be recognized, i.e. while convolution reduces the height of the image, feature conversion is applied to the convolution feature maps of heights 4, 2 and 1 to obtain the first, second and third transposed features respectively.
Exemplarily, while the image to be recognized is being convolved, assume the currently extracted intermediate feature map is 8×200×C. As subsequent convolutions reduce the height, when the convolution feature maps reach heights 4, 2 and 1, feature conversion is performed on the height-4 map to obtain a 200×4×C first transposed feature, on the height-2 map to obtain a 200×2×C second transposed feature, and on the height-1 map to obtain a 200×1×C third transposed feature.
In 8×200×C, 8 denotes the height of the image, 200 the width, and C the number of feature channels.
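At shape level, the feature conversion of S141 is a transpose that turns each of the 200 width positions into one time step of a sequence (the channel count C is illustrative):

```python
import numpy as np

C = 512  # illustrative channel count
for h in (4, 2, 1):
    fmap = np.zeros((h, 200, C))          # convolution feature map, (height, width, C)
    transposed = fmap.transpose(1, 0, 2)  # swap height and width -> (width, height, C)
    print(transposed.shape)               # (200, 4, C), (200, 2, C), (200, 1, C)
```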
S142: inputting the first, second and third transposed features into the recurrent neural network and performing global feature extraction on them to obtain a first sequence feature corresponding to the first transposed feature, a second sequence feature corresponding to the second transposed feature and a third sequence feature corresponding to the third transposed feature.
The recurrent neural network recurses over the input sequence data along the evolution direction of the sequence, with all nodes chained together. It extracts the sequential relations of the convolution feature maps.
Specifically, after the first, second and third transposed features are obtained, they are input into the recurrent neural network and global feature extraction is performed on them, producing the first, second and third sequence features.
Optionally, the recurrent neural network may have an RNN, LSTM, BiLSTM or GRU structure; preferably, this embodiment uses a BiLSTM recurrent neural network. For example, after the 200×4×C, 200×2×C and 200×1×C transposed features have been obtained, global feature extraction with a BiLSTM recurrent neural network yields three 200×H sequence features, i.e. the first, second and third sequence features are each 200×H.
In 200×H, H denotes the size of the hidden layer of the BiLSTM structure.
S143: merge the first sequence feature, the second sequence feature and the third sequence feature to obtain the sequence feature.
The sequence feature is a feature map that carries the front-to-back sequential relations of the input.
Specifically, after the first sequence feature, the second sequence feature and the third sequence feature are obtained, they are concatenated in order to obtain the sequence feature.
For example, assume that global feature extraction on the first, second and third transposed features with the BiLSTM recurrent neural network yields three sequence features of size 200×H; concatenating the first, second and third sequence features then gives a sequence feature of size 200×3H.
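Under these assumptions, the conversion from multi-scale convolution feature maps to a merged sequence feature can be sketched as follows. The names are illustrative and a toy random projection with H = 32 stands in for the BiLSTM; a real implementation would use a bidirectional LSTM layer:

```python
import numpy as np

def to_sequence(feature_map):
    """Transpose a C x H x W conv feature map into a W x (H*C) sequence,
    so that the width dimension becomes the time axis fed to the RNN."""
    c, h, w = feature_map.shape
    return feature_map.transpose(2, 1, 0).reshape(w, h * c)

def extract_sequence_feature(seq, hidden=32, seed=0):
    """Toy stand-in for the BiLSTM: projects each time step to size H."""
    rng = np.random.default_rng(seed)
    w_proj = rng.standard_normal((seq.shape[1], hidden))
    return np.tanh(seq @ w_proj)            # shape: (W, H)

c = 8
# Convolution feature maps of heights 4, 2 and 1 with width 200.
maps = [np.ones((c, h, 200)) for h in (4, 2, 1)]
seq_feats = [extract_sequence_feature(to_sequence(m)) for m in maps]
merged = np.concatenate(seq_feats, axis=1)  # 200 x 3H
print(merged.shape)                         # (200, 96)
```

With H = 32, the three per-scale sequence features of size 200×H concatenate into the 200×3H sequence feature described above.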
In this embodiment, feature conversion is performed on convolution feature maps of heights 4, 2 and 1, so that multi-scale feature maps supervise the feature conversion process rather than convolving down to only the final scale before making a judgment. This increases the information-transmission capability of the convolution stages and improves the robustness of the sequence feature.
In one embodiment, before step S13, that is, before inputting the image to be recognized into the text image recognition model, the method further includes the following steps:
Determine the maximum sequence length corresponding to the image to be recognized according to the width of the image to be recognized, and determine the convolution step length of the mixed convolution kernel of the text image recognition model according to the maximum sequence length.
The convolution step length (stride) is the distance the convolution kernel moves at each step.
Specifically, after the image to be recognized and the text image recognition model are obtained, the actual maximum sequence length of the image to be recognized is determined from the width of the image, and the maximum sequence length corresponding to the image is determined from that actual maximum; after the maximum sequence length is determined, the convolution step length of the mixed convolution kernel in the text image recognition model is determined from it.
Illustratively, assume the width of the image to be recognized is 800 and the actual maximum sequence length is 96; the maximum sequence length is then set to 200. Because 800/200 = 4 is a power of 2, only two convolution operations with a step length of 2 are needed while the image passes through the text image recognition model for convolution feature extraction, yielding a regular feature map.
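A minimal sketch of this relationship, assuming the rule that the image width must be the maximum sequence length times a power of two (the function name is illustrative, not from the patent):

```python
def stride2_conv_count(image_width, max_seq_len):
    """Number of stride-2 convolutions needed so the feature-map width
    shrinks from image_width to max_seq_len; requires the ratio to be
    a power of two, as in the 800 -> 200 example."""
    assert image_width % max_seq_len == 0, "width must be a multiple of max_seq_len"
    ratio, n = image_width // max_seq_len, 0
    while ratio > 1:
        assert ratio % 2 == 0, "width / max_seq_len must be a power of two"
        ratio //= 2
        n += 1
    return n

print(stride2_conv_count(800, 200))  # 2, matching the example above
```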
Inputting the image to be recognized into the text image recognition model and performing convolution feature extraction on it with the mixed convolution kernel to obtain a feature map corresponding to each type of convolution kernel in the mixed convolution kernel includes:
after the image to be recognized is input into the text image recognition model, moving each type of convolution kernel in the mixed convolution kernel according to the convolution step length, and performing convolution feature extraction on the image with each moved kernel to obtain the feature maps corresponding to the various convolution kernels.
Specifically, after the maximum sequence length corresponding to the image to be recognized is determined from its width and the convolution step length of the mixed convolution kernel is determined from the maximum sequence length, the image to be recognized is input into the text image recognition model, the various convolution kernels in the mixed convolution kernel are moved according to the convolution step length, and during the movement each convolution kernel performs convolution feature extraction on the image, yielding a feature map for each type of convolution kernel.
Conventionally, the maximum-sequence-length parameter is set arbitrarily. In this embodiment it is instead determined from the width of the image to be recognized; using a maximum sequence length with such regularity yields more regular information during convolution feature extraction and improves the accuracy of character recognition.
In one embodiment, before step S12, that is, before the text-image recognition model including the mixed convolution kernel is acquired, the following steps are further included:
A training sample set is obtained, where the training sample set includes text sample images and each text sample image contains at least one character.
An initial text image recognition model including a mixed convolution kernel is obtained, where the mixed convolution kernel includes 3×3, 5×5 and 7×7 convolution kernels.
The training sample set is the set of text sample images to be used for training. A text sample image is an image containing at least one character. The initial text image recognition model is the base model for recognizing images that contain characters.
Specifically, a number of text sample images from different scenes are obtained and character detection is performed on each of them; if a text sample image contains at least one character, it is stored into the training sample set. After the training sample set is determined, the initial text image recognition model is obtained; its convolution layers adopt a mixed convolution kernel that includes 3×3, 5×5 and 7×7 convolution kernels.
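As an illustrative sketch (not the patented implementation), the mixed-kernel feature extraction followed by channel merging and 1×1 fusion described in this document can be expressed on a single-channel image as follows; all names and the random weights are assumptions for demonstration:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_same(img, kernel):
    """Single-channel 2D convolution with zero padding ('same' output size)."""
    k = kernel.shape[0]
    padded = np.pad(img, k // 2)
    windows = sliding_window_view(padded, (k, k))
    return np.einsum('ijkl,kl->ij', windows, kernel)

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 200))                    # grayscale text-line image
kernels = {k: rng.standard_normal((k, k)) for k in (3, 5, 7)}

# One feature map per kernel size, then a channel-wise merge.
maps = np.stack([conv2d_same(img, kern) for kern in kernels.values()])
print(maps.shape)   # (3, 32, 200)

# A 1x1 convolution is a per-pixel weighted sum across the merged channels.
w1x1 = rng.standard_normal(3)
fused = np.tensordot(w1x1, maps, axes=1)
print(fused.shape)  # (32, 200)
```

Running the 3×3, 5×5 and 7×7 kernels in parallel keeps receptive fields of several scales, and the 1×1 fusion collapses the merged maps back to one convolution feature map, mirroring the merge-then-fuse order described above.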
The training sample set is input into the initial text image recognition model for training to obtain the text image recognition model.
The text image recognition model is the model obtained by training the initial text image recognition model on the training sample set.
The process of building the text image recognition model includes: inputting the training sample set into the initial text image recognition model; defining the moving step length of the mixed convolution kernel in the initial model according to the text sample images in the training sample set; defining the structures of the convolutional neural network, the recurrent neural network and the fully connected layer in the initial model, defining the forward-propagation process, and iteratively training the initial model with the defined forward propagation to obtain a trained model; and testing and verifying the trained model to obtain the text image recognition model.
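The iterative training described above can be sketched in heavily simplified form, with a single linear layer and plain gradient descent standing in for the full CNN + BiLSTM model and its optimizer; all data and names below are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def train_step(w, x, y, lr=0.1):
    """One forward/backward pass with cross-entropy loss (toy stand-in for
    the model's iterative training; the real model is a CNN + BiLSTM)."""
    probs = softmax(x @ w)
    loss = -np.log(probs[np.arange(len(y)), y]).mean()
    grad = x.T @ (probs - np.eye(w.shape[1])[y]) / len(y)
    return w - lr * grad, loss

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 10))   # stand-in for extracted sequence features
y = rng.integers(0, 5, 64)          # stand-in character-class labels
w = np.zeros((10, 5))
losses = []
for _ in range(50):                 # the iterative-training loop
    w, loss = train_step(w, x, y)
    losses.append(loss)
print(losses[0] > losses[-1])       # True: loss decreases over training
```

The held-out testing and verification step would then evaluate the trained weights on samples not seen during the loop.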
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present invention.
In one embodiment, a hybrid-convolution-based text image recognition device is provided, where the hybrid-convolution-based text image recognition device corresponds to the hybrid-convolution-based text image recognition method in the above embodiment. As shown in fig. 5, the hybrid convolution based text image recognition device includes an image acquisition module 11, a model acquisition module 12, a convolution recognition module 13, a feature extraction module 14, a character distribution module 15, and a loss calculation module 16. The functional modules are described in detail as follows:
The image acquisition module 11 is configured to acquire an image to be recognized, where the image to be recognized contains at least one character.
The model acquisition module 12 is configured to acquire a text image recognition model that includes a mixed convolution kernel, where the mixed convolution kernel includes 3×3, 5×5 and 7×7 convolution kernels.
The convolution recognition module 13 is configured to input the image to be recognized into the text image recognition model and obtain a convolution feature map corresponding to the image.
The feature extraction module 14 is configured to input the convolution feature map into the recurrent neural network of the text image recognition model for feature extraction to obtain a sequence feature.
The character distribution module 15 is configured to input the sequence feature into the fully connected layer of the text image recognition model to obtain a character probability distribution result.
The loss calculation module 16 is configured to perform error loss calculation on the character probability distribution result with a preset loss function to obtain a character recognition result of the image to be recognized.
Preferably, as shown in fig. 6, the convolution recognition module 13 includes the following units:
the convolution feature extraction unit 131 is configured to input the image to be recognized into the text image recognition model and perform convolution feature extraction on it with the mixed convolution kernel to obtain a feature map corresponding to each type of convolution kernel in the mixed convolution kernel;
the feature merging unit 132 is configured to merge the feature maps corresponding to each type of convolution kernel to obtain a merged feature map;
the feature fusion unit 133 is configured to perform feature fusion on the merged feature map with the 1×1 convolution kernel to obtain a convolution feature map.
Preferably, as shown in fig. 7, the feature extraction module 14 includes the following units:
The feature conversion unit 141 is configured to perform feature conversion on the convolution feature maps of heights 4, 2 and 1, respectively, to obtain a first transposed feature corresponding to the feature map of height 4, a second transposed feature corresponding to the feature map of height 2, and a third transposed feature corresponding to the feature map of height 1.
The global feature extraction unit 142 is configured to input the first, second and third transposed features into the recurrent neural network and perform global feature extraction on them to obtain a first sequence feature corresponding to the first transposed feature, a second sequence feature corresponding to the second transposed feature, and a third sequence feature corresponding to the third transposed feature.
The sequence feature merging unit 143 is configured to merge the first, second and third sequence features to obtain the sequence feature.
Preferably, the hybrid convolution-based text image recognition device further comprises the following modules:
The convolution step length determining module is used for determining the maximum sequence length corresponding to the image to be recognized according to the width of the image to be recognized, and determining the convolution step length of the mixed convolution kernel of the character image recognition model according to the maximum sequence length.
Preferably, the hybrid convolution-based text image recognition device further comprises the following modules:
The training sample set acquisition module is configured to acquire a training sample set, where the training sample set includes text sample images and each text sample image contains at least one character.
The initial model acquisition module is configured to acquire an initial text image recognition model that includes a mixed convolution kernel, where the mixed convolution kernel includes 3×3, 5×5 and 7×7 convolution kernels.
The model training module is configured to input the training sample set into the initial text image recognition model for training to obtain the text image recognition model.
For specific limitations of the hybrid-convolution-based text image recognition device, reference may be made to the limitations of the hybrid-convolution-based text image recognition method above, which are not repeated here. The modules in the device may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in or independent of a processor in the computer device in the form of hardware, or stored in a memory in the computer device in the form of software, so that the processor can call them and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 8. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer equipment is used for storing the data used in the mixed convolution-based character image recognition method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a hybrid convolution based text image recognition method.
In one embodiment, a computer device is provided that includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the computer program to implement the hybrid convolution-based text image recognition method of the above embodiments.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, and the computer program, when executed by a processor, implements the hybrid-convolution-based text image recognition method in the above embodiments.
Those skilled in the art will appreciate that implementing all or part of the above methods may be accomplished by a computer program stored on a non-volatile computer-readable storage medium, which, when executed, may include the flows of the method embodiments described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the division of the functional units and modules described above is illustrated; in practical applications, the functions may be allocated to different functional units and modules as needed, i.e., the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention.

Claims (8)

1. A hybrid convolution-based text image recognition method, comprising the following steps:
acquiring an image to be recognized, wherein the image to be recognized contains at least one character;
acquiring a text image recognition model comprising a mixed convolution kernel, wherein the mixed convolution kernel comprises 3×3 convolution kernels, 5×5 convolution kernels and 7×7 convolution kernels;
inputting the image to be recognized into the text image recognition model, performing convolution feature extraction on the image with each kernel of the mixed convolution kernel of the model, merging the feature maps extracted by the respective convolution kernels, and finally performing feature fusion on the merged feature map to obtain a convolution feature map corresponding to the image to be recognized;
inputting the convolution feature map into a recurrent neural network of the text image recognition model for feature extraction to obtain a sequence feature;
inputting the sequence feature into a fully connected layer of the text image recognition model to obtain a character probability distribution result;
performing error loss calculation on the character probability distribution result with a preset loss function to obtain a character recognition result of the image to be recognized;
wherein inputting the convolution feature map into the recurrent neural network of the text image recognition model for feature extraction to obtain the sequence feature comprises:
performing feature conversion on the convolution feature maps of heights 4, 2 and 1, respectively, to obtain a first transposed feature corresponding to the feature map of height 4, a second transposed feature corresponding to the feature map of height 2, and a third transposed feature corresponding to the feature map of height 1;
inputting the first transposed feature, the second transposed feature and the third transposed feature into the recurrent neural network, and performing global feature extraction on them to obtain a first sequence feature corresponding to the first transposed feature, a second sequence feature corresponding to the second transposed feature and a third sequence feature corresponding to the third transposed feature; and
merging the first sequence feature, the second sequence feature and the third sequence feature to obtain the sequence feature.
2. The hybrid convolution-based text image recognition method according to claim 1, wherein the text image recognition model further includes a 1×1 convolution kernel, and inputting the image to be recognized into the text image recognition model to obtain the convolution feature map corresponding to the image to be recognized comprises:
inputting the image to be recognized into the text image recognition model, and performing convolution feature extraction on it with the mixed convolution kernel to obtain a feature map corresponding to each type of convolution kernel in the mixed convolution kernel;
performing a feature map merging operation on the feature maps corresponding to each type of convolution kernel to obtain a merged feature map; and
performing a feature fusion operation on the merged feature map with the 1×1 convolution kernel to obtain the convolution feature map.
3. The hybrid convolution-based text image recognition method according to claim 1, further comprising, before inputting the image to be recognized into the text image recognition model:
determining the maximum sequence length corresponding to the image to be recognized according to the width of the image to be recognized, and determining the convolution step length of the mixed convolution kernel of the text image recognition model according to the maximum sequence length;
wherein inputting the image to be recognized into the text image recognition model and performing convolution feature extraction on it with the mixed convolution kernel to obtain a feature map corresponding to each type of convolution kernel in the mixed convolution kernel comprises:
after the image to be recognized is input into the text image recognition model, moving the various convolution kernels in the mixed convolution kernel according to the convolution step length, and performing convolution feature extraction on the image with the moved convolution kernels to obtain the feature maps corresponding to the various convolution kernels.
4. The hybrid convolution-based text image recognition method according to claim 1, further comprising, before acquiring the text image recognition model comprising the mixed convolution kernel:
acquiring a training sample set, wherein the training sample set includes text sample images and each text sample image contains at least one character; and
inputting the training sample set into an initial text image recognition model for training to obtain the text image recognition model.
5. A hybrid convolution-based text image recognition device, comprising:
an image acquisition module configured to acquire an image to be recognized, wherein the image to be recognized contains at least one character;
a model acquisition module configured to acquire a text image recognition model comprising a mixed convolution kernel, wherein the mixed convolution kernel comprises 3×3 convolution kernels, 5×5 convolution kernels and 7×7 convolution kernels;
a convolution recognition module configured to input the image to be recognized into the text image recognition model, perform convolution feature extraction on the image with each kernel of the mixed convolution kernel, merge the feature maps extracted by the respective convolution kernels, and finally perform feature fusion on the merged feature map to obtain a convolution feature map corresponding to the image to be recognized;
a feature extraction module configured to input the convolution feature map into a recurrent neural network of the text image recognition model for feature extraction to obtain a sequence feature;
a character distribution module configured to input the sequence feature into a fully connected layer of the text image recognition model to obtain a character probability distribution result; and
a loss calculation module configured to perform error loss calculation on the character probability distribution result with a preset loss function to obtain a character recognition result of the image to be recognized;
wherein the feature extraction module includes:
a feature conversion unit configured to perform feature conversion on the convolution feature maps of heights 4, 2 and 1, respectively, to obtain a first transposed feature corresponding to the feature map of height 4, a second transposed feature corresponding to the feature map of height 2, and a third transposed feature corresponding to the feature map of height 1;
a global feature extraction unit configured to input the first transposed feature, the second transposed feature and the third transposed feature into the recurrent neural network and perform global feature extraction on them to obtain a first sequence feature corresponding to the first transposed feature, a second sequence feature corresponding to the second transposed feature and a third sequence feature corresponding to the third transposed feature; and
a sequence feature merging unit configured to merge the first sequence feature, the second sequence feature and the third sequence feature to obtain the sequence feature.
6. The hybrid convolution-based text image recognition device according to claim 5, wherein the mixed convolution kernel further comprises a 1×1 convolution kernel, and the convolution recognition module comprises:
a convolution feature extraction unit configured to input the image to be recognized into the text image recognition model and perform convolution feature extraction on it with the mixed convolution kernel to obtain a feature map corresponding to each type of convolution kernel in the mixed convolution kernel;
a feature merging unit configured to perform a feature map merging operation on the feature maps corresponding to each type of convolution kernel to obtain a merged feature map; and
a feature fusion unit configured to perform a feature fusion operation on the merged feature map with the 1×1 convolution kernel to obtain the convolution feature map.
7. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the hybrid convolution-based text image recognition method according to any one of claims 1 to 4.
8. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the hybrid convolution-based text image recognition method according to any one of claims 1 to 4.
CN202010436538.2A 2020-05-21 2020-05-21 Mixed convolution text image recognition method, device, equipment and storage medium Active CN111666931B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010436538.2A CN111666931B (en) 2020-05-21 2020-05-21 Mixed convolution text image recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010436538.2A CN111666931B (en) 2020-05-21 2020-05-21 Mixed convolution text image recognition method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111666931A CN111666931A (en) 2020-09-15
CN111666931B true CN111666931B (en) 2024-05-28

Family

ID=72384217

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010436538.2A Active CN111666931B (en) 2020-05-21 2020-05-21 Mixed convolution text image recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111666931B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329766A (en) * 2020-10-14 2021-02-05 北京三快在线科技有限公司 Character recognition method and device, electronic equipment and storage medium
CN112418209B (en) * 2020-12-15 2022-09-13 润联软件系统(深圳)有限公司 Character recognition method and device, computer equipment and storage medium
CN113537195B (en) * 2021-07-21 2022-02-08 北京数美时代科技有限公司 Image text recognition method and system and electronic equipment
CN115601752A (en) * 2022-10-26 2023-01-13 维沃移动通信有限公司(Cn) Character recognition method, character recognition device, electronic equipment and medium
CN118365968B (en) * 2024-06-18 2024-09-06 中山市人民医院 Prostate cancer image recognition method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059694A (en) * 2019-04-19 2019-07-26 山东大学 The intelligent identification Method of lteral data under power industry complex scene
CN110659648A (en) * 2019-09-27 2020-01-07 北京猎户星空科技有限公司 Character recognition method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156807B (en) * 2015-04-02 2020-06-02 华中科技大学 Training method and device of convolutional neural network model

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059694A (en) * 2019-04-19 2019-07-26 山东大学 The intelligent identification Method of lteral data under power industry complex scene
CN110659648A (en) * 2019-09-27 2020-01-07 北京猎户星空科技有限公司 Character recognition method and device

Also Published As

Publication number Publication date
CN111666931A (en) 2020-09-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant