CN111666931A - Character image recognition method, device, equipment and storage medium based on mixed convolution

Info

Publication number: CN111666931A
Application number: CN202010436538.2A
Authority: CN
Inventors: 何胜; 喻宁; 冯晶凌; 柳阳
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2020-05-21
Filing date: 2020-05-21
Publication date: 2020-09-15
Anticipated expiration: 2040-05-21
Also published as: CN111666931B

Abstract

The invention relates to artificial intelligence and provides a character image recognition method, device, computer equipment and storage medium based on mixed convolution. The method comprises: acquiring an image to be recognized; acquiring a character image recognition model containing a mixed convolution kernel; inputting the image to be recognized into the character image recognition model to obtain a convolution feature map corresponding to the image to be recognized; inputting the convolution feature map into a recurrent neural network of the character image recognition model for feature extraction to obtain sequence features; inputting the sequence features into a fully connected layer of the character image recognition model to obtain a character probability distribution result; and performing error loss calculation on the character probability distribution result using a preset loss function to obtain a character recognition result of the image to be recognized. Because the mixed convolution kernel is used in the character image recognition model to extract the convolution features, adjacent character areas are fully learned, which improves the accuracy and coverage of character recognition.

Description

Character image recognition method, device, equipment and storage medium based on mixed convolution

Technical Field

The invention relates to the technical field of character recognition, in particular to a character image recognition method, a device, equipment and a storage medium based on mixed convolution.

Background

With the development of artificial intelligence technology, the field of natural language processing (such as character detection, character recognition, speech recognition and text translation) has also developed rapidly. Character detection and character recognition are widely applied in guideboard detection and recognition, scene text translation, and book text recognition and extraction.

In the prior art, most character detection and character recognition methods are implemented with a convolutional neural network. However, such a network is prone to errors when recognizing special fonts, such as artistic characters, handwritten fonts or irregular fonts. Moreover, a general convolutional neural network can only focus on local characters within a small coverage area, so the recognition range is small and recognition efficiency is reduced.

Disclosure of Invention

The embodiments of the invention provide a character image recognition method, device, equipment and storage medium based on mixed convolution, aiming to solve the problem of high character recognition error.

A character image recognition method based on mixed convolution comprises the following steps:

acquiring an image to be recognized, wherein the image to be recognized comprises at least one character;

acquiring a character image recognition model containing a mixed convolution kernel, wherein the mixed convolution kernel comprises a 3 × 3 convolution kernel, a 5 × 5 convolution kernel and a 7 × 7 convolution kernel;

inputting the image to be recognized into the character image recognition model to obtain a convolution feature map corresponding to the image to be recognized;

inputting the convolution feature map into a recurrent neural network of the character image recognition model for feature extraction to obtain sequence features;

inputting the sequence features into a fully connected layer of the character image recognition model to obtain a character probability distribution result;

and performing error loss calculation on the character probability distribution result using a preset loss function to obtain a character recognition result of the image to be recognized.

A character image recognition device based on mixed convolution comprises:

the image acquisition module is used for acquiring an image to be recognized, wherein the image to be recognized comprises at least one character;

the model acquisition module is used for acquiring a character image recognition model containing a mixed convolution kernel, wherein the mixed convolution kernel comprises a 3 × 3 convolution kernel, a 5 × 5 convolution kernel and a 7 × 7 convolution kernel;

the convolution recognition module is used for inputting the image to be recognized into the character image recognition model to obtain a convolution feature map corresponding to the image to be recognized;

the feature extraction module is used for inputting the convolution feature map into a recurrent neural network of the character image recognition model for feature extraction to obtain sequence features;

the character distribution module is used for inputting the sequence features into a fully connected layer of the character image recognition model to obtain a character probability distribution result;

and the loss calculation module is used for performing error loss calculation on the character probability distribution result using a preset loss function to obtain a character recognition result of the image to be recognized.

A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the above character image recognition method based on mixed convolution when executing the computer program.

A computer-readable storage medium storing a computer program which, when executed by a processor, implements the above character image recognition method based on mixed convolution.

According to the character image recognition method, device, equipment and storage medium based on mixed convolution, an image to be recognized is acquired, the image comprising at least one character; a character image recognition model containing a mixed convolution kernel is acquired, the mixed convolution kernel including a 3 × 3 convolution kernel, a 5 × 5 convolution kernel and a 7 × 7 convolution kernel; the image to be recognized is input into the character image recognition model to obtain a convolution feature map corresponding to the image to be recognized; the convolution feature map is input into a recurrent neural network of the character image recognition model for feature extraction to obtain sequence features; the sequence features are input into a fully connected layer of the character image recognition model to obtain a character probability distribution result; and error loss calculation is performed on the character probability distribution result using a preset loss function to obtain a character recognition result of the image to be recognized. Through these steps, the mixed convolution kernel is used in the character image recognition model for convolution feature extraction. The traditional approach, which uses only a single type of convolution kernel, focuses only on local characters within a small coverage area; when the image to be recognized contains artistic or handwritten fonts, this small coverage leads to low character recognition accuracy. By using mixed convolution kernels and exploiting the large receptive field of large convolution kernels, adjacent character areas can be fully learned, which improves the accuracy and coverage of character recognition.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a schematic diagram of an application environment of a character image recognition method based on mixed convolution according to an embodiment of the present invention;

FIG. 2 is a flowchart of a character image recognition method based on mixed convolution according to an embodiment of the present invention;

FIG. 3 is a flowchart of step S13 in the character image recognition method based on mixed convolution according to an embodiment of the present invention;

FIG. 4 is a flowchart of step S14 in the character image recognition method based on mixed convolution according to an embodiment of the present invention;

FIG. 5 is a schematic block diagram of a character image recognition device based on mixed convolution according to an embodiment of the present invention;

FIG. 6 is a schematic block diagram of the convolution recognition module in the character image recognition device based on mixed convolution according to an embodiment of the present invention;

FIG. 7 is a schematic block diagram of the feature extraction module in the character image recognition device based on mixed convolution according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of a computer device according to an embodiment of the invention.

Detailed Description

The embodiment of the invention provides a character image recognition method based on mixed convolution, which can be applied to the application environment shown in FIG. 1. Specifically, the method is applied to a character image recognition system based on mixed convolution, which comprises a client and a server as shown in FIG. 1. The client and the server communicate through a network, which serves to address the data island problem. The client, also called the user side, refers to a program that corresponds to the server and provides local services to the user. The client may be installed on, but is not limited to, various personal computers, laptops, smartphones, tablets and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster consisting of a plurality of servers.

In an embodiment, as shown in FIG. 2, a character image recognition method based on mixed convolution is provided. Taking its application to the server in FIG. 1 as an example, the method includes the following steps:

S11: acquiring an image to be recognized, wherein the image to be recognized comprises at least one character.

S12: acquiring a character image recognition model containing a mixed convolution kernel; the hybrid convolution kernel contains 3 × 3 convolution kernels, 5 × 5 convolution kernels, and 7 × 7 convolution kernels.

The image to be recognized is an image which is waiting to be recognized and contains at least one character. The mixed convolution kernel includes a plurality of convolution kernels, and specifically, the mixed convolution kernel may include a 3 × 3 convolution kernel, a 5 × 5 convolution kernel, a 7 × 7 convolution kernel, or an 11 × 11 convolution kernel, or the like. The character image recognition model is a model for recognizing an image including characters, and a mixed convolution kernel is used as a convolution kernel of the character image recognition model.

Specifically, an original image is obtained, character detection is carried out on the original image, and if the original image contains at least one character, the original image is used as an image to be identified; after the image to be recognized is acquired, a character image recognition model is acquired, and the convolution kernel of the character image recognition model adopts a mixed convolution kernel which comprises a 3 × 3 convolution kernel, a 5 × 5 convolution kernel, a 7 × 7 convolution kernel and the like.

S13: inputting the image to be recognized into the character image recognition model to obtain a convolution feature map corresponding to the image to be recognized.

The convolution feature map is a feature map containing the character features of the image to be recognized.

Specifically, after the image to be recognized and the character image recognition model are obtained, the image to be recognized is input into the character image recognition model. Each convolution kernel in the mixed convolution kernel of the model performs convolution feature extraction on the image to be recognized, the feature maps produced by the individual convolution kernels are merged, and feature fusion is finally performed on the merged feature map to obtain the convolution feature map corresponding to the image to be recognized.

Furthermore, the mixed convolution kernel effectively exploits the large receptive field of large convolution kernels, so that adjacent character areas can be fully learned rather than only local areas. Illustratively, assume that the original width of the image to be recognized is 800 and that, during convolution feature extraction, the width is reduced from 800 to 200, a reduction by a factor of 4. If feature extraction then continues with a conventional single 3 × 3 convolution kernel, the kernel covers an area of about 12 × 12 pixels of the original image, and 12 pixels may not span adjacent characters at the same time. A 7 × 7 convolution kernel covers about 28 × 28 pixels, so in most cases a sufficiently large area is sensed and more information is acquired.
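The arithmetic in this example can be checked with a short sketch. It assumes, as the example does, that a k × k kernel applied to a feature map downsampled by a factor d covers roughly k·d pixels of the original image along each axis. The function name is illustrative only and is not part of the embodiment.

```python
def approx_coverage(kernel_size: int, original_width: int, feature_width: int) -> int:
    """Approximate side length, in original-image pixels, covered by one kernel.

    Assumption: coverage = kernel size x downsampling factor, which ignores
    the receptive field accumulated by earlier layers.
    """
    if original_width % feature_width != 0:
        raise ValueError("feature width must divide the original width")
    downsample = original_width // feature_width  # 800 // 200 = 4
    return kernel_size * downsample


for k in (3, 5, 7):
    side = approx_coverage(k, original_width=800, feature_width=200)
    print(f"{k} x {k} kernel covers about {side} x {side} pixels")
```

Under this approximation the 3 × 3 and 7 × 7 kernels give the 12 × 12 and 28 × 28 areas quoted in the example.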

The receptive field can also be enlarged by stacking multiple convolution layers or pooling layers with a stride greater than 1. However, overusing such layers makes the width of the extracted convolution feature map too small, which impairs the recurrent neural network's feature extraction and prediction at each position in the subsequent steps.

S14: inputting the convolution feature map into a recurrent neural network of the character image recognition model for feature extraction to obtain sequence features.

The recurrent neural network is used to extract sequence-context features from the convolution feature map. The sequence features are feature maps carrying sequence-context features.

Specifically, after the image to be recognized is input into the character image recognition model and the corresponding convolution feature map is obtained, the convolution feature map is input into the recurrent neural network, which extracts sequence-context features from it to obtain the sequence features.

Alternatively, the recurrent neural network may have an RNN, LSTM, GRU or BiLSTM structure. Preferably, the recurrent neural network in the above embodiment adopts a BiLSTM structure. A BiLSTM combines a forward LSTM and a backward LSTM and better captures bidirectional semantic dependence, thereby avoiding semantic errors.

S15: inputting the sequence features into a fully connected layer of the character image recognition model to obtain a character probability distribution result.

The character probability distribution result is the probability distribution over characters at each position.

After the convolution feature map has been input into the recurrent neural network for feature extraction and the sequence features obtained, the sequence features are input into the fully connected layer of the character image recognition model to obtain the probability of each character at each position in the image to be recognized, and thus the character probability distribution result.

S16: performing error loss calculation on the character probability distribution result using a preset loss function to obtain a character recognition result of the image to be recognized.

The preset loss function measures the degree of difference between the input data and the data actually output after the input data passes through the neural network. The character recognition result is the result of recognizing the characters in the image to be recognized.

Specifically, after the sequence features are input into the fully connected layer of the character image recognition model and the character probability distribution result is obtained, error loss calculation is performed on the character probability distribution result using the preset loss function to obtain the character recognition result of the image to be recognized.

Alternatively, the preset loss function may be a CTC function, an Attention function, or the like. Preferably, the CTC function is employed in the above embodiment. For example, if the characters in the image to be recognized are handwritten, they are often joined together or spaced too far apart. When the word "to" is handwritten, a writer who connects t and o may produce the result "t-o"; or a character may occupy so much space that the position-wise probability distribution recognizes the same character several times, for example "to" is recognized as "ttoooo". Without processing by the loss function, the character recognition result obtained would be the raw result, such as "t-o". With the loss calculation, the result is finally decoded to obtain the final character recognition result of the image to be recognized, "to", so that non-standard handwriting does not increase the error rate of the character recognition result.
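The decoding step in this example can be sketched as follows. The text does not spell out the decoding rule, so the sketch assumes the usual CTC convention: consecutive repeated symbols are merged first, and the blank symbol is then removed. The blank is written "-" to match the "t-o" example, and the function name is hypothetical.

```python
from itertools import groupby

BLANK = "-"  # assumed blank symbol, chosen to match the "t-o" example


def ctc_collapse(raw: str, blank: str = BLANK) -> str:
    """Merge consecutive repeated symbols, then drop blank symbols."""
    merged = (symbol for symbol, _ in groupby(raw))
    return "".join(symbol for symbol in merged if symbol != blank)


print(ctc_collapse("ttoooo"))  # to
print(ctc_collapse("t-o"))     # to
print(ctc_collapse("to-o"))    # too
```

The last call shows why the blank matters under this convention: a genuinely doubled letter survives only when a blank separates its two occurrences.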

In this embodiment, a mixed convolution kernel is used in the character image recognition model for convolution feature extraction. The conventional approach, which uses only a single convolution kernel, focuses only on local characters within a small coverage area; when the image to be recognized contains artistic or handwritten fonts, this small coverage leads to low character recognition accuracy. By using mixed convolution kernels and exploiting the large receptive field of large convolution kernels, adjacent character areas can be fully learned, which improves the accuracy and coverage of character recognition.

In an embodiment, as shown in FIG. 3, step S13, that is, inputting the image to be recognized into the character image recognition model to obtain a convolution feature map corresponding to the image to be recognized, specifically includes the following steps:

S131: inputting the image to be recognized into the character image recognition model, and performing convolution feature extraction on the image to be recognized using the mixed convolution kernel to obtain a feature map corresponding to each type of convolution kernel in the mixed convolution kernel.

Specifically, after an image to be recognized and a character image recognition model are acquired, the image to be recognized is input into the character image recognition model; the character image recognition model comprises 3 × 3 convolution kernels, 5 × 5 convolution kernels and 7 × 7 convolution kernels, convolution feature extraction is carried out on the image to be recognized respectively through the 3 × 3 convolution kernels, the 5 × 5 convolution kernels and the 7 × 7 convolution kernels, and a feature map corresponding to the 3 × 3 convolution kernels, a feature map corresponding to the 5 × 5 convolution kernels and a feature map corresponding to the 7 × 7 convolution kernels are obtained.

Exemplarily, it is assumed that the character image recognition model includes 32 3 × 3 convolution kernels, 32 5 × 5 convolution kernels and 32 7 × 7 convolution kernels, and the convolution kernels are used to perform convolution feature extraction on the image to be recognized, so as to obtain 3 feature maps of 40 × 32.

S132: merging the feature maps corresponding to each type of convolution kernel to obtain a merged feature map.

The merged feature map is a feature map generated by merging a plurality of feature maps, usually the convolution feature maps produced by several different convolution kernels.

Specifically, after the mixed convolution kernel is used for performing convolution feature extraction on the image to be identified to obtain a feature map corresponding to each type of convolution kernel in the mixed convolution kernel, combining the feature map corresponding to the 3 × 3 convolution kernel, the feature map corresponding to the 5 × 5 convolution kernel, and the feature map corresponding to the 7 × 7 convolution kernel to obtain a combined feature map.

Exemplarily, it is assumed that 32 3 × 3 convolution kernels, 32 5 × 5 convolution kernels, and 32 7 × 7 convolution kernels are respectively used to perform convolution feature extraction on an image to be recognized, so as to obtain 3 feature maps of 40 × 32, and then, the 3 feature maps of 40 × 32 are merged into a merged feature map of 40 × 96.

S133: performing feature fusion on the merged feature map using a 1 × 1 convolution kernel to obtain a convolution feature map.

The convolution feature map is a feature map obtained by performing convolution feature extraction, combination and fusion on the image.

Specifically, after the feature maps corresponding to each type of convolution kernel are merged to obtain the merged feature map, feature fusion is performed on the merged feature map using a 1 × 1 convolution kernel to obtain the convolution feature map.

Illustratively, assume that after the 3 feature maps of 40 × 32 are merged into a merged feature map of 40 × 96, the extracted features are further fused on the merged feature map of 40 × 96 using 64 1 × 1 convolution kernels, resulting in a convolution feature map of 40 × 64.
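The shape bookkeeping of steps S132 and S133 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the patented implementation: "40 × 32" is read as 40 positions with 32 channels each, random numbers stand in for the outputs of the 3 × 3, 5 × 5 and 7 × 7 branches, merging is taken to be channel-wise concatenation, and the helper names are hypothetical.

```python
import random
from itertools import chain


def concat_channels(feature_maps):
    """Concatenate the channel vectors of several maps, position by position."""
    return [list(chain.from_iterable(vectors)) for vectors in zip(*feature_maps)]


def pointwise_conv(feature_map, weights):
    """1 x 1 convolution: the same linear map applied at every position."""
    return [
        [sum(w * x for w, x in zip(row, vector)) for row in weights]
        for vector in feature_map
    ]


rng = random.Random(0)
positions, per_branch, fused = 40, 32, 64

# Stand-ins for the feature maps produced by the three kernel sizes.
branches = [
    [[rng.random() for _ in range(per_branch)] for _ in range(positions)]
    for _ in range(3)
]
merged = concat_channels(branches)  # 40 x 96

# 64 kernels of size 1 x 1, each with one weight per merged channel.
weights = [[rng.gauss(0.0, 0.1) for _ in range(3 * per_branch)] for _ in range(fused)]
output = pointwise_conv(merged, weights)  # 40 x 64

print(len(merged), len(merged[0]))
print(len(output), len(output[0]))
```

A 1 × 1 kernel mixes channels without looking at neighbouring positions, which is why it changes only the channel count (96 to 64) and leaves the 40 positions untouched.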

In this embodiment, each kernel in the mixed convolution kernel performs convolution feature extraction on the image to be recognized. When the characters in the image are distorted, or are artistic or handwritten fonts, the 5 × 5 and 7 × 7 convolution kernels can better sense the character information on both adjacent sides, which improves the accuracy of character recognition.

In an embodiment, as shown in FIG. 4, step S14, that is, inputting the convolution feature map into a recurrent neural network of the character image recognition model for feature extraction to obtain sequence features, further includes the following steps:

S141: performing feature conversion on the convolution feature maps with heights of 4, 2 and 1 respectively, to obtain a first transposed feature corresponding to the convolution feature map with height 4, a second transposed feature corresponding to the convolution feature map with height 2, and a third transposed feature corresponding to the convolution feature map with height 1.

The first, second and third transposed features are obtained by performing feature conversion on the original convolution feature maps and are, in essence, sequence-type features.

Specifically, while the character image recognition model extracts the convolution features of the image to be recognized, that is, while convolution reduces the height of the image to be recognized, feature conversion is performed on the convolution feature maps with heights of 4, 2 and 1 respectively, so that the first transposed feature corresponding to the map with height 4, the second transposed feature corresponding to the map with height 2 and the third transposed feature corresponding to the map with height 1 are obtained.

Exemplarily, while the image to be recognized is input into the character image recognition model and its convolution features are extracted, assume that the currently extracted intermediate feature map is 8 × 200 × C. As subsequent convolutions reduce the height of the convolution feature map to 4, 2 and 1 in turn, feature conversion is performed on the map with height 4 to obtain a first transposed feature of 200 × 4 × C, on the map with height 2 to obtain a second transposed feature of 200 × 2 × C, and on the map with height 1 to obtain a third transposed feature of 200 × 1 × C.

In 8 × 200 × C, 8 denotes the height of the image, 200 denotes the width of the image, and C denotes the number of characteristic channels of the image.
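One way to picture the feature conversion is as a reshape that turns a height × width × channel map into one feature vector per horizontal position. The sketch below assumes that reading: a map of height h, width 200 and C channels becomes a sequence of 200 steps with h·C values each. The function name, the channel count and the ordering of values inside each step are assumptions, not details taken from the embodiment.

```python
def to_sequence(feature_map):
    """Turn an H x W x C nested list into W steps of H*C features each."""
    height = len(feature_map)
    width = len(feature_map[0])
    return [
        [value for row in range(height) for value in feature_map[row][col]]
        for col in range(width)
    ]


def zeros(height, width, channels):
    """Build an all-zero H x W x C map as nested lists."""
    return [[[0.0] * channels for _ in range(width)] for _ in range(height)]


C = 16  # hypothetical channel count
for h in (4, 2, 1):
    sequence = to_sequence(zeros(h, 200, C))
    print(h, len(sequence), len(sequence[0]))
```

With C = 16 the three maps yield sequences of 200 steps carrying 64, 32 and 16 features respectively, so every scale presents the same number of steps to the recurrent network.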

S142: inputting the first, second and third transposed features into a recurrent neural network and performing global feature extraction on them to obtain a first sequence feature corresponding to the first transposed feature, a second sequence feature corresponding to the second transposed feature and a third sequence feature corresponding to the third transposed feature.

A recurrent neural network takes sequence data as input, recurses along the direction in which the sequence evolves, and has all of its nodes connected in a chain. Here it is used to extract sequence-context features from the convolution feature map.

Specifically, after the first, second and third transposed features are obtained, they are input into the recurrent neural network, and global feature extraction is performed on each of them to obtain the first sequence feature, the second sequence feature and the third sequence feature respectively.

Optionally, the structure of the recurrent neural network includes RNN, LSTM, BiLSTM, GRU, and the like. Preferably, a recurrent neural network with a BiLSTM structure is employed in this embodiment. Illustratively, after the first transposed feature of 200 × 4 × C, the second transposed feature of 200 × 2 × C and the third transposed feature of 200 × 1 × C are obtained, global feature extraction is performed on each of them using a recurrent neural network with a BiLSTM structure to obtain 3 sequence features of 200 × H; that is, the first, second and third sequence features are all 200 × H.

In 200 × H, H represents the size of the hidden layer in the BiLSTM structure.

S143: combining the first sequence feature, the second sequence feature and the third sequence feature to obtain the sequence features.

The sequence features are feature maps carrying sequence-context features.

Specifically, after the first sequence feature, the second sequence feature and the third sequence feature are obtained, the first sequence feature, the second sequence feature and the third sequence feature are subjected to sequence feature combination to obtain the sequence feature.

Exemplarily, it is assumed that a recurrent neural network of a BiLSTM structure is respectively adopted to perform global feature extraction on the first transposition feature, the second transposition feature and the third transposition feature to obtain 3 sequence features of 200 × H, and then sequence features of the first sequence feature, the second sequence feature and the third sequence feature are combined to obtain a sequence feature of 200 × 3H.
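The size given in this example, 200 × 3H, is consistent with joining the three sequence features step by step along the feature axis. The sketch below assumes that interpretation; the hidden size and the function name are hypothetical, and constant vectors stand in for real BiLSTM outputs.

```python
def combine_sequence_features(*features):
    """Join several (steps x H) sequence features into one (steps x k*H) feature."""
    if len({len(feature) for feature in features}) != 1:
        raise ValueError("all sequence features must have the same number of steps")
    return [
        [value for step in steps for value in step]
        for steps in zip(*features)
    ]


H = 8  # hypothetical hidden size
first = [[1.0] * H for _ in range(200)]
second = [[2.0] * H for _ in range(200)]
third = [[3.0] * H for _ in range(200)]

combined = combine_sequence_features(first, second, third)
print(len(combined), len(combined[0]))
```

Each of the 200 steps then carries the context extracted at all three heights, which is what the subsequent fully connected layer receives.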

In this embodiment, feature conversion is performed on the convolution feature maps with heights of 4, 2 and 1, so that multi-scale feature maps supervise the process, instead of convolving directly down to the last scale before making a judgment as in the conventional method. This improves information transfer during convolution and the robustness of the sequence features.

In one embodiment, before step S13, that is, before the image to be recognized is input into the text image recognition model, the method further includes the following steps:

determining the maximum sequence length corresponding to the image to be recognized according to the width of the image to be recognized, and determining the convolution stride of the mixed convolution kernel of the character image recognition model according to the maximum sequence length.

The convolution stride is the distance the convolution kernel moves at each step.

Specifically, after the image to be recognized and the character image recognition model are obtained, the actual maximum sequence length of the image to be recognized is determined according to its width, and the maximum sequence length corresponding to the image is determined according to that actual maximum sequence length; once the maximum sequence length is determined, the convolution stride of the mixed convolution kernel in the character image recognition model is determined according to it.

Illustratively, assume that the width of the image to be recognized is 800 and the actual maximum sequence length is 96; the maximum sequence length is then determined to be 200. Since 800/200 = 4, a power of 2, only two convolution operations with a stride of 2 are needed to obtain a regular feature map when the image to be recognized is input into the character image recognition model for convolution feature extraction.
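The count of stride-2 convolutions in this example follows from the downsampling factor. The sketch below takes the maximum sequence length as given, since the text does not state how 200 is derived from 96, and assumes that only stride-2 convolutions are used, so the factor must be a power of 2. The function name is hypothetical.

```python
def stride2_conv_count(image_width: int, max_sequence_length: int) -> int:
    """Number of stride-2 convolutions that reduce the width to the sequence length."""
    if image_width <= 0 or max_sequence_length <= 0:
        raise ValueError("width and sequence length must be positive")
    if image_width % max_sequence_length != 0:
        raise ValueError("sequence length must divide the image width")
    factor = image_width // max_sequence_length  # 800 // 200 = 4
    if factor & (factor - 1) != 0:
        raise ValueError("downsampling factor must be a power of 2")
    return factor.bit_length() - 1  # log2(factor)


print(stride2_conv_count(800, 200))  # 2
```

A width of 800 with a maximum sequence length of 300 would be rejected here, which reflects the point of the example: choosing a sequence length that divides the width evenly keeps the feature map regular.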

The step of inputting the image to be recognized into the character image recognition model and performing convolution feature extraction on it with the mixed convolution kernel, to obtain a feature map corresponding to each type of convolution kernel in the mixed convolution kernel, comprises the following:

After the image to be recognized is input into the character image recognition model, each type of convolution kernel in the mixed convolution kernel is moved according to the convolution step length, and convolution feature extraction is performed on the image to be recognized through the moved convolution kernels to obtain the feature map corresponding to each type of convolution kernel.

Specifically, after the maximum sequence length corresponding to the image to be recognized has been determined according to the width of the image, and the convolution step length of the mixed convolution kernel has been determined according to the maximum sequence length, the image to be recognized is input into the character image recognition model. Each type of convolution kernel in the mixed convolution kernel moves according to the convolution step length and, while moving, extracts convolution features from the image to be recognized, so that a feature map corresponding to each type of convolution kernel is obtained.
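As a rough illustration of how kernels of different sizes can move with the same step length and still yield feature maps of the same width, the following Python sketch applies the standard convolution output-size formula. The zero padding of (k - 1) / 2 per side is an assumption made for this sketch; the embodiment does not state which padding is used, and the function names are hypothetical.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard convolution output-size formula (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def widths_after_stride2_layers(width: int, kernels=(3, 5, 7), layers: int = 2) -> dict:
    """Width produced by each kernel size after `layers` stride-2 convolutions.

    Assumes zero padding of (k - 1) // 2 on each side, which is not stated
    in the source text.
    """
    result = {}
    for k in kernels:
        w = width
        for _ in range(layers):
            w = conv_output_size(w, k, stride=2, padding=(k - 1) // 2)
        result[k] = w
    return result


# Width 800 from the example, two convolutions with a step length of 2
print(widths_after_stride2_layers(800))  # {3: 200, 5: 200, 7: 200}
```

Under this padding assumption the 3 × 3, 5 × 5 and 7 × 7 kernels all reduce a width of 800 to 200, so their feature maps line up column by column.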

Conventionally, the maximum sequence length parameter is set arbitrarily. In this embodiment, it is determined according to the width of the image to be recognized, which yields a regular maximum sequence length parameter, so that more regular information can be acquired during convolution feature extraction and the accuracy of character recognition is improved.

In one embodiment, before step S12, that is, before the character image recognition model containing the mixed convolution kernel is obtained, the method further includes the following steps:

A training sample set is obtained, wherein the training sample set comprises character sample images, and each character sample image comprises at least one character.

An initial character image recognition model containing a mixed convolution kernel is obtained, the mixed convolution kernel comprising a 3 × 3 convolution kernel, a 5 × 5 convolution kernel and a 7 × 7 convolution kernel.

The training sample set is the set of character sample images used for training. The character sample images are character images used for training, and every character sample image contains at least one character. The initial character image recognition model is a basic model for recognizing images containing characters.

Specifically, character sample images from a plurality of different scenes are obtained and character detection is performed on all of them; if a character sample image contains at least one character, it is stored in the training sample set. After the training sample set is determined, an initial character image recognition model is obtained, whose convolution kernel is a mixed convolution kernel comprising a 3 × 3 convolution kernel, a 5 × 5 convolution kernel, a 7 × 7 convolution kernel and the like.

The training sample set is input into the initial character image recognition model for training to obtain the character image recognition model.

The character image recognition model is the model obtained by training the initial character image recognition model on the training sample set.

The process of establishing the character image recognition model comprises the following steps: inputting the training sample set into the initial character image recognition model; defining the moving step length of the mixed convolution kernel in the initial character image recognition model according to the character sample images in the training sample set; defining the structures of the convolutional neural network, the recurrent neural network and the full connection layer in the initial character image recognition model; defining a forward propagation process and iteratively training the initial character image recognition model with it to obtain a trained model; and testing and verifying the trained model to obtain the character image recognition model.

It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an execution order; the execution order of each process should be determined by its function and inherent logic, and should not constitute any limitation on the implementation of the embodiments of the present invention.

In one embodiment, a hybrid convolution-based character image recognition apparatus is provided, which corresponds one-to-one to the hybrid convolution-based character image recognition method in the above embodiments. As shown in fig. 5, the apparatus includes an image acquisition module 11, a model acquisition module 12, a convolution recognition module 13, a feature extraction module 14, a character distribution module 15, and a loss calculation module 16. The functional modules are described in detail as follows:

The image acquisition module 11 is configured to obtain an image to be recognized, where the image to be recognized includes at least one character.

The model acquisition module 12 is configured to obtain the character image recognition model containing a mixed convolution kernel, wherein the mixed convolution kernel contains a 3 × 3 convolution kernel, a 5 × 5 convolution kernel and a 7 × 7 convolution kernel.

The convolution recognition module 13 is configured to input the image to be recognized into the character image recognition model to obtain a convolution feature map corresponding to the image to be recognized.

The feature extraction module 14 is configured to input the convolution feature map into a recurrent neural network of the character image recognition model for feature extraction to obtain a sequence feature.

The character distribution module 15 is configured to input the sequence feature into a full connection layer of the character image recognition model to obtain a character probability distribution result.

The loss calculation module 16 is configured to perform error loss calculation on the character probability distribution result by using a preset loss function to obtain a character recognition result of the image to be recognized.

Preferably, as shown in fig. 6, the convolution recognition module 13 includes the following units:

The convolution feature extraction unit 131 is configured to input the image to be recognized into the character image recognition model and perform convolution feature extraction on the image to be recognized by using the mixed convolution kernel, to obtain a feature map corresponding to each type of convolution kernel in the mixed convolution kernel.

The feature merging unit 132 is configured to merge the feature maps corresponding to each type of convolution kernel to obtain a merged feature map.

The feature fusion unit 133 is configured to perform feature fusion on the merged feature map by using a 1 × 1 convolution kernel to obtain a convolution feature map.
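The three units above can be sketched in plain Python as a minimal, single-channel illustration. Several details are assumptions made only for this sketch and are not taken from the embodiment: zero padding of (k - 1) / 2, merging by stacking the feature maps as channels, a single output channel for the 1 × 1 fusion with no bias term, and averaging kernels standing in for learned weights. The function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_same(image: Matrix, kernel: Matrix, stride: int = 1) -> Matrix:
    """Single-channel 2D cross-correlation with zero padding of (k - 1) // 2."""
    k = len(kernel)
    pad = (k - 1) // 2
    h, w = len(image), len(image[0])
    out = []
    for top in range(0, h, stride):
        row = []
        for left in range(0, w, stride):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    y, x = top + i - pad, left + j - pad
                    if 0 <= y < h and 0 <= x < w:  # outside the image counts as zero
                        acc += image[y][x] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


def mixed_convolution(image: Matrix, kernels: List[Matrix],
                      fusion_weights: List[float], stride: int = 1) -> Matrix:
    """Extract one map per kernel type, merge them, then fuse with a 1 x 1 kernel."""
    # Step 1: one feature map per kernel type (3 x 3, 5 x 5, 7 x 7, ...).
    feature_maps = [conv2d_same(image, kernel, stride) for kernel in kernels]
    # Step 2: merge by stacking the maps as channels (assumed merge operation).
    merged = feature_maps
    # Step 3: a 1 x 1 convolution is a per-pixel weighted sum over the channels.
    h, w = len(merged[0]), len(merged[0][0])
    return [[sum(wt * fm[y][x] for wt, fm in zip(fusion_weights, merged))
             for x in range(w)] for y in range(h)]


def averaging_kernel(k: int) -> Matrix:
    """Stand-in for a learned k x k kernel."""
    return [[1.0 / (k * k)] * k for _ in range(k)]


image = [[1.0] * 8 for _ in range(8)]
kernels = [averaging_kernel(k) for k in (3, 5, 7)]
fused = mixed_convolution(image, kernels, fusion_weights=[1 / 3, 1 / 3, 1 / 3])
print(len(fused), len(fused[0]))  # 8 8
```

Because every kernel uses the same step length and a padding matched to its size, the three feature maps have identical height and width, which is what makes the per-pixel 1 × 1 fusion well defined.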

Preferably, as shown in fig. 7, the feature extraction module 14 includes the following units:

The feature conversion unit 141 is configured to perform feature conversion on the convolution feature maps with heights of 4, 2, and 1, respectively, to obtain a first transposition feature corresponding to the convolution feature map with a height of 4, a second transposition feature corresponding to the convolution feature map with a height of 2, and a third transposition feature corresponding to the convolution feature map with a height of 1.

The global feature extraction unit 142 is configured to input the first transposition feature, the second transposition feature, and the third transposition feature into the recurrent neural network and perform global feature extraction on them, to obtain a first sequence feature corresponding to the first transposition feature, a second sequence feature corresponding to the second transposition feature, and a third sequence feature corresponding to the third transposition feature.

The sequence feature merging unit 143 is configured to merge the first sequence feature, the second sequence feature, and the third sequence feature to obtain the sequence feature.
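The data flow through these units can be illustrated with a short Python sketch. It covers only the feature conversion and the merging; the recurrent neural network step is omitted. The sketch rests on assumptions that the embodiment does not confirm: feature conversion is taken to be the common map-to-sequence rearrangement (each column of a C × H × W map becomes one time step of C·H values), the three maps are assumed to share the same width, and merging is assumed to be concatenation per time step. All names are hypothetical.

```python
from typing import List

FeatureMap = List[List[List[float]]]   # indexed as [channel][row][column]
FeatureSequence = List[List[float]]    # indexed as [time step][feature]


def to_sequence(fmap: FeatureMap) -> FeatureSequence:
    """Rearrange a C x H x W map into W time steps of C * H features each."""
    channels, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[fmap[c][y][x] for c in range(channels) for y in range(height)]
            for x in range(width)]


def merge_sequences(sequences: List[FeatureSequence]) -> FeatureSequence:
    """Concatenate the feature vectors of all sequences at each time step."""
    lengths = {len(seq) for seq in sequences}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same number of time steps")
    merged = []
    for t in range(lengths.pop()):
        step = []
        for seq in sequences:
            step.extend(seq[t])
        merged.append(step)
    return merged


def constant_map(channels: int, height: int, width: int, value: float) -> FeatureMap:
    """Build a toy feature map filled with a single value."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


# Toy feature maps with heights 4, 2 and 1, two channels and width 5 each.
maps = [constant_map(2, h, 5, float(h)) for h in (4, 2, 1)]
# A recurrent network would process each converted sequence here; it is omitted.
sequences = [to_sequence(m) for m in maps]
merged = merge_sequences(sequences)
print(len(merged), len(merged[0]))  # 5 14
```

With two channels, the heights 4, 2 and 1 contribute 8, 4 and 2 values per time step, giving 14 merged features at each of the 5 time steps.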

Preferably, the hybrid convolution-based character image recognition apparatus further comprises the following modules:

The convolution step length determining module is configured to determine the maximum sequence length corresponding to the image to be recognized according to the width of the image to be recognized, and to determine the convolution step length of the mixed convolution kernel of the character image recognition model according to the maximum sequence length.

The training sample set acquisition module is configured to acquire a training sample set, wherein the training sample set comprises character sample images, and each character sample image comprises at least one character.

The initial model acquisition module is configured to acquire an initial character image recognition model containing a mixed convolution kernel, wherein the mixed convolution kernel comprises a 3 × 3 convolution kernel, a 5 × 5 convolution kernel and a 7 × 7 convolution kernel.

The model training module is configured to input the training sample set into the initial character image recognition model for training to obtain the character image recognition model.

For specific limitations of the hybrid convolution-based character image recognition apparatus, reference may be made to the limitations of the hybrid convolution-based character image recognition method above, which are not repeated here. The modules in the apparatus may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.

In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in fig. 8. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device stores the data used in the hybrid convolution-based character image recognition method. The network interface of the computer device communicates with an external terminal through a network connection. The computer program, when executed by the processor, implements the hybrid convolution-based character image recognition method.

In one embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the hybrid convolution-based text image recognition method in the above embodiments is implemented.

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored; the computer program, when executed by a processor, implements the hybrid convolution-based character image recognition method in the above embodiments.

It will be understood by those skilled in the art that all or part of the processes of the methods in the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims

1. A character and image recognition method based on mixed convolution is characterized by comprising the following steps:

2. The hybrid convolution-based character image recognition method of claim 1, wherein the character image recognition model further comprises a 1 × 1 convolution kernel, and the step of inputting the image to be recognized into the character image recognition model to obtain the convolution feature map corresponding to the image to be recognized comprises:

inputting the image to be recognized into the character image recognition model, and performing convolution feature extraction on the image to be recognized by adopting the mixed convolution kernel to obtain a feature map corresponding to each type of convolution kernel in the mixed convolution kernels;

carrying out feature map merging operation on the feature maps corresponding to the convolution kernels of each type to obtain merged feature maps;

and performing feature fusion operation on the merged feature map by using the 1 × 1 convolution kernel to obtain a convolution feature map.

3. The method for recognizing the character image based on the hybrid convolution of claim 1, wherein the step of inputting the convolution feature map into a recurrent neural network of the character image recognition model for feature extraction to obtain sequence features comprises the steps of:

respectively carrying out feature conversion on the convolution feature maps with the heights of 4, 2 and 1 to obtain a first transposition feature corresponding to the convolution feature map with the height of 4, a second transposition feature corresponding to the convolution feature map with the height of 2 and a third transposition feature corresponding to the convolution feature map with the height of 1;

inputting the first transposition feature, the second transposition feature and the third transposition feature into the recurrent neural network, and performing global feature extraction on the first transposition feature, the second transposition feature and the third transposition feature to obtain a first sequence feature corresponding to the first transposition feature, a second sequence feature corresponding to the second transposition feature and a third sequence feature corresponding to the third transposition feature;

and combining the first sequence feature, the second sequence feature and the third sequence feature to obtain the sequence feature.

4. The hybrid convolution-based character image recognition method of claim 1, wherein before the inputting of the image to be recognized into the character image recognition model, the method further comprises:

determining the maximum sequence length corresponding to the image to be recognized according to the width of the image to be recognized, and determining the convolution step length of the mixed convolution kernel of the character image recognition model according to the maximum sequence length;

and wherein the inputting of the image to be recognized into the character image recognition model and performing convolution feature extraction on the image to be recognized by adopting the mixed convolution kernel, to obtain a feature map corresponding to each type of convolution kernel in the mixed convolution kernel, comprises:

after the image to be recognized is input into the character image recognition model, moving each type of convolution kernel in the mixed convolution kernel according to the convolution step length, and performing convolution feature extraction on the image to be recognized through the moved convolution kernels to obtain the feature map corresponding to each type of convolution kernel.

5. The hybrid convolution-based character image recognition method of claim 1, wherein before the obtaining of the character image recognition model including the mixed convolution kernel, the method further comprises:

acquiring a training sample set, wherein the training sample set comprises character sample images, and each character sample image comprises at least one character;

and inputting the training sample set into an initial character image recognition model for training to obtain the character image recognition model.

6. A hybrid convolution-based text-image recognition apparatus, comprising:

7. The hybrid convolution-based character image recognition apparatus of claim 6, wherein the mixed convolution kernel further includes a 1 × 1 convolution kernel, the convolution identification module including:

a convolution feature extraction unit, configured to input the image to be recognized into the character image recognition model, and perform convolution feature extraction on the image to be recognized by using the mixed convolution kernel to obtain a feature map corresponding to each type of convolution kernel in the mixed convolution kernel;

the characteristic merging unit is used for carrying out characteristic diagram merging operation on the characteristic diagram corresponding to each type of convolution kernel to obtain a merged characteristic diagram;

and the feature fusion unit is used for performing feature fusion operation on the merged feature map by adopting the 1 × 1 convolution kernel to obtain a convolution feature map.

8. The hybrid convolution-based text image recognition apparatus of claim 6 wherein the feature extraction module includes:

the feature conversion unit is used for respectively carrying out feature conversion on the convolution feature maps with the heights of 4, 2 and 1 to obtain a first transposition feature corresponding to the convolution feature map with the height of 4, a second transposition feature corresponding to the convolution feature map with the height of 2 and a third transposition feature corresponding to the convolution feature map with the height of 1;

a global feature extraction unit, configured to input the first transposed feature, the second transposed feature, and the third transposed feature to the recurrent neural network, and perform global feature extraction on the first transposed feature, the second transposed feature, and the third transposed feature to obtain a first sequence feature corresponding to the first transposed feature, a second sequence feature corresponding to the second transposed feature, and a third sequence feature corresponding to the third transposed feature;

and the sequence feature merging unit is used for merging the sequence features of the first sequence feature, the second sequence feature and the third sequence feature to obtain the sequence feature.

9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the hybrid convolution based text image recognition method according to any one of claims 1 to 5 when executing the computer program.

10. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, implements the hybrid convolution-based text-image recognition method according to any one of claims 1 to 5.