CN110008961B - Text real-time identification method, text real-time identification device, computer equipment and storage medium - Google Patents


Info

Publication number: CN110008961B
Application number: CN201910256927.4A
Authority: CN (China)
Prior art keywords: output result, convolution processing, convolution, module
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN110008961A
Inventors: 张欢, 李爱林, 张仕洋
Current assignee: Shenzhen Huafu Technology Co ltd
Original assignee: Shenzhen Huafu Technology Co ltd
Application filed by Shenzhen Huafu Technology Co ltd
Priority to CN201910256927.4A
Publication of application CN110008961A, later granted and published as CN110008961B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06V 10/267 Segmentation by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V 30/00 Character recognition; recognising digital ink; document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/14 Image acquisition
    • G06V 30/148 Segmentation of character regions
    • G06V 30/153 Segmentation of character regions using recognition of characters or words

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Character Discrimination (AREA)

Abstract

The invention relates to a text real-time recognition method, apparatus, computer device and storage medium. The method comprises: acquiring image data to be recognized; inputting the image data to be recognized into a character recognition model for character recognition to obtain a recognition result; and aligning the recognition result by adopting a CTC loss function to obtain a character sequence. The character recognition model is obtained by training a convolutional neural network with labeled image data as sample data. During training of the model, convolution calculation is combined with pooling-layer downsampling, while batch normalization layers and dropout layers accelerate convergence, improve stability and prevent overfitting; the convolution kernels are restructured to reduce the amount of computation, so that characters can be recognized with low power consumption and the speed of character recognition is improved.

Description

Text real-time identification method, text real-time identification device, computer equipment and storage medium
Technical Field
The present invention relates to a text recognition method, and more particularly, to a text real-time recognition method, apparatus, computer device, and storage medium.
Background
Text detection comprises text localization and text recognition. Traditional text recognition systems mostly rely on classical computer vision algorithms rather than neural networks and have low accuracy. Recognition is also hampered by the need for prior character segmentation: the image is first segmented into characters, the segmented characters are classified individually, and post-processing then joins the recognized characters into the final recognition result. Because this splits recognition into two steps, errors produced by segmentation, which is only an intermediate stage whose result is not needed in itself, propagate into the next step, seriously degrading the accuracy of single-character classification and thus the final recognition quality.
More recently, recognition methods train a text recognition model with an effective neural network and use that model to recognize text. Text-line recognition is generally a sequence-to-sequence problem: the input is picture information, i.e. a sequence of pixels, and the output is a text sequence. An LSTM-based RNN model handles such sequence problems well thanks to its sequence modeling capability; however, in terms of power consumption and speed, LSTM is far less suitable than convolution for deployment on mobile terminals. Moreover, an image sequence has no inherent time dependency, so modeling it with a heavy LSTM is not the only, or the optimal, choice, and neural-network character recognition usually consumes large amounts of computing resources and cannot leave the cloud environment.
Therefore, it is necessary to design a new method that recognizes characters with low power consumption and increases the speed of character recognition.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a text real-time recognition method, apparatus, computer device and storage medium.
In order to achieve the above purpose, the present invention adopts the following technical scheme. The text real-time recognition method comprises the following steps:
acquiring image data to be recognized;
inputting the image data to be recognized into a character recognition model for character recognition to obtain a recognition result;
aligning the recognition result by adopting a CTC loss function to obtain a character sequence;
wherein the character recognition model is obtained by training a convolutional neural network with labeled image data as sample data.
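For illustration only, the three steps above can be sketched as follows; TensorFlow is assumed as the framework, and `recognition_model` and the file name are placeholders rather than elements of the invention:

```python
import tensorflow as tf

# Hedged sketch of the claimed pipeline; `recognition_model` and the file
# name are placeholders, not part of the patent.
def recognize(recognition_model, path='sample.jpg'):
    img = tf.io.decode_image(tf.io.read_file(path), channels=1)   # image data to be recognized
    img = tf.expand_dims(tf.image.convert_image_dtype(img, tf.float32), 0)
    probs = recognition_model(img, training=False)                # recognition result: [1, T, classes]
    seq_len = tf.fill([1], tf.shape(probs)[1])
    decoded, _ = tf.keras.backend.ctc_decode(probs, input_length=seq_len)  # CTC alignment
    return decoded[0]                                             # character sequence (as indices)
```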
A further technical scheme is as follows: obtaining the character recognition model by training a convolutional neural network with labeled image data as sample data comprises the following steps:
constructing a loss function and a convolutional neural network;
acquiring labeled image data to obtain sample data;
inputting the sample data into the convolutional neural network for convolution calculation to obtain a sample output result;
inputting the sample output result and the labeled image data into the loss function to obtain a loss value;
adjusting parameters of the convolutional neural network according to the loss value;
learning the convolutional neural network with the sample data under a deep learning framework to obtain the character recognition model.
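A minimal sketch of this training procedure, assuming TensorFlow 2 as the deep learning framework and a CTC-style loss (consistent with, but not fixed by, the text); `model`, `images`, `labels` and `label_length` are placeholders:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(1e-3)  # the optimizer choice is an assumption

@tf.function
def train_step(model, images, labels, label_length):
    with tf.GradientTape() as tape:
        logits = model(images, training=True)                     # sample output result
        logit_length = tf.fill([tf.shape(logits)[0]], tf.shape(logits)[1])
        loss = tf.reduce_mean(tf.nn.ctc_loss(                     # loss value
            labels=labels, logits=logits,
            label_length=label_length, logit_length=logit_length,
            logits_time_major=False, blank_index=-1))
    grads = tape.gradient(loss, model.trainable_variables)        # adjust network parameters
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```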
A further technical scheme is as follows: the step of inputting the sample data into the convolutional neural network for convolution calculation to obtain a sample output result comprises the following steps:
performing convolution processing with a 3×3 convolution kernel on the sample data to obtain a first output result;
performing maximum pooling processing on the first output result to obtain a second output result;
performing cross convolution processing on the second output result to obtain a third output result;
performing mean pooling processing on the third output result to obtain a fourth output result;
performing convolution processing with a 3×3 convolution kernel and cross convolution processing on the third output result to obtain a fifth output result;
splicing the fourth output result and the fifth output result to obtain a twenty-first output result;
performing cross convolution processing on the twenty-first output result to obtain a seventh output result;
splicing the seventh output result and the fourth output result to obtain a sixth output result;
performing cross convolution processing on the sixth output result to obtain an eighth output result;
performing maximum pooling processing on the eighth output result to obtain a ninth output result;
performing feature-map adjacent-region dropout processing on the ninth output result to obtain a tenth output result;
performing mean pooling processing on the seventh output result to obtain an eleventh output result;
splicing the tenth output result and the eleventh output result to obtain a twelfth output result;
performing cross convolution processing on the twelfth output result to obtain a thirteenth output result;
performing convolution processing with a 3×3 convolution kernel on the thirteenth output result to obtain a fourteenth output result;
performing feature-map adjacent-region dropout processing on the fourteenth output result to obtain a fifteenth output result;
performing convolution processing with a 3×3 convolution kernel on the fifteenth output result to obtain a sixteenth output result;
performing global pooling processing on the sixteenth output result to obtain a seventeenth output result;
fully connecting the seventeenth output result to obtain an eighteenth output result;
tiling the eighteenth output result to obtain a nineteenth output result;
splicing the nineteenth output result and the sixteenth output result to obtain a twentieth output result;
performing convolution processing with 1×8 and 8×1 convolution kernels on the twentieth output result to obtain the sample output result.
A further technical scheme is as follows: performing cross convolution processing on the second output result to obtain a third output result comprises:
performing convolution processing with a 1×1 convolution kernel on the second output result to obtain a preliminary result;
performing convolution processing with a 1×3 convolution kernel on the preliminary result to obtain a secondary result;
performing convolution processing with a 3×1 convolution kernel on the secondary result to obtain a tertiary result;
performing convolution processing with a 1×1 convolution kernel on the tertiary result to obtain the third output result.
A further technical scheme is as follows: performing mean pooling processing on the third output result to obtain a fourth output result comprises:
averaging adjacent pixels in the third output result to obtain the fourth output result.
A further technical scheme is as follows: after the recognition result is aligned by adopting the CTC loss function to obtain the character sequence, the method further comprises:
outputting the character sequence.
The invention also provides a text real-time recognition apparatus, which comprises:
a data acquisition unit for acquiring image data to be recognized;
a recognition unit for inputting the image data to be recognized into a character recognition model for character recognition to obtain a recognition result; and
an alignment unit for aligning the recognition result by adopting a CTC loss function to obtain a character sequence.
A further technical scheme is as follows: the apparatus further comprises:
a training unit for training a convolutional neural network with labeled image data as sample data to obtain the character recognition model.
The invention also provides a computer device, which comprises a memory and a processor, wherein the memory stores a computer program and the processor implements the above method when executing the computer program.
The present invention also provides a storage medium storing a computer program which, when executed by a processor, performs the above-described method.
Compared with the prior art, the invention has the following beneficial effects: the image data to be recognized is input into the character recognition model for character recognition, and in training the model, convolution calculation is combined with pooling-layer downsampling, while batch normalization layers and dropout layers accelerate convergence, improve stability and prevent overfitting; the convolution kernels are restructured to reduce the amount of computation, so that characters can be recognized with low power consumption and the speed of character recognition is improved.
The invention is further described below with reference to the drawings and specific embodiments.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of an application scenario of a text real-time recognition method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a text real-time recognition method according to an embodiment of the present invention;
FIG. 3 is a schematic sub-flowchart of a text real-time recognition method according to an embodiment of the present invention;
FIG. 4 is a schematic sub-flowchart of a text real-time recognition method according to an embodiment of the present invention;
FIG. 5 is a schematic sub-flowchart of a text real-time recognition method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a cross-convolution process provided by an embodiment of the present invention;
FIG. 7 is a schematic diagram of an averaging (mean pooling) process according to an embodiment of the present invention;
FIG. 8 is a flowchart of a text real-time recognition method according to another embodiment of the present invention;
FIG. 9 is a schematic block diagram of a text real-time recognition device according to an embodiment of the present invention;
FIG. 10 is a schematic block diagram of a text real-time recognition device according to another embodiment of the present invention;
fig. 11 is a schematic block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Referring to fig. 1 and fig. 2, fig. 1 is a schematic diagram of an application scenario of a text real-time recognition method according to an embodiment of the present invention, and fig. 2 is a schematic flow chart of the method. The method is applied to a server that exchanges data with a terminal. The terminal captures the image data to be recognized and transmits it to the server; a character recognition model in the server performs character recognition on the image data, and the recognition result is aligned to obtain the true character sequence, i.e. the text information. The text information can then be transmitted to the terminal, or used to make the corresponding response and control the terminal.
Fig. 2 is a flow chart of a text real-time recognition method according to an embodiment of the present invention. As shown in fig. 2, the method includes the following steps S110 to S130.
S110, acquiring image data to be recognized.
In the present embodiment, the image data to be recognized refers to image data captured by a terminal; it may also be obtained by scanning or similar means.
S120, inputting the image data to be recognized into a character recognition model for character recognition so as to obtain a recognition result.
In the present embodiment, the recognition result is a probability sequence over characters, with a length of about 50 to 200.
The character recognition model is obtained by training a convolutional neural network with labeled image data as sample data.
In one embodiment, referring to fig. 3, the text recognition model training steps may include steps S121 to S126.
S121, constructing a loss function and a convolutional neural network.
In this embodiment, a convolutional neural network is constructed to perform convolution calculation on image data and thereby achieve classification and target localization. During training, every network needs a loss function to compute the loss value, which represents the difference between the output result and the actual result: the smaller the loss value, the smaller the difference and the better the network is trained, and vice versa. Convolutional neural networks are widely used in computer vision tasks such as target detection, semantic segmentation and object classification, achieve very good results, and show good adaptability to vision tasks.
S122, acquiring the image data with the marks to obtain sample data.
In this embodiment, the sample data refers to image data with text labels. The sample data may be divided into a large training set and a small test set: the training set is used to train the convolutional neural network and select the network with the smaller loss value, and the test set is used for testing.
S123, inputting the sample data into a convolutional neural network to perform convolutional calculation so as to obtain a sample output result.
In this embodiment, the sample output result is a probability sequence, that is, the sequence of character indices predicted from the sample data.
In one embodiment, referring to fig. 4, the step S123 may include steps S123a to S123v.
S123a, performing convolution processing with a 3×3 convolution kernel on the sample data to obtain a first output result.
S123b, performing maximum pooling processing on the first output result to obtain a second output result.
In the present embodiment, max pooling reads the maximum value over each window of image pixels.
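A tiny numerical illustration of max pooling (values chosen for the example only):

```python
import tensorflow as tf

# A 2x2 window of pixels is reduced to its maximum value.
x = tf.constant([[1., 3.],
                 [2., 4.]])
x = tf.reshape(x, [1, 2, 2, 1])                               # NHWC layout
y = tf.nn.max_pool2d(x, ksize=2, strides=2, padding='VALID')
print(float(tf.squeeze(y)))                                   # 4.0, the window maximum
```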
S123c, performing cross convolution processing on the second output result to obtain a third output result.
In this embodiment, referring to fig. 5, step S123c may include steps S123c1 to S123c4.
S123c1, performing convolution processing with a 1×1 convolution kernel on the second output result to obtain a preliminary result;
S123c2, performing convolution processing with a 1×3 convolution kernel on the preliminary result to obtain a secondary result;
S123c3, performing convolution processing with a 3×1 convolution kernel on the secondary result to obtain a tertiary result;
S123c4, performing convolution processing with a 1×1 convolution kernel on the tertiary result to obtain the third output result.
As shown in fig. 6, the middle part of the figure is the convolution kernel. A 3×3 kernel is commonly used at present, and such a middle-layer kernel multiplies the correlations of the preceding features, so the amount of computation is large. Here, stacked convolutions with 1×3 and 3×1 kernels replace the single 3×3 convolution, and 1×1 convolutions before and after form a bottleneck, which reduces the amount of computation.
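A minimal Keras sketch of this bottleneck; the ReLU activations and the halved channel count inside the bottleneck are assumptions, as the text fixes only the kernel shapes:

```python
from tensorflow.keras import layers

def cross_convolution(x, filters):
    # 1x1 bottleneck entry (preliminary result)
    y = layers.Conv2D(filters // 2, (1, 1), padding='same', activation='relu')(x)
    # stacked 1x3 then 3x1 kernels replace a single 3x3 kernel
    y = layers.Conv2D(filters // 2, (1, 3), padding='same', activation='relu')(y)  # secondary result
    y = layers.Conv2D(filters // 2, (3, 1), padding='same', activation='relu')(y)  # tertiary result
    # 1x1 bottleneck exit (third output result)
    return layers.Conv2D(filters, (1, 1), padding='same', activation='relu')(y)
```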
S123d, performing mean pooling processing on the third output result to obtain a fourth output result.
Specifically, adjacent pixels in the third output result are averaged to obtain the fourth output result.
Since the individual features have different resolutions at the time of splicing, the information of the larger-resolution feature maps is aligned by mean pooling, i.e. averaging over adjacent pixels to reduce the resolution, as shown in fig. 7.
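A sketch of this alignment, assuming the larger feature map has exactly twice the resolution of the smaller one (the text does not fix the ratio):

```python
from tensorflow.keras import layers

def align_and_splice(large, small):
    # average adjacent pixels to halve the resolution of the larger map
    pooled = layers.AveragePooling2D(pool_size=(2, 2))(large)
    # splice (concatenate) along the channel dimension
    return layers.Concatenate(axis=-1)([pooled, small])
```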
S123e, performing convolution processing with a 3×3 convolution kernel and cross convolution processing on the third output result to obtain a fifth output result;
S123f, splicing the fourth output result and the fifth output result to obtain a twenty-first output result;
S123g, performing cross convolution processing on the twenty-first output result to obtain a seventh output result;
S123h, splicing the seventh output result and the fourth output result to obtain a sixth output result;
S123i, performing cross convolution processing on the sixth output result to obtain an eighth output result;
S123j, performing maximum pooling processing on the eighth output result to obtain a ninth output result;
S123k, performing feature-map adjacent-region dropout processing on the ninth output result to obtain a tenth output result;
S123l, performing mean pooling processing on the seventh output result to obtain an eleventh output result;
S123m, splicing the tenth output result and the eleventh output result to obtain a twelfth output result;
S123n, performing cross convolution processing on the twelfth output result to obtain a thirteenth output result;
S123o, performing convolution processing with a 3×3 convolution kernel on the thirteenth output result to obtain a fourteenth output result;
S123p, performing feature-map adjacent-region dropout processing on the fourteenth output result to obtain a fifteenth output result;
S123q, performing convolution processing with a 3×3 convolution kernel on the fifteenth output result to obtain a sixteenth output result;
S123r, performing global pooling processing on the sixteenth output result to obtain a seventeenth output result;
S123s, fully connecting the seventeenth output result to obtain an eighteenth output result;
S123t, tiling the eighteenth output result to obtain a nineteenth output result;
S123u, splicing the nineteenth output result and the sixteenth output result to obtain a twentieth output result;
S123v, performing convolution processing with 1×8 and 8×1 convolution kernels on the twentieth output result to obtain the sample output result.
Shallow and deep features are connected multiple times to extract the features of the image sequence. Features extracted in the early stage of the network, i.e. shallow features, are spliced in the channel dimension with the features obtained after cross convolution or 3×3 convolution processing, giving a feature map whose class dimension covers the roughly 8,500 commonly used Chinese characters and whose width W can be set between 50 and 200; cutting the feature map along the width yields a feature sequence of length W, namely the probability sequence.
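An illustrative sketch of the cutting step; all shapes are assumptions chosen to match the description (about 8,500 character classes, width W = 100):

```python
import tensorflow as tf

# Illustrative shapes only: a feature map with ~8,500 character classes and
# width W = 100 is cut along the width into a probability sequence.
num_classes, width = 8500, 100
feature_map = tf.random.normal([1, num_classes, width])       # [batch, classes, W]
sequence = tf.transpose(feature_map, [0, 2, 1])               # [batch, W, classes]
probs = tf.nn.softmax(sequence, axis=-1)                      # one distribution per column
print(probs.shape)                                            # (1, 100, 8500)
```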
The cross convolution processing performs convolution with a 1×1 kernel first, then with a 1×3 kernel, then with a 3×1 kernel, and finally with a 1×1 kernel again. Convolution calculation is combined with pooling-layer downsampling, while batch normalization layers and dropout layers accelerate convergence, improve stability and prevent overfitting. Randomly dropping features is effective for fully connected layers, but experiments show it is not as effective for convolutional layers, so a recent dropout scheme designed for convolutional layers, which discards adjacent regions of the feature map, is adopted to strengthen the robustness of the network.
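The text does not name the convolution-layer dropout scheme; discarding adjacent feature-map regions resembles DropBlock, of which the following is a simplified, assumed sketch:

```python
import tensorflow as tf

def drop_feature_map_regions(x, keep_prob=0.9, block_size=3):
    """Simplified DropBlock-style dropout (assumed scheme): seed positions are
    sampled, each seed grows into a block via max pooling, and the surviving
    activations are rescaled to preserve their expected magnitude."""
    seeds = tf.cast(tf.random.uniform(tf.shape(x)) < (1.0 - keep_prob), x.dtype)
    blocks = tf.nn.max_pool2d(seeds, ksize=block_size, strides=1, padding='SAME')
    keep = 1.0 - blocks                                       # 0 inside dropped blocks
    return x * keep * tf.cast(tf.size(keep), x.dtype) / (tf.reduce_sum(keep) + 1e-6)
```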
At the end of the convolutional network, large horizontal and vertical convolution kernels (1×8 and 8×1) are adopted; they keep the amount of computation small while compensating for the absence of an LSTM, and because the kernels are long in both the horizontal and vertical directions (8 in each), the correlation information between horizontal and vertical positions is well exploited. LSTM was originally applied mainly to speech processing, natural language processing and similar fields, and handles sequence-in, sequence-out problems well. Character recognition can likewise be handled with a sequence-to-sequence framework, since a picture can be divided into a picture sequence and the output is a character sequence. Unlike speech, however, a picture naturally has only a left-to-right structure, and the left-to-right order of a text picture carries no dependency as strong as in speech, so long-kernel convolution can well replace an LSTM network for processing text images.
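A sketch of such a closing stage; the filter counts are assumptions, as the text fixes only the 1×8 and 8×1 kernel shapes:

```python
from tensorflow.keras import layers

def long_kernel_head(x, num_classes):
    # long horizontal then vertical kernels gather far-apart correlations,
    # standing in for the sequence modelling an LSTM would provide
    y = layers.Conv2D(256, (1, 8), padding='same', activation='relu')(x)
    y = layers.Conv2D(256, (8, 1), padding='same', activation='relu')(y)
    return layers.Conv2D(num_classes, (1, 1), padding='same')(y)  # per-position logits
```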
S124, inputting the sample output result and the labeled image data into the loss function to obtain a loss value.
S125, adjusting parameters of the convolutional neural network according to the loss value.
S126, learning the convolutional neural network with the sample data under a deep learning framework to obtain the character recognition model.
Parameters of the convolutional neural network are adjusted continuously through repeated learning and training until a convolutional neural network meeting the requirements is obtained. Specifically, training uses TensorFlow, and after conversion into the corresponding character recognition model, the network is easily deployed on a server or a terminal through TensorFlow Lite (tflite) and related TensorFlow tooling. It not only supports ordinary processor execution but can also be accelerated on the corresponding device through OpenCL (Open Computing Language).
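A hedged sketch of this deployment path using TensorFlow Lite; `model` stands for the trained character recognition model, and the optimization flag and file name are illustrative:

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(model)   # trained Keras model assumed
converter.optimizations = [tf.lite.Optimize.DEFAULT]          # optional size/latency optimization
tflite_model = converter.convert()
with open('text_recognizer.tflite', 'wb') as f:               # file name is illustrative
    f.write(tflite_model)
```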
A single forward pass of the resulting character recognition model takes only about 0.22 GFLOPs, so the forward computation can handle a large number of character recognition tasks in real time. Compared with a complex RNN (recurrent neural network) model, this removes a great amount of the compute and memory demands placed on embedded devices. In addition, a character recognition algorithm fit for practical use has to face a series of problems such as blurred pictures, poor illumination and physical deformation; these are handled through careful and extensive text augmentation and generation, making the algorithm effective in real scenarios and specific business inspections.
S130, aligning the recognition result by adopting a CTC loss function to obtain a character sequence.
The character recognition model outputs a probability sequence over a string of characters with a length of about 50 to 200. Since the final objective is the true character sequence, i.e. the actual characters in the image data to be recognized (for example, the typically 7 characters of a license plate), the two must be aligned. The CTC loss function widely used in speech recognition is adopted to align them and obtain the character sequence.
Running on an RK3399 Android device, the method recognizes several typical character types: for 8-digit recognition the accuracy on a private test set is about 99.1% at a speed of about 20 milliseconds, and for 14-character Chinese recognition the accuracy on the private test set is about 98.8% at about 46 milliseconds.
According to the text real-time recognition method described above, the image data to be recognized is input into the character recognition model for character recognition; in training the model, convolution calculation is combined with pooling-layer downsampling, while batch normalization layers and dropout layers accelerate convergence, improve stability and prevent overfitting, and the convolution kernels are restructured to reduce the amount of computation, so that characters can be recognized with low power consumption and the speed of character recognition is improved.
Fig. 8 is a flowchart of a text real-time recognition method according to another embodiment of the present invention. As shown in fig. 8, the text real-time recognition method of the present embodiment includes steps S210 to S240. Steps S210 to S230 are similar to steps S110 to S130 in the above embodiment, and are not described herein. Step S240 added in the present embodiment is described in detail below.
S240, outputting a character sequence.
The recognized character sequence is output to a terminal for display, or a corresponding response is made according to the output character sequence, such as retrieving the corresponding data.
Fig. 9 is a schematic block diagram of a text real-time recognition device 300 according to an embodiment of the present invention. As shown in fig. 9, the present invention further provides a text real-time recognition device 300 corresponding to the above text real-time recognition method. The text real-time recognition apparatus 300 includes a unit for performing the text real-time recognition method described above, and may be configured in a server or a terminal.
Specifically, referring to fig. 9, the text real-time recognition apparatus 300 includes:
a data acquisition unit 301, configured to acquire image data to be recognized;
a recognition unit 302, configured to input the image data to be recognized into the character recognition model for character recognition to obtain a recognition result;
an alignment unit 303, configured to align the recognition result by adopting a CTC loss function to obtain a character sequence.
In an embodiment, the device further comprises:
a training unit, configured to train a convolutional neural network with labeled image data as sample data to obtain the character recognition model.
In an embodiment, the training unit comprises:
a construction subunit, configured to construct a loss function and a convolutional neural network;
a sample data forming subunit, configured to acquire labeled image data to obtain sample data;
a calculation subunit, configured to input the sample data into the convolutional neural network for convolution calculation to obtain a sample output result;
a loss value acquisition subunit, configured to input the sample output result and the labeled image data into the loss function to obtain a loss value;
a parameter adjusting subunit, configured to adjust parameters of the convolutional neural network according to the loss value;
a learning subunit, configured to learn the convolutional neural network with the sample data under a deep learning framework to obtain the character recognition model.
In an embodiment, the computing subunit comprises:
a first convolution processing module, configured to perform convolution processing with a 3×3 convolution kernel on the sample data to obtain a first output result;
a first maximum pooling module, configured to perform maximum pooling processing on the first output result to obtain a second output result;
a second convolution processing module, configured to perform cross convolution processing on the second output result to obtain a third output result;
a first averaging module, configured to perform mean pooling processing on the third output result to obtain a fourth output result;
a third convolution processing module, configured to perform convolution processing with a 3×3 convolution kernel and cross convolution processing on the third output result to obtain a fifth output result;
a first splicing module, configured to splice the fourth output result and the fifth output result to obtain a twenty-first output result;
a fourth convolution processing module, configured to perform cross convolution processing on the twenty-first output result to obtain a seventh output result;
a second splicing module, configured to splice the seventh output result and the fourth output result to obtain a sixth output result;
a fifth convolution processing module, configured to perform cross convolution processing on the sixth output result to obtain an eighth output result;
a second maximum pooling module, configured to perform maximum pooling processing on the eighth output result to obtain a ninth output result;
a first discarding module, configured to perform feature-map adjacent-region dropout processing on the ninth output result to obtain a tenth output result;
a second averaging module, configured to perform mean pooling processing on the seventh output result to obtain an eleventh output result;
a third splicing module, configured to splice the tenth output result and the eleventh output result to obtain a twelfth output result;
a sixth convolution processing module, configured to perform cross convolution processing on the twelfth output result to obtain a thirteenth output result;
a seventh convolution processing module, configured to perform convolution processing with a 3×3 convolution kernel on the thirteenth output result to obtain a fourteenth output result;
a second discarding module, configured to perform feature-map adjacent-region dropout processing on the fourteenth output result to obtain a fifteenth output result;
an eighth convolution processing module, configured to perform convolution processing with a 3×3 convolution kernel on the fifteenth output result to obtain a sixteenth output result;
a global pooling module, configured to perform global pooling processing on the sixteenth output result to obtain a seventeenth output result;
a full connection module, configured to fully connect the seventeenth output result to obtain an eighteenth output result;
a tiling module, configured to tile the eighteenth output result to obtain a nineteenth output result;
a fourth splicing module, configured to splice the nineteenth output result and the sixteenth output result to obtain a twentieth output result;
a ninth convolution processing module, configured to perform convolution processing with 1×8 and 8×1 convolution kernels on the twentieth output result to obtain the sample output result.
In one embodiment, the second convolution processing module includes:
a primary convolution sub-module, configured to perform convolution processing with a 1×1 convolution kernel on the second output result to obtain a preliminary result;
a secondary convolution sub-module, configured to perform convolution processing with a 1×3 convolution kernel on the preliminary result to obtain a secondary result;
a tertiary convolution sub-module, configured to perform convolution processing with a 3×1 convolution kernel on the secondary result to obtain a tertiary result;
a quaternary convolution sub-module, configured to perform convolution processing with a 1×1 convolution kernel on the tertiary result to obtain the third output result.
Fig. 10 is a schematic block diagram of a text real-time recognition device 300 according to another embodiment of the present invention. As shown in fig. 10, the text real-time recognition device 300 of the present embodiment is added with the output unit 304 on the basis of the above embodiment.
An output unit 304 for outputting the character sequence.
It should be noted that, as a person skilled in the art can clearly understand the specific implementation process of the text real-time recognition device 300 and each unit, reference may be made to the corresponding description in the foregoing method embodiment, and for convenience and brevity of description, the description is omitted here.
The text real-time recognition apparatus 300 described above may be implemented in the form of a computer program that can be run on a computer device as shown in fig. 11.
Referring to fig. 11, fig. 11 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 500 may be a terminal or a server, where the terminal may be an electronic device with a communication function, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, and a wearable device. The server may be an independent server or a server cluster formed by a plurality of servers.
With reference to FIG. 11, the computer device 500 includes a processor 502, memory, and a network interface 505 connected by a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032 includes program instructions that, when executed, cause the processor 502 to perform a text real-time recognition method.
The processor 502 is used to provide computing and control capabilities to support the operation of the overall computer device 500.
The internal memory 504 provides an environment for the execution of a computer program 5032 in the non-volatile storage medium 503, which computer program 5032, when executed by the processor 502, causes the processor 502 to perform a text real-time recognition method.
The network interface 505 is used for network communication with other devices. Those skilled in the art will appreciate that the architecture shown in fig. 11 is merely a block diagram of a portion of the architecture in connection with the present application and is not intended to limit the computer device 500 to which the present application is applied, and that a particular computer device 500 may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
The processor 502 is configured to execute the computer program 5032 stored in the memory to implement the following steps:
acquiring image data to be recognized;
inputting the image data to be recognized into a character recognition model for character recognition to obtain a recognition result;
aligning the recognition result by adopting a CTC loss function to obtain a character sequence;
wherein the character recognition model is obtained by training a convolutional neural network with labeled image data as sample data.
In one embodiment, when implementing the step in which the character recognition model is obtained by training a convolutional neural network with labeled image data as sample data, the processor 502 specifically implements the following steps:
constructing a loss function and a convolutional neural network;
acquiring labeled image data to obtain sample data;
inputting the sample data into the convolutional neural network for convolution calculation to obtain a sample output result;
inputting the sample output result and the labeled image data into the loss function to obtain a loss value;
adjusting parameters of the convolutional neural network according to the loss value;
learning the convolutional neural network with the sample data under a deep learning framework to obtain the character recognition model.
In one embodiment, when implementing the step of inputting the sample data into the convolutional neural network for convolution calculation to obtain a sample output result, the processor 502 specifically implements the following steps:
performing convolution processing with a 3×3 convolution kernel on the sample data to obtain a first output result;
performing maximum pooling processing on the first output result to obtain a second output result;
performing cross convolution processing on the second output result to obtain a third output result;
performing mean pooling processing on the third output result to obtain a fourth output result;
performing convolution processing with a 3×3 convolution kernel and cross convolution processing on the third output result to obtain a fifth output result;
splicing the fourth output result and the fifth output result to obtain a twenty-first output result;
performing cross convolution processing on the twenty-first output result to obtain a seventh output result;
splicing the seventh output result and the fourth output result to obtain a sixth output result;
performing cross convolution processing on the sixth output result to obtain an eighth output result;
performing maximum pooling processing on the eighth output result to obtain a ninth output result;
performing feature-map adjacent-region dropout processing on the ninth output result to obtain a tenth output result;
performing mean pooling processing on the seventh output result to obtain an eleventh output result;
splicing the tenth output result and the eleventh output result to obtain a twelfth output result;
performing cross convolution processing on the twelfth output result to obtain a thirteenth output result;
performing convolution processing with a 3×3 convolution kernel on the thirteenth output result to obtain a fourteenth output result;
performing feature-map adjacent-region dropout processing on the fourteenth output result to obtain a fifteenth output result;
performing convolution processing with a 3×3 convolution kernel on the fifteenth output result to obtain a sixteenth output result;
performing global pooling processing on the sixteenth output result to obtain a seventeenth output result;
fully connecting the seventeenth output result to obtain an eighteenth output result;
tiling the eighteenth output result to obtain a nineteenth output result;
splicing the nineteenth output result and the sixteenth output result to obtain a twentieth output result;
performing convolution processing with 1×8 and 8×1 convolution kernels on the twentieth output result to obtain the sample output result.
In one embodiment, when implementing the step of performing cross convolution processing on the second output result to obtain a third output result, the processor 502 specifically implements the following steps:
performing convolution processing with a 1×1 convolution kernel on the second output result to obtain a preliminary result;
performing convolution processing with a 1×3 convolution kernel on the preliminary result to obtain a secondary result;
performing convolution processing with a 3×1 convolution kernel on the secondary result to obtain a tertiary result;
performing convolution processing with a 1×1 convolution kernel on the tertiary result to obtain the third output result.
In one embodiment, when implementing the step of performing mean pooling processing on the third output result to obtain a fourth output result, the processor 502 specifically implements the following step:
averaging adjacent pixels in the third output result to obtain the fourth output result.
In one embodiment, after implementing the step of aligning the recognition result using the CTC loss function to obtain a character sequence, the processor 502 further implements the following step:
outputting the character sequence.
It should be appreciated that in embodiments of the present application, the processor 502 may be a central processing unit (Central Processing Unit, CPU), the processor 502 may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSPs), application specific integrated circuits (Application Specific Integrated Circuit, ASICs), off-the-shelf programmable gate arrays (Field-Programmable Gate Array, FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. Wherein the general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Those skilled in the art will appreciate that all or part of the flow in a method embodying the above described embodiments may be accomplished by computer programs instructing the relevant hardware. The computer program comprises program instructions, and the computer program can be stored in a storage medium, which is a computer readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the flow steps of the embodiments of the method described above.
Accordingly, the present invention also provides a storage medium. The storage medium may be a computer-readable storage medium. The storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following steps:
acquiring image data to be recognized;
inputting the image data to be recognized into a character recognition model for character recognition to obtain a recognition result;
aligning the recognition result by adopting a CTC loss function to obtain a character sequence;
wherein the character recognition model is obtained by training a convolutional neural network with labeled image data as sample data.
In one embodiment, when the processor executes the computer program to implement the step in which the character recognition model is obtained by training a convolutional neural network with labeled image data as sample data, the processor specifically implements the following steps:
constructing a loss function and a convolutional neural network;
acquiring labeled image data to obtain sample data;
inputting the sample data into the convolutional neural network for convolution calculation to obtain a sample output result;
inputting the sample output result and the labeled image data into the loss function to obtain a loss value;
adjusting parameters of the convolutional neural network according to the loss value;
learning the convolutional neural network with the sample data under a deep learning framework to obtain the character recognition model.
In one embodiment, when the processor executes the computer program to implement the step of inputting the sample data into the convolutional neural network for convolution calculation to obtain a sample output result, the processor specifically implements the following steps:
performing convolution processing with a 3×3 convolution kernel on the sample data to obtain a first output result;
performing maximum pooling processing on the first output result to obtain a second output result;
performing cross convolution processing on the second output result to obtain a third output result;
performing mean pooling processing on the third output result to obtain a fourth output result;
performing convolution processing with a 3×3 convolution kernel and cross convolution processing on the third output result to obtain a fifth output result;
splicing the fourth output result and the fifth output result to obtain a twenty-first output result;
performing cross convolution processing on the twenty-first output result to obtain a seventh output result;
splicing the seventh output result and the fourth output result to obtain a sixth output result;
performing cross convolution processing on the sixth output result to obtain an eighth output result;
performing maximum pooling processing on the eighth output result to obtain a ninth output result;
performing feature-map adjacent-region dropout processing on the ninth output result to obtain a tenth output result;
performing mean pooling processing on the seventh output result to obtain an eleventh output result;
splicing the tenth output result and the eleventh output result to obtain a twelfth output result;
performing cross convolution processing on the twelfth output result to obtain a thirteenth output result;
performing convolution processing with a 3×3 convolution kernel on the thirteenth output result to obtain a fourteenth output result;
performing feature-map adjacent-region dropout processing on the fourteenth output result to obtain a fifteenth output result;
performing convolution processing with a 3×3 convolution kernel on the fifteenth output result to obtain a sixteenth output result;
performing global pooling processing on the sixteenth output result to obtain a seventeenth output result;
fully connecting the seventeenth output result to obtain an eighteenth output result;
tiling the eighteenth output result to obtain a nineteenth output result;
splicing the nineteenth output result and the sixteenth output result to obtain a twentieth output result;
performing convolution processing with 1×8 and 8×1 convolution kernels on the twentieth output result to obtain the sample output result.
In one embodiment, when the processor executes the computer program to implement the step of performing cross convolution processing on the second output result to obtain a third output result, the processor specifically implements the following steps:
performing convolution processing with a 1×1 convolution kernel on the second output result to obtain a preliminary result;
performing convolution processing with a 1×3 convolution kernel on the preliminary result to obtain a secondary result;
performing convolution processing with a 3×1 convolution kernel on the secondary result to obtain a tertiary result;
performing convolution processing with a 1×1 convolution kernel on the tertiary result to obtain the third output result.
In one embodiment, when the processor executes the computer program to implement the step of performing mean pooling processing on the third output result to obtain a fourth output result, the processor specifically implements the following step:
averaging adjacent pixels in the third output result to obtain the fourth output result.
In one embodiment, after executing the computer program to implement the step of aligning the recognition result using the CTC loss function to obtain a character sequence, the processor further implements the following step:
outputting the character sequence.
The storage medium may be a U-disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk, or other various computer-readable storage media that can store program codes.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the device embodiments described above are merely illustrative. For example, the division of each unit is only one logic function division, and there may be another division manner in actual implementation. For example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed.
The steps in the method of the embodiment of the invention can be sequentially adjusted, combined and deleted according to actual needs. The units in the device of the embodiment of the invention can be combined, divided and deleted according to actual needs. In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The integrated unit may be stored in a storage medium if implemented in the form of a software functional unit and sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention is essentially or a part contributing to the prior art, or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a terminal, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and equivalent substitutions may be made without departing from the scope of the invention. Therefore, the protection scope of the invention is defined by the claims.

Claims (6)

1. A real-time character recognition method, characterized by comprising the following steps:
acquiring image data to be recognized;
inputting the image data to be recognized into a character recognition model for character recognition to obtain a recognition result;
aligning the recognition result by adopting a CTC loss function to obtain a character sequence;
wherein the character recognition model is obtained by training a convolutional neural network by taking image data with a mark as sample data;
the training of the convolutional neural network by taking the image data with the mark as sample data to obtain the character recognition model comprises the following steps:
constructing a loss function and a convolutional neural network;
acquiring image data with a mark to obtain sample data;
inputting the sample data into the convolutional neural network for convolution calculation so as to obtain a sample output result;
inputting the sample output result and the image data with the mark into the loss function to obtain a loss value;
adjusting parameters of the convolutional neural network according to the loss value;
learning the convolutional neural network by using the sample data and a deep learning framework so as to obtain the character recognition model;
wherein the inputting the sample data into the convolutional neural network for convolution calculation so as to obtain a sample output result comprises the following steps:
performing convolution processing with a convolution kernel of 3*3 on the sample data to obtain a first output result;
performing maximum pooling processing on the first output result to obtain a second output result;
performing cross convolution processing on the second output result to obtain a third output result;
performing mean value pooling processing on the third output result to obtain a fourth output result;
performing convolution processing with a convolution kernel of 3*3 and cross convolution processing on the third output result to obtain a fifth output result;
splicing the fourth output result and the fifth output result to obtain a twenty-first output result;
performing cross convolution processing on the twenty-first output result to obtain a seventh output result;
splicing the seventh output result and the fourth output result to obtain a sixth output result;
performing cross convolution processing on the sixth output result to obtain an eighth output result;
performing maximum pooling processing on the eighth output result to obtain a ninth output result;
performing dropout processing on adjacent regions of the feature map of the ninth output result to obtain a tenth output result;
performing mean value pooling processing on the seventh output result to obtain an eleventh output result;
splicing the tenth output result and the eleventh output result to obtain a twelfth output result;
performing cross convolution processing on the twelfth output result to obtain a thirteenth output result;
performing convolution processing with a convolution kernel of 3*3 on the thirteenth output result to obtain a fourteenth output result;
performing dropout processing on adjacent regions of the feature map of the fourteenth output result to obtain a fifteenth output result;
performing convolution processing with a convolution kernel of 3*3 on the fifteenth output result to obtain a sixteenth output result;
performing global pooling processing on the sixteenth output result to obtain a seventeenth output result;
fully connecting the seventeenth output result to obtain an eighteenth output result;
tiling the eighteenth output result to obtain a nineteenth output result;
splicing the nineteenth output result and the sixteenth output result to obtain a twentieth output result;
performing convolution processing with convolution kernels of 1*8 and 8*1 on the twentieth output result to obtain the sample output result;
wherein the performing mean value pooling processing on the third output result to obtain a fourth output result comprises:
averaging adjacent pixels in the third output result to obtain the fourth output result.
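To make the data flow of claim 1 easier to trace, the following is a minimal PyTorch sketch of the recited backbone. Only the kernel shapes, the pooling/splicing order, and the result numbering come from the claim; the channel widths, strides, padding, the 64*256 input size, and nn.Dropout2d standing in for the adjacent-region discard are assumptions chosen so that the spliced shapes line up:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossConv(nn.Module):
    """The 1*1 -> 1*3 -> 3*1 -> 1*1 cross convolution chain (see claim 2)."""
    def __init__(self, cin, cout):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(cin, cout, 1),
            nn.Conv2d(cout, cout, (1, 3), padding=(0, 1)),
            nn.Conv2d(cout, cout, (3, 1), padding=(1, 0)),
            nn.Conv2d(cout, cout, 1),
        )

    def forward(self, x):
        return self.body(x)

class Backbone(nn.Module):
    def __init__(self, c=64, num_classes=37):
        super().__init__()
        self.stem = nn.Conv2d(1, c, 3, padding=1)             # 3*3 convolution
        self.cross1 = CrossConv(c, c)
        self.branch = nn.Sequential(                          # 3*3 conv + cross conv
            nn.Conv2d(c, c, 3, stride=2, padding=1),          # stride 2 so shapes match r4
            CrossConv(c, c),
        )
        self.cross2 = CrossConv(2 * c, c)
        self.cross3 = CrossConv(2 * c, c)
        self.drop1 = nn.Dropout2d(0.2)    # stand-in for the adjacent-region discard
        self.cross4 = CrossConv(2 * c, 2 * c)
        self.conv14 = nn.Conv2d(2 * c, 2 * c, 3, padding=1)
        self.drop2 = nn.Dropout2d(0.2)
        self.conv16 = nn.Conv2d(2 * c, 2 * c, 3, padding=1)
        self.fc = nn.Linear(2 * c, 2 * c)
        self.head1 = nn.Conv2d(4 * c, 2 * c, (1, 8))          # 1*8 convolution
        self.head2 = nn.Conv2d(2 * c, num_classes, (8, 1))    # 8*1 convolution

    def forward(self, x):                       # x: (N, 1, 64, 256) text-line image
        r1 = self.stem(x)                       # first output result
        r2 = F.max_pool2d(r1, 2)                # second: maximum pooling
        r3 = self.cross1(r2)                    # third: cross convolution
        r4 = F.avg_pool2d(r3, 2)                # fourth: average adjacent pixels
        r5 = self.branch(r3)                    # fifth
        r21 = torch.cat([r4, r5], dim=1)        # twenty-first: splice
        r7 = self.cross2(r21)                   # seventh
        r6 = torch.cat([r7, r4], dim=1)         # sixth
        r8 = self.cross3(r6)                    # eighth
        r9 = F.max_pool2d(r8, 2)                # ninth
        r10 = self.drop1(r9)                    # tenth
        r11 = F.avg_pool2d(r7, 2)               # eleventh
        r12 = torch.cat([r10, r11], dim=1)      # twelfth
        r13 = self.cross4(r12)                  # thirteenth
        r14 = self.conv14(r13)                  # fourteenth
        r15 = self.drop2(r14)                   # fifteenth
        r16 = self.conv16(r15)                  # sixteenth
        r17 = F.adaptive_avg_pool2d(r16, 1).flatten(1)  # seventeenth: global pooling
        r18 = self.fc(r17)                      # eighteenth: full connection
        h, w = r16.shape[-2:]
        r19 = r18[:, :, None, None].expand(-1, -1, h, w)  # nineteenth: tiling
        r20 = torch.cat([r19, r16], dim=1)      # twentieth
        return self.head2(self.head1(r20))      # sample output result

print(Backbone()(torch.randn(1, 1, 64, 256)).shape)  # torch.Size([1, 37, 1, 25])
```

Under these assumptions the output has shape (N, 37, 1, 25): 25 time steps of class scores per image, the form consumed by the CTC loss and by the greedy decoder sketched earlier.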
2. The real-time character recognition method according to claim 1, wherein the performing cross convolution processing on the second output result to obtain a third output result comprises:
performing convolution processing with a convolution kernel of 1*1 on the second output result to obtain a preliminary result;
performing convolution processing with a convolution kernel of 1*3 on the preliminary result to obtain a secondary result;
performing convolution processing with a convolution kernel of 3*1 on the secondary result to obtain a tertiary result;
and performing convolution processing with a convolution kernel of 1*1 on the tertiary result to obtain the third output result.
3. The real-time character recognition method according to any one of claims 1 to 2, wherein, after aligning the recognition result by using a CTC loss function to obtain a character sequence, the method further comprises:
outputting the character sequence.
4. A real-time character recognition device, characterized by comprising:
the data acquisition unit is used for acquiring image data to be recognized;
the recognition unit is used for inputting the image data to be recognized into the character recognition model to perform character recognition so as to obtain a recognition result;
an alignment unit for aligning the recognition result by using a CTC loss function to obtain a character sequence;
the device further comprises:
the training unit is used for training the convolutional neural network by taking the image data with the marks as sample data so as to obtain a character recognition model;
the training unit includes:
a construction subunit, configured to construct a loss function and a convolutional neural network;
a sample data forming subunit, configured to obtain image data with a mark, so as to obtain sample data;
the calculation subunit is used for inputting the sample data into the convolutional neural network to perform convolutional calculation so as to obtain a sample output result;
the loss value acquisition subunit is used for inputting the sample output result and the image data with the mark into the loss function so as to obtain a loss value;
the parameter adjusting subunit is used for adjusting parameters of the convolutional neural network according to the loss value;
the learning subunit is used for learning the convolutional neural network by using the sample data and a deep learning framework so as to obtain the character recognition model;
the calculation subunit includes:
the first convolution processing module is used for carrying out convolution processing with a convolution kernel of 3*3 on the sample data so as to obtain a first output result;
the first maximum pooling module is used for performing maximum pooling processing on the first output result so as to obtain a second output result;
the second convolution processing module is used for performing cross convolution processing on the second output result to obtain a third output result;
the first averaging module is used for performing mean value pooling processing on the third output result so as to obtain a fourth output result;
the third convolution processing module is used for performing convolution processing with a convolution kernel of 3*3 and cross convolution processing on the third output result to obtain a fifth output result;
the first splicing module is used for splicing the fourth output result and the fifth output result to obtain a twenty-first output result;
the fourth convolution processing module is used for performing cross convolution processing on the twenty-first output result to obtain a seventh output result;
the second splicing module is used for splicing the seventh output result and the fourth output result to obtain a sixth output result;
the fifth convolution processing module is used for performing cross convolution processing on the sixth output result to obtain an eighth output result;
the second maximum pooling module is used for performing maximum pooling processing on the eighth output result so as to obtain a ninth output result;
the first discarding module is used for performing dropout processing on adjacent regions of the feature map of the ninth output result to obtain a tenth output result;
the second averaging module is used for performing mean value pooling processing on the seventh output result so as to obtain an eleventh output result;
the third splicing module is used for splicing the tenth output result and the eleventh output result to obtain a twelfth output result;
the sixth convolution processing module is used for performing cross convolution processing on the twelfth output result to obtain a thirteenth output result;
the seventh convolution processing module is used for performing convolution processing with a convolution kernel of 3*3 on the thirteenth output result so as to obtain a fourteenth output result;
the second discarding module is used for performing dropout processing on adjacent regions of the feature map of the fourteenth output result to obtain a fifteenth output result;
the eighth convolution processing module is used for performing convolution processing with a convolution kernel of 3*3 on the fifteenth output result so as to obtain a sixteenth output result;
the global pooling module is used for performing global pooling processing on the sixteenth output result to obtain a seventeenth output result;
the full connection module is used for fully connecting the seventeenth output result to obtain an eighteenth output result;
the tiling module is used for tiling the eighteenth output result to obtain a nineteenth output result;
the fourth splicing module is used for splicing the nineteenth output result and the sixteenth output result to obtain a twentieth output result;
the ninth convolution processing module is used for performing convolution processing with convolution kernels of 1*8 and 8*1 on the twentieth output result so as to obtain the sample output result;
wherein the first averaging module is specifically used for averaging adjacent pixels in the third output result to obtain the fourth output result.
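The training subunits of claim 4 (construction, sample data forming, calculation, loss value acquisition, parameter adjusting) map onto a conventional supervised loop. A minimal sketch with nn.CTCLoss, an assumed Adam optimizer, and random tensors standing in for the marked sample data:

```python
import torch
import torch.nn as nn

model = nn.Sequential(                  # toy stand-in for the claim-1 backbone
    nn.Conv2d(1, 37, 3, padding=1),
    nn.AdaptiveAvgPool2d((1, 25)),      # collapse height, keep 25 time steps
)
ctc = nn.CTCLoss(blank=0)               # construction subunit: the loss function
opt = torch.optim.Adam(model.parameters(), lr=1e-3)   # assumed optimizer

images = torch.randn(4, 1, 64, 256)               # sample data (random stand-in)
targets = torch.randint(1, 37, (4, 10))           # the marks: label sequences
target_lengths = torch.full((4,), 10, dtype=torch.long)

logits = model(images)                            # calculation subunit: (N, C, 1, T)
log_probs = logits.squeeze(2).permute(2, 0, 1).log_softmax(-1)   # (T, N, C)
input_lengths = torch.full((4,), log_probs.size(0), dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)  # loss value subunit
opt.zero_grad()
loss.backward()
opt.step()                                        # parameter adjusting subunit
print(float(loss))
```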
5. A computer device, characterized by comprising a memory storing a computer program and a processor which, when executing the computer program, implements the method according to any one of claims 1 to 3.
6. A storage medium storing a computer program which, when executed by a processor, performs the method of any one of claims 1 to 3.
CN201910256927.4A 2019-04-01 2019-04-01 Text real-time identification method, text real-time identification device, computer equipment and storage medium Active CN110008961B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910256927.4A CN110008961B (en) 2019-04-01 2019-04-01 Text real-time identification method, text real-time identification device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910256927.4A CN110008961B (en) 2019-04-01 2019-04-01 Text real-time identification method, text real-time identification device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110008961A CN110008961A (en) 2019-07-12
CN110008961B true CN110008961B (en) 2023-05-12

Family

ID=67169203

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910256927.4A Active CN110008961B (en) 2019-04-01 2019-04-01 Text real-time identification method, text real-time identification device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110008961B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110688411A (en) * 2019-09-25 2020-01-14 北京地平线机器人技术研发有限公司 Text recognition method and device
CN112668600B (en) * 2019-10-16 2024-05-21 商汤国际私人有限公司 Text recognition method and device
CN111428656A (en) * 2020-03-27 2020-07-17 信雅达系统工程股份有限公司 Mobile terminal identity card identification method based on deep learning and mobile device
CN112215229B (en) * 2020-08-27 2023-07-18 北京英泰智科技股份有限公司 License plate recognition method and device based on lightweight network end-to-end
CN112116001B (en) * 2020-09-17 2022-06-07 苏州浪潮智能科技有限公司 Image recognition method, image recognition device and computer-readable storage medium
CN113283427B (en) * 2021-07-20 2021-10-01 北京世纪好未来教育科技有限公司 Text recognition method, device, equipment and medium
CN115601752A (en) * 2022-10-26 2023-01-13 Vivo Mobile Communication Co., Ltd. (CN) Character recognition method, character recognition device, electronic equipment and medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105335754A (en) * 2015-10-29 2016-02-17 小米科技有限责任公司 Character recognition method and device
CN106354701B (en) * 2016-08-30 2019-06-21 腾讯科技(深圳)有限公司 Chinese character processing method and device
CN106570509B (en) * 2016-11-04 2019-09-27 天津大学 A kind of dictionary learning and coding method for extracting digital picture feature
CN108182455A (en) * 2018-01-18 2018-06-19 齐鲁工业大学 A kind of method, apparatus and intelligent garbage bin of the classification of rubbish image intelligent
CN108427953A (en) * 2018-02-26 2018-08-21 北京易达图灵科技有限公司 A kind of character recognition method and device
CN108875904A (en) * 2018-04-04 2018-11-23 北京迈格威科技有限公司 Image processing method, image processing apparatus and computer readable storage medium

Also Published As

Publication number Publication date
CN110008961A (en) 2019-07-12

Similar Documents

Publication Publication Date Title
CN110008961B (en) Text real-time identification method, text real-time identification device, computer equipment and storage medium
US11457138B2 (en) Method and device for image processing, method for training object detection model
CN110176027B (en) Video target tracking method, device, equipment and storage medium
WO2020119301A1 (en) Two-dimensional code identification method, apparatus, and device
CN108229290B (en) Video object segmentation method and device, electronic equipment and storage medium
CN111178211A (en) Image segmentation method and device, electronic equipment and readable storage medium
CN111783767B (en) Character recognition method, character recognition device, electronic equipment and storage medium
CN111369427A (en) Image processing method, image processing device, readable medium and electronic equipment
CN113221869B (en) Medical invoice structured information extraction method, device equipment and storage medium
CN107564007B (en) Scene segmentation correction method and system fusing global information
US9087272B2 (en) Optical match character classification
US20220351413A1 (en) Target detection method, computer device and non-transitory readable storage medium
CN112836692B (en) Method, apparatus, device and medium for processing image
US11972543B2 (en) Method and terminal for improving color quality of images
CN110991412A (en) Face recognition method and device, storage medium and electronic equipment
CN114429636B (en) Image scanning identification method and device and electronic equipment
CN114419490A (en) SAR ship target detection method based on attention pyramid
CN107564013B (en) Scene segmentation correction method and system fusing local information
JP2019528520A (en) Classification network training apparatus, character recognition apparatus and method for character recognition
CN112396594A (en) Change detection model acquisition method and device, change detection method, computer device and readable storage medium
CN110956133A (en) Training method of single character text normalization model, text recognition method and device
CN111784726A (en) Image matting method and device
CN114973268A (en) Text recognition method and device, storage medium and electronic equipment
CN113379592B (en) Processing method and device for sensitive area in picture and electronic equipment
CN112733670B (en) Fingerprint feature extraction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Applicant after: Shenzhen Huafu Technology Co.,Ltd.

Address before: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Applicant before: SHENZHEN HUAFU INFORMATION TECHNOLOGY Co.,Ltd.

GR01 Patent grant