CN116563836A - Text recognition method - Google Patents

Text recognition method

Info

Publication number
CN116563836A
CN116563836A (application CN202310402464.4A)
Authority
CN
China
Prior art keywords
image
character
recognition result
single character
recognition
Prior art date
Legal status
Pending
Application number
CN202310402464.4A
Other languages
Chinese (zh)
Inventor
许玉辉 (Xu Yuhui)
Current Assignee
Nanjing Kuangyun Technology Co., Ltd.
Beijing Kuangshi Technology Co., Ltd.
Original Assignee
Nanjing Kuangyun Technology Co., Ltd.
Beijing Kuangshi Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Nanjing Kuangyun Technology Co., Ltd. and Beijing Kuangshi Technology Co., Ltd.
Priority to CN202310402464.4A
Publication of CN116563836A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/60 - Type of objects
    • G06V20/62 - Text, e.g. of license plates, overlay texts or captions on TV images
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/14 - Image acquisition
    • G06V30/148 - Segmentation of character regions
    • G06V30/153 - Segmentation of character regions using recognition of characters or words
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/19 - Recognition using electronic means
    • G06V30/191 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19113 - Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/19 - Recognition using electronic means
    • G06V30/191 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19147 - Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Character Input (AREA)
  • Character Discrimination (AREA)

Abstract

The application discloses a text recognition method comprising the following steps: for a single-character image, obtaining a first recognition result through a first text recognition model and a second recognition result through a second text recognition model; if the first recognition result is the same as the second recognition result, outputting either of them; if the two results differ and the first recognition result is a rare character, outputting the first recognition result. The method adopts a single-character recognition mode, omitting the line detection and segmentation steps that are difficult to implement well, and thereby mitigates the loss of recognition accuracy caused by complex shooting environments and irregular shooting. In addition, the method exploits the strength of the first text recognition model to recognize rare characters accurately, and the strength of the second text recognition model to recognize characters visually similar to rare characters accurately, solving the problem that recognition accuracy on visually similar characters drops after a model is updated with rare characters.

Description

Text recognition method
Technical Field
The present disclosure relates to the field of text processing technology, and in particular to a text recognition method, a computer-readable storage medium, an electronic device, and a computer program product.
Background
Text recognition is an important technique in the field of text processing: it enables a computer to recognize the characters in an image containing character content, so that the recognized characters can then be processed automatically, improving the efficiency of automated text processing.
At present, text recognition is often implemented in a line recognition mode: the region containing a single line of characters is first detected in the image, that region is then cropped out, and a line-level character recognition algorithm finally recognizes the characters within it.
However, because the image to be recognized is often captured in a complex shooting environment, line recognition of such images is difficult, and text recognition accuracy drops as a result.
Disclosure of Invention
The embodiments of the application provide a text recognition method, a computer-readable storage medium, an electronic device, and a computer program product, with the aim of improving text recognition accuracy through a single-character recognition mode.
According to a first aspect of the present application, a text recognition method is disclosed, comprising:
acquiring a character image containing character content;
segmenting single-character images from the character image, each single-character image containing one character;
inputting the single-character images into a first text recognition model to obtain a first recognition result for each single-character image output by the first text recognition model, and inputting the single-character images into a second text recognition model to obtain a second recognition result for each single-character image output by the second text recognition model, wherein the first text recognition model is used for recognizing rare characters and is continuously updated as the rare-character lexicon is updated;
for each single-character image, if the corresponding first recognition result is the same as the second recognition result, taking the first recognition result or the second recognition result as the target recognition result of the single-character image;
and if the corresponding first recognition result differs from the second recognition result and the first recognition result indicates that the single character in the single-character image is a rare character, taking the first recognition result as the target recognition result of the single-character image.
According to a second aspect of the present application, an electronic device is disclosed, comprising a memory, a processor, and a program stored on the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the text recognition method of the first aspect.
According to a third aspect of the present application, a computer-readable storage medium is disclosed, having stored thereon a program which, when executed by a processor, implements the steps of the text recognition method of the first aspect.
According to a fourth aspect of the present application, a computer program product is disclosed, comprising a computer program which, when executed by a processor, implements the steps of the text recognition method of the first aspect.
In the embodiments of the application, text recognition is performed in a single-character recognition mode rather than the line recognition mode. Compared with line recognition, performing text recognition through the first and second text recognition models in single-character mode omits the line detection and segmentation steps that are difficult to implement well, which mitigates the loss of recognition accuracy caused by complex shooting environments and irregular shooting. In addition, the first text recognition model can recognize rare characters. When the models are deployed online, the first text recognition model and a second text recognition model that has not been updated with rare characters can be deployed together; both models perform text recognition on the same input, and their outputs are compared. If the two results agree, either result is output; if they differ, it is further judged whether the recognized content is a rare character: the result of the second text recognition model is output if it is not, and the result of the first text recognition model is output if it is. In this way the strength of the first text recognition model is used to recognize rare characters accurately, and the strength of the second text recognition model is used to recognize characters visually similar to rare characters accurately, solving the problem that recognition accuracy on visually similar characters drops after a model is updated with rare characters.
Drawings
FIG. 1 is a flow chart of a text recognition method of some embodiments of the present application;
FIG. 2 is a schematic diagram of the comparison logic for differing recognition results in some embodiments of the present application;
FIG. 3 is a schematic diagram of an overall flow of text recognition in accordance with some embodiments of the present application;
FIG. 4 is a flow chart of a method of updating a text recognition model in accordance with some embodiments of the present application;
FIG. 5 is a flow chart of a specific text recognition method of some embodiments of the present application;
FIG. 6 is a schematic illustration of a model structure of some embodiments of the present application;
FIG. 7 is a schematic illustration of an overall training image of some embodiments of the present application;
FIG. 8 is a schematic diagram of a single character update image according to some embodiments of the present application;
FIG. 9 is a schematic diagram of a text recognition device of some embodiments of the present application;
FIG. 10 is a block diagram of an electronic device of some embodiments of the present application.
Detailed Description
In order that the above objects, features, and advantages of the present application may be more readily understood, the application is described in further detail below with reference to the specific embodiments illustrated in the appended drawings.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts, but those skilled in the art should understand that the embodiments are not limited by the order of the acts described, as some steps may occur in other orders or concurrently. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and that the acts referred to are not necessarily required by the embodiments of the present application.
In recent years, artificial-intelligence technologies such as computer vision, deep learning, machine learning, image processing, and image recognition have advanced significantly. Artificial intelligence (AI) is an emerging science and technology that studies and develops theories, methods, techniques, and application systems for simulating and extending human intelligence. AI is a comprehensive discipline involving technical fields such as chips, big data, cloud computing, the Internet of Things, distributed storage, deep learning, machine learning, and neural networks. Computer vision, an important branch of AI, aims to let machines understand the world they see; computer vision technologies generally include face recognition, vehicle path planning, fingerprint recognition and anti-counterfeit verification, biometric feature recognition, face detection, pedestrian detection, object detection, pedestrian recognition, image processing, image recognition, image semantic understanding, image retrieval, character recognition, video processing, video content recognition, behavior recognition, three-dimensional reconstruction, virtual reality, augmented reality, simultaneous localization and mapping (SLAM), computational photography, and robot navigation and positioning. With the research and progress of artificial intelligence, the technology has found application in many fields, such as security, city management, traffic management, building management, park management, face-based access and attendance, logistics management, warehouse management, robots, intelligent marketing, computational photography, mobile-phone imaging, cloud services, smart homes, wearable devices, unmanned and autonomous driving, intelligent healthcare, face payment, face unlocking, fingerprint unlocking, identity verification, smart screens, smart televisions, cameras, the mobile internet, live streaming, beauty and makeup applications, medical cosmetology, and intelligent temperature measurement.
The embodiments of the application are mainly applied to text recognition in images. Text recognition refers to the process by which an electronic device acquires an image containing character content and translates the character shapes in the image into computer text using a character recognition method. For example, printed text is scanned to obtain an image file, which is then analyzed and processed to obtain the text and layout information.
Of course, the foregoing is merely an exemplary listing of possible scenarios of the methods provided by the embodiments of the present application, and is not meant to limit the embodiments of the present application.
In the text recognition of the related art, a line recognition mode is adopted: line detection and segmentation first extract the region of each line of characters in the image to be recognized, and text recognition is then performed on each extracted line region to obtain the character recognition result. For example, the content of a document image may be divided into rows, such as a name row, a card-number row, and a date row; the related art first obtains each row region through line detection and segmentation, and then performs text recognition on each row region to obtain the character recognition result.
However, in the related art, the image to be recognized is often captured in a complex shooting environment and under non-uniform shooting conditions, so its quality is low (e.g., blur, shadows, occlusion, or reflections in the image), which greatly increases the difficulty of line recognition and reduces character recognition accuracy.
In addition, the text recognition model needs frequent updates to recognize new rare characters, but rare characters have many visually similar characters, so recognition accuracy on those visually similar characters drops after the model is updated with rare characters.
The embodiments of the application provide a text recognition method that performs text recognition in a single-character recognition mode rather than line recognition. Specifically, a whole-image training image containing multiple character maps is first constructed, single-character update images each containing a single character are segmented from the whole-image training image, and an initial model is trained with the single-character update images and their annotation information to obtain the first text recognition model. Compared with line recognition, text recognition through a first text recognition model trained in single-character mode omits the line detection and segmentation steps that are difficult to implement well, which mitigates the loss of recognition accuracy caused by complex shooting environments and irregular shooting.
In addition, in the embodiments of the application, updated rare characters can be introduced while training the first text recognition model, so that the model gains the ability to recognize them. When the models are deployed online, the first text recognition model and a second text recognition model that has not been updated with rare characters can be deployed together; both models perform text recognition on the same input, and their outputs are compared. If the two results agree, either result is output; if they differ, it is further judged whether the recognized content is a rare character: the result of the second text recognition model is output if it is not, and the result of the first text recognition model is output if it is. In this way the strength of the first text recognition model is used to recognize rare characters accurately, and the strength of the second text recognition model is used to recognize characters visually similar to rare characters accurately, solving the problem that recognition accuracy on visually similar characters drops after a model is updated with rare characters.
The embodiment of the application provides a text recognition method. Referring to fig. 1, the text recognition method includes steps 101-106.
Step 101, acquiring a character image containing character content.
In the embodiment of the application, the character image is an image on which character recognition is to be performed, and it may include character content. For example, a certificate image shot of a certificate can be used as a character image, in which the certificate's character content is the content to be recognized; likewise, a document image shot of a paper document can be used as a character image, in which the document's character content is the content to be recognized.
Step 102, segmenting single-character images from the character image, wherein each single-character image contains one character.
In the embodiment of the application, since the scheme adopts single-character recognition, after the character image is obtained, single-character images can be segmented from it and input into the subsequent models for text recognition; one single-character image contains the glyph of one character.
Specifically, single-character images are segmented from the character image as follows: a region-feature detection technique first detects and segments the regions where characters are located (such as paragraphs and lines), eliminating as much interference from non-character regions as possible; the region of each single character is then detected within those character regions and cut out to obtain the single-character images. This segmentation is only preliminary: the subsequent text recognition model further determines the region where the single character is located within each single-character image.
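The patent does not name a specific region-detection algorithm. As one plausible sketch, connected-component analysis with OpenCV can produce the preliminary per-character crops described above; the function name, threshold, and the choice of algorithm here are illustrative assumptions, not the patented method:

```python
import cv2

def preliminary_character_boxes(image, min_area=20):
    """Roughly cut single-character crops out of a text image.

    A minimal sketch: only preliminary boxes are needed here, because the
    recognition model later refines the character region itself. Merging
    multi-stroke components into one character box is omitted.
    """
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Otsu binarization copes reasonably with uneven illumination.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    crops = []
    for i in range(1, n):           # label 0 is the background
        x, y, w, h, area = stats[i]
        if area >= min_area:        # drop specks and noise
            crops.append(image[y:y + h, x:x + w])
    return crops
```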
Step 103, inputting the single-character images into a first text recognition model to obtain the first recognition result of each single-character image output by the first text recognition model, and inputting the single-character images into a second text recognition model to obtain the second recognition result of each single-character image output by the second text recognition model.
The first text recognition model is used for recognizing rare characters and is continuously updated as the rare-character lexicon is updated.
In the related art, a text recognition model needs frequent updates to recognize new rare characters, but rare characters have many visually similar characters, so recognition accuracy on those visually similar characters drops after the model is updated with rare characters.
In the embodiment of the application, updated rare characters can be introduced while training the first text recognition model so that it gains the ability to recognize them. When the models are deployed online, the first text recognition model and a second text recognition model that has not been updated with rare characters can be deployed together, and both perform text recognition on the same input: the first text recognition model outputs a first recognition result for each single-character image, and the second text recognition model outputs a second recognition result for each single-character image. Comparing the two results solves the problem that recognition accuracy on characters visually similar to rare characters drops after the model is updated to recognize rare characters.
Step 104, for each single-character image, if the corresponding first recognition result is the same as the second recognition result, taking the first recognition result or the second recognition result as the target recognition result of the single-character image.
Step 105, if the corresponding first recognition result differs from the second recognition result and the first recognition result indicates that the single character in the single-character image is a rare character, taking the first recognition result as the target recognition result of the single-character image.
Optionally, the method may further include:
Step 106, if the first recognition result and the second recognition result of the single-character image differ and the first recognition result indicates that the single character in the single-character image is not a rare character, taking the second recognition result as the target recognition result of the single-character image.
In the embodiment of the present application, referring to fig. 2, which shows the comparison logic for differing recognition results in steps 104-106, one branch is: when the first recognition result and the second recognition result of the single-character image are the same, either result can be taken as the target recognition result of the single-character image; since the two results agree, the current recognition result is considered final.
Another branch is: when the two results differ and the first recognition result indicates that the single character in the single-character image is a rare character, the first recognition result is taken as the target recognition result. This case means the first text recognition model, through its rare-character recognition capability, has recognized the current character as rare, so its result can be taken as final; the second text recognition model lacks rare-character recognition capability, so the second recognition result can be discarded.
The last branch is: when the two results differ and the first recognition result indicates that the single character is not a rare character, the second recognition result is taken as the target recognition result. In this case the first text recognition model has not recognized the current character as rare, yet the second model's result still differs from the first, so the current character is most likely a character visually similar to some rare character. Because the second text recognition model has not been updated with rare characters, its accuracy on such visually similar characters is higher; its result can therefore be taken as final, and the first recognition result discarded.
In the embodiment of the application, further referring to fig. 3, which shows the overall text recognition flow: S1, synthesize a whole-image training image containing character maps; S2, construct the initial model framework; S3, train the initial model to obtain the first text recognition model; S4, for the same input data, compare the first recognition result of the first text recognition model with the second recognition result of the second text recognition model; when the two results agree, execute S5 and output either result; when they differ and the character is not a rare character, execute S6 and output the second recognition result; and when they differ and the character is a rare character, execute S7 and output the first recognition result.
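The S4-S7 arbitration reduces to a short decision rule. A minimal sketch, assuming the rare-character lexicon is available as a set of characters (the names `arbitrate` and `rare_chars` are illustrative):

```python
def arbitrate(first_result: str, second_result: str,
              rare_chars: set) -> str:
    """Pick the target recognition result for one single-character image
    (S4-S7 above); `rare_chars` stands in for the rare-character lexicon."""
    if first_result == second_result:
        return first_result        # S5: both models agree
    if first_result in rare_chars:
        return first_result        # S7: rare character, trust the first model
    return second_result           # S6: likely look-alike, trust the second
```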
In summary, the text recognition method provided by the embodiment of the application performs text recognition in a single-character recognition mode rather than the line recognition mode. Compared with line recognition, performing text recognition through the first and second text recognition models in single-character mode omits the line detection and segmentation steps that are difficult to implement well, which mitigates the loss of recognition accuracy caused by complex shooting environments and irregular shooting. In addition, the first text recognition model can recognize rare characters. When the models are deployed online, the first text recognition model and a second text recognition model that has not been updated with rare characters can be deployed together; both models perform text recognition on the same input, and their outputs are compared. If the two results agree, either result is output; if they differ, it is further judged whether the recognized content is a rare character: the result of the second text recognition model is output if it is not, and the result of the first text recognition model is output if it is. In this way the strength of the first text recognition model is used to recognize rare characters accurately, and the strength of the second text recognition model is used to recognize characters visually similar to rare characters accurately, solving the problem that recognition accuracy on visually similar characters drops after a model is updated with rare characters.
Referring to fig. 4, a flowchart of a method for updating a text recognition model according to an embodiment of the present application is shown. As shown in fig. 4, the method includes steps 201-203.
Step 201, constructing a whole-image training image containing a plurality of character maps, wherein each character map represents one character and the represented characters include the rare characters to be updated.
Each character map has corresponding position information within the whole-image training image, and the character maps include rare-character maps.
In the embodiment of the present application, training the first text recognition model first requires preparing training data that includes the rare characters to be updated. Since the embodiment updates the model in single-character recognition mode, the training data may consist of a large number of single-character images, among them the single-character images corresponding to rare characters. To generate many single-character images conveniently and rapidly, the embodiment can construct a whole-image training image containing a plurality of character maps, including the maps corresponding to the rare characters to be updated, and then cut a large number of single-character images from it to obtain the training data.
Specifically, because real data is costly to obtain, the embodiment of the application can generate whole-image training images synthetically. In constructing a whole-image training image, a blank background image is first created; character maps are then randomly generated, one per character, according to a preset character list containing the rare characters; and after the character maps are added onto the blank background image, the whole-image training image is obtained. Any number of whole-image training images can be generated this way. In addition, the position information of each character map within the whole-image training image is recorded for subsequent training. A map here is a two-dimensional image on a layer different from the background image; adding it onto the background image synthesizes a whole image containing the map content.
It should be noted that, to give the first text recognition model the ability to recognize the updated rare characters, the embodiment of the application adds rare-character maps generated from the updated rare characters into the whole-image training image, so that during subsequent training the model learns the association between a rare character's glyph in the image and its recognition result, enabling the first text recognition model to recognize rare characters.
Furthermore, the embodiment of the application can apply data augmentation operations to the whole-image training images, which increases the number of whole-image training images and thus the number of training samples; the augmentation also increases the realism of the whole-image training images, improving subsequent training quality.
Step 202, segmenting corresponding single-character update images from the whole-image training image according to the position information of the character maps, and adding a corresponding annotated recognition result to each single-character update image.
In the embodiment of the application, because the position information of each character map within the whole-image training image is recorded while generating it, corresponding single-character update images can be randomly cut from the whole-image training image according to that position information, yielding a plurality of single-character update images, and a corresponding annotated recognition result is added to each.
The size of a character's single-character image can be slightly smaller than the size of that character's map, so the cut single-character image excludes as much of the interference surrounding the character content as possible, improving sample quality. The annotated recognition result added to a single-character update image is the correct class of the character it contains; for example, the annotated recognition result of a single-character update image cut for the character 'day' is 'day'. During later training, the annotated recognition result is combined with the model's output to compute a loss value, and the loss value is used with a loss function to adjust the model parameters.
For example, adding character maps to a background image of size 192×192 yields a whole-image training image of size 192×192, and segmenting it yields single-character update images of size 64×64.
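As a sketch of this cutting step under the stated 192×192 and 64×64 sizes, assuming the recorded box format is (x, y, width, height); the shrink margin and clamping policy are illustrative assumptions:

```python
import numpy as np

def cut_single_character(whole_img: np.ndarray, box, crop=64, shrink=2):
    """Cut a 64x64 single-character update image around a recorded box in a
    192x192 whole-image training image; the shrink margin (an assumption)
    keeps the label box slightly tighter than the character map."""
    x, y, w, h = box
    cx, cy = x + w // 2, y + h // 2
    H, W = whole_img.shape[:2]
    # Clamp so the crop window stays inside the training image.
    x0 = min(max(cx - crop // 2, 0), W - crop)
    y0 = min(max(cy - crop // 2, 0), H - crop)
    patch = whole_img[y0:y0 + crop, x0:x0 + crop].copy()
    label_box = (x - x0 + shrink, y - y0 + shrink,
                 w - 2 * shrink, h - 2 * shrink)  # box in crop coordinates
    return patch, label_box
```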
Step 203, training the first text recognition model with the single-character update images, their position information within the whole-image training image, and their annotated recognition results, to obtain an updated first text recognition model.
The first text recognition model is used for determining the recognition result of a single character, including rare characters, within a whole image.
In this embodiment of the present application, one piece of training data can be constructed as a correspondence among a single-character update image, its position information within the whole-image training image, and its annotated recognition result. After a piece of training data is input into the initial model, the model outputs a result for the single-character update image; this output is combined with the annotated recognition result to compute a loss value, and the loss value is used with a loss function to adjust the model parameters. In addition, from the single-character update image the initial model also derives a matrix vector corresponding to the region where the character content is located, and this matrix vector represents that region's position information.
The embodiment of the application provides a method for updating a text recognition model that performs text recognition in a single-character recognition mode rather than line recognition: a whole-image training image containing a plurality of character maps is first constructed, single-character update images each containing a single character are segmented from it, and the initial model is trained with the single-character update images and their annotation information to obtain the first text recognition model. Compared with line recognition, text recognition through a first text recognition model trained in single-character mode omits the line detection and segmentation steps that are difficult to implement well, mitigating the loss of recognition accuracy caused by complex shooting environments and irregular shooting.
The embodiment of the application provides a specific text recognition method. Referring to fig. 5, the method includes steps 301-310.
Step 301, acquiring a character image containing character content.
This step may refer to step 101 and is not repeated here.
Step 302, segmenting single-character images from the character image; each single-character image contains one character.
This step may refer to step 102 and is not repeated here.
The first text recognition model comprises a feature extraction layer, an affine transformation layer, and a prediction layer.
Step 303, downsampling the single-character image to obtain downsampled features.
Step 304, inputting the single-character image into the feature extraction layer of the first text recognition model to obtain the homography matrix features of the single-character image.
Step 305, inputting the homography matrix features of the single-character image and the downsampled features of the single-character image into the affine transformation layer to obtain the affine transformation matrix features of the single-character image.
The affine transformation matrix features are used to characterize the position information of the character region in the single-character image.
Optionally, the feature extraction layer comprises four 2x downsampling layers and two fully connected layers connected in sequence, and the prediction layer comprises three 2x downsampling layers and two fully connected layers connected in sequence.
Step 306, inputting the affine transformation matrix features of the single-character image into the prediction layer to obtain the first recognition result of the single-character image.
In this embodiment of the present application, for steps 303-306, to suit the single-character recognition mode, the model structure of the first text recognition model can be designed as three layers. The feature extraction layer extracts the homography matrix features of the input single-character image. A homography matrix characterizes the transformation between two planes, here the projection of the single-character image from the world coordinate system to the pixel coordinate system; put simply, the homography matrix features characterize the character's glyph after conversion from three-dimensional capture space onto the two-dimensional pixel plane.
The affine transformation layer performs an affine transformation on the downsampled features of the single-character image (obtained by downsampling the single-character image; downsampling reduces the image resolution, lowering subsequent processing cost while preserving feature accuracy) to obtain the affine transformation matrix features. An affine transformation is, in geometry, a linear transformation of one vector space followed by a translation into another vector space; it preserves straightness (straight lines stay straight and arcs stay arcs after transformation) and parallelism (relative positional relationships within the two-dimensional image are unchanged: parallel lines stay parallel, and the angle of intersecting lines is preserved). In this embodiment, applying the affine transformation to the downsampled features based on the homography matrix features yields the affine transformation matrix features, which characterize the position information of the character in the single-character image. That position information can be expressed as the minimum circumscribed rectangle of the character content in the image, meaning the affine transformation matrix features can accurately locate the character within the single-character image, which improves the accuracy of subsequent recognition.
Finally, the prediction layer obtains the recognition result of the character in the single-character image by applying downsampling and fully connected processing to the affine transformation matrix features.
During inference, after a single-character image is input into the first text recognition model, the model produces affine transformation matrix features characterizing the minimum circumscribed rectangle of the character, and the predicted recognition result is then obtained from those features.
In the embodiment of the application, referring to fig. 6, which shows the model structure: the first text recognition model has a feature extraction layer, an affine transformation layer, and a prediction layer; the feature extraction layer comprises four 2x downsampling layers and two fully connected layers connected in sequence, and the prediction layer comprises three 2x downsampling layers and two fully connected layers connected in sequence.
Based on fig. 6, when a single-character image of size 3×64×64 is input, the first downsampling layer of the feature extraction layer downsamples it and outputs features of size 4×32×32; the second downsampling layer continues on the 4×32×32 features and outputs features of size 16×16×16; the third downsampling layer continues on the 16×16×16 features and outputs features of size 16×8×8; the fourth downsampling layer continues on the 16×8×8 features and outputs features of size 24×4×4; the first fully connected layer processes the 24×4×4 features and outputs features of size 80; and the second fully connected layer processes the 80-dimensional features and outputs homography matrix features of size 3×3.
Further, after the affine transformation layer processes the 3×3 homography matrix features together with the 4×32×32 downsampled features of the single-character image, affine transformation matrix features of size 4×32×32 are obtained. The first downsampling layer of the prediction layer downsamples the 4×32×32 affine transformation matrix features and outputs features of size 32×16×16; the second downsampling layer continues on the 32×16×16 features and outputs features of size 48×8×8; the third downsampling layer continues on the 48×8×8 features and outputs features of size 96×4×4; the first fully connected layer processes the 96×4×4 features and outputs features of size 640; and the second fully connected layer processes the 640-dimensional features and outputs the predicted recognition result of size 8992.
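The figure fixes the layer counts and feature sizes but not the convolution configurations. A sketch in PyTorch, under the assumptions that each 2x downsampling layer is a stride-2 3×3 convolution with ReLU, and that the affine transformation layer warps the downsampled features with the predicted matrix via grid sampling (`grid_sample` accepts 2×3 affine matrices, so the top two rows of the 3×3 homography are used here as an approximation of a full projective warp):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def down2x(cin, cout):
    # One 2x downsampling layer; kernel size and activation are assumptions.
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                         nn.ReLU(inplace=True))

class FirstTextRecognitionModel(nn.Module):
    """Sketch of the fig. 6 structure; sizes are C x H x W as in the text."""
    def __init__(self, num_classes=8992):
        super().__init__()
        # Feature extraction: 3x64x64 -> 4x32x32 -> 16x16x16 -> 16x8x8
        # -> 24x4x4, then FC 384 -> 80 -> 9 (reshaped to a 3x3 matrix).
        self.extract = nn.Sequential(down2x(3, 4), down2x(4, 16),
                                     down2x(16, 16), down2x(16, 24))
        self.fc_h = nn.Sequential(nn.Flatten(), nn.Linear(24 * 4 * 4, 80),
                                  nn.ReLU(inplace=True), nn.Linear(80, 9))
        # Separate downsampling of the input image to 4x32x32 (assumed conv).
        self.down = down2x(3, 4)
        # Prediction: 4x32x32 -> 32x16x16 -> 48x8x8 -> 96x4x4,
        # then FC 1536 -> 640 -> num_classes (8992 in the text).
        self.pred = nn.Sequential(down2x(4, 32), down2x(32, 48),
                                  down2x(48, 96))
        self.fc_p = nn.Sequential(nn.Flatten(), nn.Linear(96 * 4 * 4, 640),
                                  nn.ReLU(inplace=True),
                                  nn.Linear(640, num_classes))

    def forward(self, x):                    # x: B x 3 x 64 x 64
        theta = self.fc_h(self.extract(x)).view(-1, 3, 3)
        feat = self.down(x)                  # B x 4 x 32 x 32
        # Affine transformation layer: warp the downsampled features with the
        # top two rows of the predicted 3x3 matrix.
        grid = F.affine_grid(theta[:, :2, :], feat.size(), align_corners=False)
        warped = F.grid_sample(feat, grid, align_corners=False)
        return self.fc_p(self.pred(warped)), theta
```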
Step 307, inputting the single-character images into the second text recognition model to obtain the second recognition result of each single-character image output by the second text recognition model.
This step may refer to step 103 and is not repeated here.
Step 308, for each single-character image, if the corresponding first recognition result is the same as the second recognition result, taking the first recognition result or the second recognition result as the target recognition result of the single-character image.
This step may refer to step 104 and is not repeated here.
Step 309, if the corresponding first recognition result differs from the second recognition result and the first recognition result indicates that the single character in the single-character image is a rare character, taking the first recognition result as the target recognition result of the single-character image.
This step may refer to step 105 and is not repeated here.
Step 310, if the first recognition result and the second recognition result of the single-character image differ and the first recognition result indicates that the single character in the single-character image is not a rare character, taking the second recognition result as the target recognition result of the single-character image.
This step may refer to step 106 and is not repeated here.
Optionally, the first text recognition model is updated through the following steps:
Step 311, obtaining a rare-character update set.
Step 312, constructing a whole-image training image containing character maps according to the rare-character update set, wherein the character maps include rare-character maps corresponding to the rare characters in the rare-character update set.
Step 313, segmenting the whole-image training image into a plurality of single-character update images according to the position information of the character maps within it, and adding a corresponding annotated recognition result to each single-character update image.
In this embodiment of the present application, referring to fig. 7, for a whole-image training image with character maps 20, corresponding single-character update images containing character content can be randomly cut from the whole-image training image according to the position information of the character maps 20. For example, referring further to fig. 8, which shows a single-character update image: assuming the whole-image training image of fig. 7 has size 192×192, cutting around the character 'cloud' in it yields a corresponding single-character update image 30 of size 64×64.
Step 314, updating the first text recognition model with the single-character update images, the position information corresponding to the character maps, and the annotated recognition results of the single-character update images.
Specifically, to improve update quality during the update training of the first text model, the embodiment of the present application can use two training loss values. The first loss value is computed between the position information represented by the affine transformation matrix features and the position information of the input single-character update image within the whole-image training image; the second loss value is computed between the predicted recognition result and the annotated recognition result of the single-character update image. Training the model parameters on the first loss value effectively improves the model's extraction accuracy for the minimum circumscribed rectangle containing the character in a single-character image, which indirectly improves subsequent recognition accuracy; training on the second loss value directly improves character recognition accuracy, improving training quality.
Model parameter training based on the first loss value may employ an L2 loss function (a regression loss); training based on the second loss value may employ a one-hot cross-entropy loss function (a classification loss: cross entropy measures the difference between predicted and true values).
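A sketch of this two-term objective; the loss weights are assumptions, and class-index cross entropy is used since it is mathematically equivalent to one-hot cross entropy:

```python
import torch.nn.functional as F

def training_loss(pred_boxes, gt_boxes, logits, labels, w_pos=1.0, w_cls=1.0):
    """Two-term objective: L2 regression between the position represented by
    the affine transformation matrix features and the recorded position, plus
    cross entropy between predicted and annotated classes."""
    loss_pos = F.mse_loss(pred_boxes, gt_boxes)   # first loss value (L2)
    loss_cls = F.cross_entropy(logits, labels)    # second loss value
    return w_pos * loss_pos + w_cls * loss_cls
```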
It should be noted that, in the embodiment of the present application, model training and whole-image training image synthesis may run synchronously. The single-character update images, their position information within the whole-image training image, and their annotated recognition results are stored in a preset cache area, rather than storing the synthesized whole-image training images themselves; the training process then extracts these cached items to train the initial model and obtain the first text recognition model. This avoids the extra time spent writing data to and reading data from storage, improving response speed and saving training time.
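A minimal sketch of this cache-and-consume idea, with a synthesizer thread filling an in-memory buffer that the training loop drains directly; all names are illustrative, and no images are written to disk:

```python
import threading
from collections import deque

class SampleBuffer:
    """In-memory cache of (crop, box, label) training tuples, so synthesis
    and training can run synchronously without disk I/O."""
    def __init__(self, maxlen=100_000):
        self.buf = deque(maxlen=maxlen)
        self.lock = threading.Lock()

    def put(self, crop, box, label):
        with self.lock:
            self.buf.append((crop, box, label))

    def take(self, n):
        with self.lock:
            k = min(n, len(self.buf))
            return [self.buf.popleft() for _ in range(k)]

# A synthesizer thread calls buffer.put(...) for each cut crop, while the
# training loop calls buffer.take(batch_size) to build each batch.
```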
In the embodiment of the application, given 2 million single-character update images, the training batch size (the number of images input simultaneously) can be set to 2048, training 100 epochs (one epoch being a pass over all samples) of 1000 steps (training iterations) each. The batch size, number of epochs, and number of steps may be adjusted based on actual requirements, which is not limited in the embodiments of the present application.
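Gathered as a configuration sketch, these are the hyperparameters stated above; the dictionary form and names are illustrative:

```python
# Hyperparameters from the text; all values are adjustable to requirements.
TRAIN_CONFIG = {
    "num_samples": 2_000_000,  # single-character update images
    "batch_size": 2048,        # images input simultaneously
    "epochs": 100,             # one epoch = one pass over all samples
    "steps_per_epoch": 1000,   # ~2,000,000 / 2048 iterations per epoch
}
```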
In the embodiment of the application, specifically, updating the first text recognition model so that it can recognize new rare characters first requires updating the training data: new whole-image training images containing the new rare-character maps are constructed, a plurality of single-character update images are segmented from them according to the position information of the character maps, and corresponding annotated recognition results are added to the single-character update images, completing the update of the training data. The first text recognition model is then trained with the single-character update images, the position information corresponding to the character maps, and the annotated recognition results, yielding an updated first text recognition model that can determine the recognition result of a single rare character within a whole image.
In the process of updating the first text recognition model, the pre-update first text recognition model can serve as a pre-trained model, and the model parameters of its feature extraction layer and affine transformation layer can be frozen so that only the prediction layer's parameters are optimized through fine-tuning. This effectively reduces update training time and avoids retraining the first text recognition model from scratch every time the rare characters are updated.
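Using the layer names from the model sketch above, the freeze-and-fine-tune setup might look like the following; the optimizer choice and learning rate are assumptions:

```python
import torch

def prepare_rare_char_finetune(model, lr=1e-3):
    """Freeze the feature extraction and affine-related parameters of the
    pre-update model; only the prediction layers stay trainable."""
    for module in (model.extract, model.fc_h, model.down):
        for p in module.parameters():
            p.requires_grad = False
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)  # optimizer/lr are assumptions
```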
Optionally, step 312 may specifically include:
Sub-step 3121, obtaining a character list comprising a plurality of characters, and adding the rare characters in the rare-character update set to the character list.
In the embodiment of the application, the characters in the character list are those the first recognition model is expected to recognize, and different characters can be added to the preset character list according to actual recognition requirements; when updated rare characters need to be recognized, the rare characters in the rare-character update set are added to the character list.
Sub-step 3122, constructing a blank background image.
The blank background image may be a custom image with a blank background and no other features. Its background pattern may vary (such as a solid white background or a light gray background); the background color may be determined based on actual requirements.
Sub-step 3123, randomly generating character maps, each comprising an individual character from the character list.
In the embodiment of the present application, since the character list contains a plurality of different characters, a character can be randomly selected from the list each time and a corresponding character map generated for it, finally yielding a plurality of different character maps.
Sub-step 3124, adding the character maps to the blank background image to obtain the whole-image training image and the position information corresponding to each character map.
In the embodiment of the application, the character maps may be added at random positions in the blank background image to obtain the whole-image training image and the position information corresponding to each character map.
For example, referring to fig. 7, a schematic diagram of a whole-image training image is shown, comprising a blank background image 10 and character maps 20 corresponding to the characters 'present', 'day', 'air', 'multiple', 'cloud', 'turn', and 'fine'. After the character maps 20 are generated, they may be added to the blank background image 10 to obtain the whole-image training image; while adding the character maps 20 to the blank background image 10, the position information of each character map 20 within the blank background image 10 may be recorded, and this position information may include the coordinates of the circumscribed box of the character content in the character map 20. It should be noted that a character map 20 need not have a visible border; the borders of the character maps 20 shown in fig. 7 are for illustration only.
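A sketch of this paste-and-record synthesis using Pillow, assuming characters are rendered from TrueType fonts; the font paths, size ranges, and placement margin are illustrative assumptions, and overlap checks are omitted:

```python
import random
from PIL import Image, ImageDraw, ImageFont

def synthesize_whole_image(char_list, font_paths, size=192, n_chars=7):
    """Paste randomly chosen characters onto a blank background and record
    each glyph's bounding box (the recorded position information)."""
    canvas = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(canvas)
    boxes = []
    for _ in range(n_chars):
        ch = random.choice(char_list)                 # may be a rare character
        font = ImageFont.truetype(random.choice(font_paths),
                                  random.randint(24, 48))
        x, y = random.randint(0, size - 56), random.randint(0, size - 56)
        draw.text((x, y), ch, fill="black", font=font)
        # textbbox returns the rendered glyph's box on the canvas; this is
        # the position information used for later cutting and training.
        boxes.append((ch, draw.textbbox((x, y), ch, font=font)))
    return canvas, boxes
```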
Optionally, the font size of the characters in the character map is within a preset font size range, and the font format of the characters in the character map is any one of a plurality of preset font formats.
In the embodiment of the application, in order to improve the authenticity of the training samples and thus the robustness of the subsequent training process, corresponding parameters may be defined for the character map generation process. For example, the font size of the characters in the generated character map may be restricted to a preset font size range, and the font format may be set to any one of a plurality of preset font formats. Such settings increase the randomness of the samples and yield more realistic training samples.
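For illustration only, sub-steps 3121 to 3124, together with the font randomization just described, might be sketched with Pillow as follows. The canvas size, character spacing and font paths are assumptions; any installed TrueType fonts could stand in for the preset font formats.

```python
import random
from PIL import Image, ImageDraw, ImageFont

def build_whole_map_image(char_list, canvas_size=(640, 64),
                          font_paths=("simsun.ttc", "simhei.ttf"),  # assumed fonts
                          size_range=(24, 48), n_chars=8):
    canvas = Image.new("RGB", canvas_size, "white")  # sub-step 3122: blank background map
    draw = ImageDraw.Draw(canvas)
    boxes = []  # position information: (character, circumscribed frame)
    x = 5
    for _ in range(n_chars):
        ch = random.choice(char_list)                # sub-step 3123: random character
        font = ImageFont.truetype(random.choice(font_paths),
                                  random.randint(*size_range))
        bbox = draw.textbbox((x, 5), ch, font=font)
        draw.text((x, 5), ch, font=font, fill="black")  # sub-step 3124: add to background
        boxes.append((ch, bbox))
        x = bbox[2] + random.randint(2, 10)
    return canvas, boxes
```

The recorded boxes serve both as the position information corresponding to the character maps and as the basis for cropping single character updated images with known labels.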
Optionally, after sub-step 3124, further comprising:
Sub-step 3125, performing different image amplification processing on the image obtained after adding the character map, to obtain a plurality of whole-map training images.
Wherein the image amplification process includes: image flipping, image rotation, image warping, image affine transformation, image scaling, image compression, image contrast adjustment, brightness adjustment, chromaticity adjustment, saturation adjustment, color dithering, noise addition, image blurring, image region random erasure, style conversion.
In the embodiment of the application, a large number of training samples are needed to achieve the training target, which in turn requires a large number of whole-map training images. Because acquiring a large amount of real image data is costly and the number of synthesized images is limited, the embodiment of the application can increase the number of training samples by performing image amplification processing on the synthesized whole-map training images.
Specifically, image amplification does not actually increase the number of original images; it only performs transformations on the original images to create new images. For example, one whole-map training image synthesized by the embodiment of the application can be subjected to image flipping and image warping operations, and the resulting new image used as a new whole-map training image; for another synthesized whole-map training image, noise can be added, and the resulting image used as another new whole-map training image. After the original whole-map training images are subjected to multiple image amplification processes, the total number of whole-map training images is greatly increased. The type and number of image amplification means applied to one original whole-map training image are not limited.
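For illustration only, a few of the listed amplification operations might be sketched with Pillow and NumPy as follows; the parameter values (rotation angle, noise level, and so on) are assumptions chosen for readability, and each returned image would be treated as a new whole-map training image.

```python
import numpy as np
from PIL import Image, ImageEnhance, ImageFilter

def amplify(img: Image.Image) -> list:
    out = [
        img.transpose(Image.Transpose.FLIP_LEFT_RIGHT),   # image flipping
        img.rotate(5, expand=True, fillcolor="white"),    # image rotation
        ImageEnhance.Contrast(img).enhance(1.4),          # contrast adjustment
        ImageEnhance.Brightness(img).enhance(0.8),        # brightness adjustment
        img.filter(ImageFilter.GaussianBlur(radius=1)),   # image blurring
    ]
    # Noise addition: perturb pixel values with Gaussian noise.
    arr = np.asarray(img).astype(np.float32)
    noisy = np.clip(arr + np.random.normal(0.0, 10.0, arr.shape), 0, 255)
    out.append(Image.fromarray(noisy.astype(np.uint8)))
    return out
```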
The embodiment of the application provides a text recognition method that performs text recognition in a single-character recognition mode rather than a line recognition mode. Compared with the line recognition mode, text recognition can be performed through a first text recognition model and a second text recognition model that both adopt the single-character recognition mode, and the line detection and segmentation links, which are difficult to implement, are omitted from the whole process; this solves the problem of reduced recognition precision caused by complex shooting environments and irregular shooting. In addition, the first text recognition model has the capability of recognizing rarely used words. When the models are deployed online, the first text recognition model and the second text recognition model (which is not updated with rarely used words) can be deployed at the same time and used to recognize the same input content in parallel. If the two models produce the same recognition result, either result is output; if the results differ, it is further judged whether the recognized content belongs to a rarely used word: if not, the recognition result of the second text recognition model is output, and if so, the recognition result of the first text recognition model is output. In this way, the application can accurately recognize rarely used words by utilizing the advantages of the first text recognition model, and can accurately recognize near-form words of rarely used words by utilizing the second text recognition model, thereby solving the problem that the recognition accuracy of near-form words decreases after the model is updated frequently.
Fig. 9 is a schematic structural view of a text recognition device of some embodiments of the present application. As shown in fig. 9, the text recognition apparatus may include: a processing module 401 and an acquisition module 402.
The acquisition module 402 is configured to: acquiring a character image containing character content;
the processing module 401 is configured to: segment single character images from the character image, wherein each of the single character images contains one character; input the single character images into a first text recognition model to obtain a first recognition result of each single character image output by the first text recognition model, and input the single character images into a second text recognition model to obtain a second recognition result of each single character image output by the second text recognition model, wherein the first text recognition model is used for recognizing rarely used words and is updated continuously as the rarely used word library is updated; for each single character image, if the corresponding first recognition result is the same as the second recognition result, take the first recognition result or the second recognition result as the target recognition result of the single character image; and if the corresponding first recognition result is different from the second recognition result and the first recognition result indicates that the single character in the single character image is a rarely used word, take the first recognition result as the target recognition result of the single character image.
Optionally, the processing module 401 is further configured to:
and if the first recognition result and the second recognition result of the single character image are different, and the first recognition result indicates that the single character in the single character image is not a rarely used word, the second recognition result is used as the target recognition result of the single character image. Optionally, the font size of the characters in the character map is within a preset font size range, and the font format of the characters in the character map is any one of a plurality of preset font formats.
Optionally, the first text recognition model includes: a feature extraction layer, an affine transformation layer and a prediction layer. The processing module 401 is specifically configured to: downsample the single character image to obtain downsampling features; input the single character image into the feature extraction layer of the first text recognition model to obtain homography matrix features of the single character image; input the homography matrix features of the single character image and the downsampling features of the single character image into the affine transformation layer to obtain affine transformation matrix features of the single character image, where the affine transformation matrix features are used for representing position information of the character area in the single character image; and input the affine transformation matrix features of the single character image into the prediction layer to obtain a first recognition result of the single character image.
Optionally, the feature extraction layer comprises four 2× downsampling layers and two fully connected layers which are sequentially connected; the prediction layer comprises three 2× downsampling layers and two fully connected layers which are sequentially connected.
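For illustration only, one plausible reading of this layer layout is a spatial-transformer-style network; the following PyTorch sketch treats the homography matrix features as 2×3 affine parameters. The channel widths, the assumed 32×32 input, and the use of affine_grid/grid_sample (applied to the input rather than to separately downsampled features) are assumptions not specified by the application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def down2x(cin, cout):
    # One "2x downsampling layer": stride-2 convolution plus nonlinearity.
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1), nn.ReLU())

class FirstTextRecognitionModel(nn.Module):
    def __init__(self, num_classes, in_ch=1, size=32):
        super().__init__()
        # Feature extraction layer: four 2x downsampling layers + two fully
        # connected layers, read here as regressing 2x3 affine parameters
        # (the "homography matrix features").
        self.feature_extractor = nn.Sequential(
            down2x(in_ch, 32), down2x(32, 64), down2x(64, 128), down2x(128, 128),
            nn.Flatten(),
            nn.Linear(128 * (size // 16) ** 2, 64), nn.ReLU(),
            nn.Linear(64, 6),  # in practice initialized near the identity transform
        )
        # Prediction layer: three 2x downsampling layers + two fully connected layers.
        self.prediction_layer = nn.Sequential(
            down2x(in_ch, 32), down2x(32, 64), down2x(64, 128),
            nn.Flatten(),
            nn.Linear(128 * (size // 8) ** 2, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        theta = self.feature_extractor(x).view(-1, 2, 3)
        # Affine transformation layer: warp the image so the character
        # region is aligned before classification.
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        aligned = F.grid_sample(x, grid, align_corners=False)
        return self.prediction_layer(aligned)
```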
Optionally, the processing module 401 is specifically configured to:
acquiring a rarely used word update set;
constructing a whole-map training image containing a character map according to the rarely used word update set; wherein the character map comprises a rarely used word map; the rarely used word map corresponds to the rarely used words in the rarely used word update set;
according to the position information of the character map in the whole-map training image, a plurality of single-character updated images are obtained by segmentation from the whole-map training image, and corresponding labeling recognition results are added for the single-character updated images;
and updating the first text recognition model by using the single character updated images, the position information corresponding to the character map, and the labeling recognition results of the single character updated images.
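For illustration only, the segmentation step can reuse the kind of position information recorded during synthesis (for example, the boxes returned by build_whole_map_image in the earlier sketch); the pairing below yields labeled single character updated images, with names chosen for this sketch only.

```python
def segment_single_char_updates(whole_map, boxes):
    # Each recorded circumscribed frame is cropped into a single character
    # updated image; the character itself is its labeling recognition result.
    return [(whole_map.crop(frame), ch) for ch, frame in boxes]
```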
Optionally, the processing module 401 is specifically configured to:
performing different image amplification processing on the image obtained after the character map is added, so as to obtain a plurality of whole-map training images; wherein the image amplification processing includes: image flipping, image rotation, image warping, image affine transformation, image scaling, image compression, image contrast adjustment, brightness adjustment, chromaticity adjustment, saturation adjustment, color dithering, noise addition, image blurring, image region random erasure, and style conversion.
In summary, the text recognition device provided by the embodiment of the application performs text recognition in a single-character recognition mode rather than a line recognition mode. Compared with the line recognition mode, text recognition can be performed through a first text recognition model and a second text recognition model that both adopt the single-character recognition mode, and the line detection and segmentation links, which are difficult to implement, are omitted from the whole process; this solves the problem of reduced recognition precision caused by complex shooting environments and irregular shooting. In addition, the first text recognition model has the capability of recognizing rarely used words. When the models are deployed online, the first text recognition model and the second text recognition model (which is not updated with rarely used words) can be deployed at the same time and used to recognize the same input content in parallel. If the two models produce the same recognition result, either result is output; if the results differ, it is further judged whether the recognized content belongs to a rarely used word: if not, the recognition result of the second text recognition model is output, and if so, the recognition result of the first text recognition model is output. In this way, the device can accurately recognize rarely used words by utilizing the advantages of the first text recognition model, and can accurately recognize near-form words of rarely used words by utilizing the second text recognition model, thereby solving the problem that the recognition accuracy of near-form words decreases after the model is updated frequently.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
In addition, referring to fig. 10, the embodiment of the present application further provides an electronic device. The electronic device 700 includes a processor 710, a memory 720, and a computer program stored in the memory 720 and capable of running on the processor 710. When executed by the processor 710, the computer program implements the respective processes of the foregoing text recognition method embodiments and can achieve the same technical effects; to avoid repetition, details are not repeated herein.
The embodiment of the application further provides a computer readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements each process of the above text recognition method embodiments and can achieve the same technical effects; to avoid repetition, details are not repeated herein. The computer readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.
The embodiments of the present application further provide a computer program product. The computer program product includes a computer program which, when executed by a processor, implements each process of the above text recognition method embodiments and can achieve the same technical effects; to avoid repetition, details are not repeated herein.
It will be apparent to those skilled in the art that embodiments of the present application may be provided as a method, an apparatus, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications of those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the appended claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the present application.
Finally, it is further noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or terminal device comprising that element.
The text recognition method, the text recognition device, the electronic equipment and the computer storage medium provided by the application have been described in detail above. Specific examples are used herein to illustrate the principles and implementation of the application, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, those skilled in the art may make modifications to the specific embodiments and the application scope in accordance with the ideas of the application; in view of the above, the contents of this description should not be construed as limiting the application.

Claims (10)

1. A method of text recognition, comprising:
acquiring a character image containing character content;
segmenting single character images from the character image; each of the single character images contains one character;
inputting the single character image into a first text recognition model to obtain a first recognition result of each single character image output by the first text recognition model, and inputting the single character image into a second text recognition model to obtain a second recognition result of each single character image output by the second text recognition model; wherein the first text recognition model is used for recognizing rarely used words, and is updated continuously as the rarely used word library is updated;
for each single character image, if the corresponding first recognition result is the same as the second recognition result, taking the first recognition result or the second recognition result as a target recognition result of the single character image;
and if the corresponding first recognition result is different from the second recognition result and the first recognition result indicates that the single character in the single character image is a rarely used word, taking the first recognition result as a target recognition result of the single character image.
2. The method according to claim 1, wherein the method further comprises:
and if the first recognition result and the second recognition result of the single character image are different, and the first recognition result indicates that the single character in the single character image is not a rarely used word, the second recognition result is used as the target recognition result of the single character image.
3. The method according to claim 1 or 2, wherein the first text recognition model comprises: a feature extraction layer, an affine transformation layer and a prediction layer;
the step of inputting the single character image into a first text recognition model to obtain a first recognition result of each single character image output by the first text recognition model comprises the following steps:
downsampling the single character image to obtain downsampling features of the single character image;
inputting the single character image into the feature extraction layer of the first text recognition model to obtain homography matrix features of the single character image;
inputting the homography matrix features of the single character image and the downsampling features of the single character image into the affine transformation layer to obtain affine transformation matrix features of the single character image; the affine transformation matrix features are used for representing position information of a character area in the single character image;
and inputting the affine transformation matrix features of the single character image into the prediction layer to obtain the first recognition result of the single character image.
4. A method according to claim 3, wherein the feature extraction layer comprises four 2× downsampling layers and two fully connected layers which are sequentially connected;
the prediction layer comprises three 2× downsampling layers and two fully connected layers which are sequentially connected.
5. The method of claim 1, wherein the first text recognition model is updated by:
acquiring a rarely used word update set;
constructing a whole-map training image containing a character map according to the rarely used word update set; wherein the character map comprises a rarely used word map; the rarely used word map corresponds to the rarely used words in the rarely used word update set;
according to the position information of the character map in the whole-map training image, a plurality of single-character updated images are obtained by segmentation from the whole-map training image, and corresponding labeling recognition results are added for the single-character updated images;
and updating the first text recognition model by using the single character updated images, the position information corresponding to the character map, and the labeling recognition results of the single character updated images.
6. The method of claim 5, wherein the constructing a whole-map training image containing a character map according to the rarely used word update set comprises:
acquiring a character list comprising a plurality of characters, and adding the rarely used words in the rarely used word update set into the character list;
constructing a blank background map;
randomly generating a character map comprising individual characters in the character list;
and adding the character map to the blank background map to obtain the whole-map training image and the position information corresponding to the character map.
7. The method of claim 6, wherein after the adding the character map to the blank background map, the constructing a whole-map training image containing a character map according to the rarely used word update set further comprises:
performing different image amplification processing on the image obtained after adding the character map, to obtain a plurality of whole-map training images; wherein the image amplification processing includes: image flipping, image rotation, image warping, image affine transformation, image scaling, image compression, image contrast adjustment, brightness adjustment, chromaticity adjustment, saturation adjustment, color dithering, noise addition, image blurring, image region random erasure, and style conversion.
8. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the text recognition method according to any one of claims 1 to 7.
9. An electronic device comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the text recognition method according to any one of claims 1 to 7.
10. A computer program product, characterized in that it has stored thereon a computer program which, when executed by a processor, implements the steps of the text recognition method according to any of claims 1 to 7.