CN111368841A - Text recognition method, device, equipment and storage medium - Google Patents

Text recognition method, device, equipment and storage medium

Info

Publication number
CN111368841A
CN111368841A
Authority
CN
China
Prior art keywords
text
picture
recognized
features
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010127912.0A
Other languages
Chinese (zh)
Inventor
章放
邹雨晗
杨海军
徐倩
杨强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd filed Critical WeBank Co Ltd
Priority to CN202010127912.0A
Publication of CN111368841A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/14 Image acquisition
    • G06V 30/148 Segmentation of character regions
    • G06V 30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a text recognition method, apparatus, device, and storage medium. The method comprises: acquiring a text picture to be recognized; calling a text feature extraction structure in a pre-trained text recognition model to extract the text features of the text picture to be recognized; calling a picture type feature extraction structure in the text recognition model to extract the picture type features of the text picture to be recognized; and calling a recognition structure in the text recognition model to obtain, based on the text features and the picture type features, a text recognition result for the text picture to be recognized. The method and apparatus can accurately recognize the characters in a text even when text pictures are of different types, thereby improving the versatility of the text recognition model, i.e., the versatility of an OCR model.

Description

Text recognition method, device, equipment and storage medium
Technical Field
The present invention relates to the field of computer vision technologies, and in particular, to a text recognition method, apparatus, device, and storage medium.
Background
OCR (Optical Character Recognition) refers to the process of scanning text material, analyzing the resulting image file, and obtaining its text and layout information. Current OCR practice typically trains and uses a different OCR model for each type of text picture, such as identification cards, vehicle registration cards, driving licenses, and documents, rather than a unified OCR model. Because a current OCR model generally achieves a good recognition rate for only a specific type of text picture, a dedicated OCR model is trained and used for each specific type, and whenever a new type of text picture appears, an OCR model must be retrained or even redesigned to handle it; that is, such OCR models have poor versatility.
Disclosure of Invention
The main object of the present invention is to provide a text recognition method, apparatus, device, and storage medium, aiming to solve the problem that an existing OCR model has a good recognition rate for only one type of text picture and poor versatility across different types of text pictures.
In order to achieve the above object, the present invention provides a text recognition method, including the steps of:
acquiring a text picture to be recognized;
calling a text feature extraction structure in a text recognition model to extract the text features of the text picture to be recognized, wherein the text recognition model is obtained by pre-training;
calling a picture type feature extraction structure in the text recognition model to extract the picture type features of the text picture to be recognized;
and calling a recognition structure in the text recognition model to obtain, based on the text features and the picture type features, a text recognition result for the text picture to be recognized.
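The four steps above can be sketched as a minimal pipeline. All shapes, the random stand-in weights, and the digit charset below are illustrative assumptions for the sketch, not the patent's actual structures:

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_text_features(picture, num_width_units=8, dim=32):
    # Text feature extraction structure: one sub-text feature
    # vector per width unit (hypothetical shapes, random values).
    return rng.standard_normal((num_width_units, dim))

def extract_picture_type_features(text_features):
    # Picture type feature extraction structure: here, a simple
    # summary of the sub-text features across width units.
    return text_features.mean(axis=0)

def recognize(text_features, type_features, charset="0123456789"):
    # Recognition structure: score every charset entry per width
    # unit, modulate by the picture type features, pick the best.
    scores = text_features @ rng.standard_normal((text_features.shape[1], len(charset)))
    scores = scores * (type_features @ rng.standard_normal((type_features.shape[0], len(charset))))
    return "".join(charset[i] for i in scores.argmax(axis=1))

picture = rng.standard_normal((32, 128))                 # text picture to be recognized
text_feats = extract_text_features(picture)              # step 2
type_feats = extract_picture_type_features(text_feats)   # step 3
result = recognize(text_feats, type_feats)               # step 4
```

The point of the sketch is the data flow: the recognition structure consumes both feature sets, so the picture type features influence every character decision.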
Optionally, the step of calling a picture type feature extraction structure in the text recognition model to extract the picture type feature of the text picture to be recognized includes:
and calling the picture type feature extraction structure to extract comprehensive features of the text picture to be recognized spanning multiple width units, and taking the comprehensive features as the picture type features of the text picture to be recognized.
Optionally, the text features include sub-text features respectively corresponding to each width unit of the text picture to be recognized,
the step of calling the picture type feature extraction structure to extract the comprehensive features of the text picture to be recognized across a plurality of width units comprises the following steps:
and calling the picture type feature extraction structure to perform feature comprehensive processing on each sub-text feature of the text picture to be recognized in multiple width units to obtain comprehensive features of the text picture to be recognized across multiple width units.
Optionally, the recognition structure includes a first fully connected layer and a second fully connected layer, and the step of calling the recognition structure in the text recognition model to obtain the text recognition result of the text picture to be recognized based on the text features and the picture type features includes:
inputting the sub-text feature corresponding to each single width unit into the first fully connected layer to obtain a first vector;
inputting the picture type features into the second fully connected layer to obtain a second vector, wherein the first vector and the second vector have the same dimension;
multiplying the first vector and the second vector element-wise to obtain a third vector;
and determining the text recognition result of the text picture to be recognized according to the third vector corresponding to each width unit.
Optionally, the recognition structure further includes a decoder, and the step of determining the text recognition result of the text picture to be recognized according to the third vector corresponding to each width unit includes:
respectively inputting the third vector corresponding to each width unit into the decoder to obtain a target character corresponding to each width unit, and taking the target characters as the text recognition result of the text picture to be recognized.
Optionally, the step of calling the picture type feature extraction structure to perform feature comprehensive processing on each sub-text feature of the text picture to be recognized in multiple width units to obtain a comprehensive feature of the text picture to be recognized across multiple width units includes:
and calling the picture type feature extraction structure to calculate an average value of each sub-text feature of the text picture to be recognized in multiple width units, so as to obtain the comprehensive feature of the text picture to be recognized across multiple width units.
Optionally, the step of acquiring a text picture to be recognized includes:
calling a preset text detection model to detect the text position in the picture to be recognized;
and cutting the picture to be recognized based on the text position to obtain the text picture to be recognized containing one line or one column of text characters.
In order to achieve the above object, the present invention further provides a text recognition apparatus, including:
the acquisition module is used for acquiring a text picture to be recognized;
the text feature extraction module is used for calling a text feature extraction structure in a text recognition model to extract the text features of the text picture to be recognized, wherein the text recognition model is obtained by pre-training;
the picture type feature extraction module is used for calling a picture type feature extraction structure in the text recognition model to extract the picture type features of the text picture to be recognized;
and the recognition module is used for calling a recognition structure in the text recognition model to recognize and obtain a text recognition result of the text picture to be recognized based on the text feature and the picture type feature.
In order to achieve the above object, the present invention also provides a text recognition apparatus, including: a memory, a processor and a text recognition program stored on the memory and executable on the processor, the text recognition program when executed by the processor implementing the steps of the text recognition method as described above.
Furthermore, to achieve the above object, the present invention also provides a computer readable storage medium having a text recognition program stored thereon, which when executed by a processor implements the steps of the text recognition method as described above.
In the present invention, a text recognition model is trained in advance; the text feature extraction structure in the model is called to extract the text features of the text picture to be recognized, the picture type feature extraction structure is called to extract its picture type features, and the recognition structure is called to obtain the text recognition result based on the text features and the picture type features. The picture type features thus become a factor influencing the recognition result, so even when text pictures differ in type, and therefore in font, background, and so on, the characters in the text can be accurately recognized. This improves the versatility of the text recognition model, i.e., the versatility of the OCR model. Because the model is more versatile, there is no need to train a separate model for each type of text picture; only one universal model needs to be trained, which improves training efficiency and also broadens the usable training data.
Drawings
FIG. 1 is a schematic diagram of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a text recognition method according to a first embodiment of the present invention;
FIG. 3 is a block diagram of a text recognition apparatus according to a preferred embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, fig. 1 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present invention.
It should be noted that, in the embodiment of the present invention, the text recognition device may be a smart phone, a personal computer, a server, and the like, and is not limited specifically here.
As shown in fig. 1, the text recognition apparatus may include: a processor 1001, such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, a communication bus 1002. Wherein a communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a Display screen (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface, a wireless interface. The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the device configuration shown in FIG. 1 does not constitute a limitation of text recognition devices, and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and a text recognition program. Among them, the operating system is a program that manages and controls the hardware and software resources of the device, supporting the execution of the text recognition program and other software or programs.
In the device shown in fig. 1, the user interface 1003 is mainly used for data communication with a client; the network interface 1004 is mainly used for establishing communication connection with a server; and the processor 1001 may be configured to invoke a text recognition program stored in the memory 1005 and perform the following operations:
acquiring a text picture to be identified;
calling a text feature extraction structure in a text recognition model to extract text features of the text picture to be recognized, wherein the text recognition model is obtained by pre-training;
calling a picture type feature extraction structure in the text recognition model to extract picture type features of the text picture to be recognized;
and calling a recognition structure in the text recognition model to obtain, based on the text features and the picture type features, a text recognition result for the text picture to be recognized.
Further, the step of calling a picture type feature extraction structure in the text recognition model to extract the picture type features of the text picture to be recognized comprises:
and calling the picture type feature extraction structure to extract comprehensive features of the text picture to be recognized spanning multiple width units, and taking the comprehensive features as the picture type features of the text picture to be recognized.
Further, the text features comprise sub-text features respectively corresponding to each width unit of the text picture to be recognized,
the step of calling the picture type feature extraction structure to extract the comprehensive features of the text picture to be recognized across a plurality of width units comprises the following steps:
and calling the picture type feature extraction structure to perform feature comprehensive processing on each sub-text feature of the text picture to be recognized in multiple width units to obtain comprehensive features of the text picture to be recognized across multiple width units.
Further, the recognition structure includes a first fully connected layer and a second fully connected layer, and the step of calling the recognition structure in the text recognition model to obtain the text recognition result of the text picture to be recognized based on the text features and the picture type features includes:
inputting the sub-text feature corresponding to each single width unit into the first fully connected layer to obtain a first vector;
inputting the picture type features into the second fully connected layer to obtain a second vector, wherein the first vector and the second vector have the same dimension;
multiplying the first vector and the second vector element-wise to obtain a third vector;
and determining the text recognition result of the text picture to be recognized according to the third vector corresponding to each width unit.
Further, the recognition structure further includes a decoder, and the step of determining the text recognition result of the text picture to be recognized according to the third vector corresponding to each width unit includes:
respectively inputting the third vector corresponding to each width unit into the decoder to obtain a target character corresponding to each width unit, and taking the target characters as the text recognition result of the text picture to be recognized.
Further, the step of calling the picture type feature extraction structure to perform feature comprehensive processing on each sub-text feature of the text picture to be recognized in multiple width units to obtain comprehensive features of the text picture to be recognized across multiple width units includes:
and calling the picture type feature extraction structure to calculate an average value of each sub-text feature of the text picture to be recognized in multiple width units, so as to obtain the comprehensive feature of the text picture to be recognized across multiple width units.
Further, the step of acquiring the text picture to be recognized includes:
calling a preset text detection model to detect the text position in the picture to be recognized;
and cutting the picture to be recognized based on the text position to obtain the text picture to be recognized containing one line or one column of text characters.
Based on the above structure, various embodiments of the text recognition method are proposed.
Referring to fig. 2, fig. 2 is a flowchart illustrating a text recognition method according to a first embodiment of the present invention.
Although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in a different order. The executing subject of each embodiment of the text recognition method of the present invention may be a device such as a smart phone, a personal computer, or a server; for convenience of description, the executing subject is omitted in the following embodiments. In this embodiment, the text recognition method includes:
step S10, acquiring a text picture to be recognized;
the text picture may be a picture in which text (characters, numbers, characters, or the like, hereinafter collectively referred to as characters) exists in the picture, and may be a picture obtained by shooting or scanning various certificates or paper documents. The text picture to be recognized is the text picture of the text in the picture to be recognized. It should be noted that the text picture to be recognized may be a small picture containing one, one or one column of text, for example, a small picture containing one row of numbers that is cut out by the user through a picture editing tool; it may also be a large picture containing multiple rows or columns, for example, a picture obtained by scanning a paper document; the image may also include a text region and a non-text region, for example, a picture obtained by scanning the front surface of the identification card, where the text region includes "name", "gender", and the like, and also includes a blank non-text region.
There are various ways to acquire the text picture to be recognized, and they may differ across application scenarios: for example, a pre-stored text picture may be read, a paper document may be scanned, or a certificate may be photographed.
Further, step S10 includes:
step S101, calling a preset text detection model to detect a text position in a picture to be recognized;
and S102, cutting the picture to be recognized based on the text position to obtain the text picture to be recognized containing one line or one column of text characters.
The process of acquiring the text picture to be recognized may be as follows. For a picture to be recognized that contains a text region and a non-text region, a preset text detection model may be called to detect the text position in the picture, i.e., the position of the text region within it; the text detection model may be any commonly used existing text detection model, which is not described further here. The picture to be recognized is then cut according to the text position, and the text region is cut out to obtain a text picture to be recognized containing one line or one column of text characters.
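The detect-then-cut procedure of steps S101 and S102 can be sketched as follows. `detect_text_positions` is a hypothetical stand-in for the preset text detection model (the patent does not specify which detector is used); the boxes it returns here are fixed toy values:

```python
import numpy as np

def detect_text_positions(picture):
    # Hypothetical stand-in for the preset text detection model:
    # returns bounding boxes (top, bottom, left, right) of text lines.
    # A real detector would be called here instead.
    h, w = picture.shape[:2]
    return [(0, h // 2, 0, w), (h // 2, h, 0, w)]

def cut_text_lines(picture, boxes):
    # Cut the picture to be recognized along each detected box,
    # yielding one text picture per line of text.
    return [picture[t:b, l:r] for (t, b, l, r) in boxes]

page = np.zeros((64, 200), dtype=np.uint8)   # scanned page (toy array)
boxes = detect_text_positions(page)          # step S101
lines = cut_text_lines(page, boxes)          # step S102
```

Each element of `lines` is then a single-line text picture ready for the recognition model.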
Step S20, calling a text feature extraction structure in a text recognition model to extract the text features of the text picture to be recognized, wherein the text recognition model is obtained by pre-training;
In this embodiment, after the text picture to be recognized is obtained, a text recognition model may be used to perform text recognition on it. The text recognition model is trained in advance. Specifically, a large number of text pictures, together with labels for the texts in them, may be used to train the model; the training itself may follow existing machine learning procedures, which are not detailed here. It should be noted that the text pictures used for training may be of multiple picture types; when different types of text pictures are used for training, the trained model can achieve a good recognition effect on those different types, which improves the versatility of the text recognition model. Picture types may be distinguished by the type of text in the picture, such as Chinese, English, or digits, or by how the picture was obtained; for example, pictures obtained by scanning paper documents and identity documents belong to different types.
In this embodiment, the text recognition model may include a text feature extraction structure, a picture type feature extraction structure, and a recognition structure. The text feature extraction structure is used to extract the text features in a text picture. It is understood that the text features characterize the text in the picture, and different texts obviously have different features; for example, the shapes of the digits 1 and 2 clearly differ. The extracted text features may be represented in any form, such as a vector or a matrix, whose elements may represent features of the text in various aspects, such as shape, color, or size. The text feature extraction structure may include at least a backbone network, which may adopt a common feature extraction network such as VGG (Visual Geometry Group), ResNet (residual network), or DenseNet. The text feature extraction structure may further include a head network connected after the backbone network, which may adopt a recurrent neural network such as an RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory), BiLSTM (Bi-directional Long Short-Term Memory), or GRU (Gated Recurrent Unit). Because the text in a text picture has contextual correlation, processing the sequential input with such a head network makes the extracted text features more accurate.
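The backbone-plus-head arrangement can be sketched at the level of tensor shapes. Both functions below are deliberately crude stand-ins (the real structures would be a trained CNN and a recurrent network); the width-unit count, feature dimension, and neighbour-averaging "recurrence" are all assumptions made for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

def backbone(picture, num_width_units=10, dim=16):
    # Stand-in for a VGG/ResNet/DenseNet backbone: collapses the
    # height axis and pools the width axis into fixed width units,
    # leaving one feature vector per width unit.
    h, w = picture.shape
    cols = picture.mean(axis=0)                              # (w,)
    units = cols.reshape(num_width_units, -1).mean(axis=1)   # (T,)
    return units[:, None] * rng.standard_normal((1, dim))    # (T, dim)

def recurrent_head(feats):
    # Stand-in for the recurrent head (RNN/LSTM/BiLSTM/GRU):
    # mixes each width unit with its left and right neighbours,
    # mimicking the contextual correlation the patent relies on.
    left = np.roll(feats, 1, axis=0)
    right = np.roll(feats, -1, axis=0)
    return (left + feats + right) / 3.0

picture = rng.random((32, 100))
sub_text_features = recurrent_head(backbone(picture))
```

The output has one sub-text feature vector per width unit, which is the shape the later gating and decoding steps consume.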
Calling the text feature extraction structure in the text recognition model to extract the text features of the text picture to be recognized specifically means inputting the text picture into the text feature extraction structure, which then outputs the text features of the text picture to be recognized.
Step S30, calling a picture type feature extraction structure in the text recognition model to extract the picture type features of the text picture to be recognized;
the picture type feature extraction structure may be a picture type feature for extracting a text picture, and it is understood that the picture type feature represents a picture feature of the text picture, and obviously different types of picture types have different features, for example, picture background features of text pictures obtained by scanning different documents are different. The specific representation form of the extracted image type feature is not limited, and may be, for example, a vector or a matrix, etc., where elements of the vector or the matrix may represent the type feature representation of the text image in various aspects, such as image background, source, etc. The picture type feature extraction structure can adopt a common neural network structure; the picture type feature extraction structure can be set independently from the text feature extraction structure, and the text picture is taken as input to extract the picture type feature of the whole text picture; or the text features output by the text feature recognition structure are used as input, and the picture type features of the text picture are extracted based on the text features.
And calling a picture type feature extraction structure in the text recognition model to extract the picture type features of the text picture to be recognized. Specifically, the text features of the text picture to be recognized may be input into the picture type feature extraction structure, and the picture type feature extraction structure outputs the picture type features of the text picture to be recognized.
And step S40, calling an identification structure in the text identification model to identify and obtain a text identification result of the text picture to be identified based on the text feature and the picture type feature.
A recognition structure in the text recognition model is called to perform recognition based on the text features and the picture type features, yielding the text recognition result of the text picture to be recognized. The recognition structure may be connected after the text feature extraction structure and the picture type feature extraction structure; that is, it takes the text features and the picture type features as input and outputs the text recognition result of the text picture to be recognized. Specifically, the recognition structure may adopt one or more fully connected layers that take the text features and the picture type features as input, followed by a conventional classifier; the classifier outputs which character each text unit in the picture belongs to, and the output characters together form the text recognition result. Because the recognition structure recognizes the text not only from the text features but also by combining the picture type features with them, the picture type features become a factor influencing the recognition result. Characters can therefore be accurately recognized even when fonts, backgrounds, and so on differ across text picture types, which improves the versatility of the text recognition model, i.e., the versatility of the OCR model. The text recognition model may be used alone as an OCR model, or a text detection model may be connected in front of it and the whole used as an OCR model.
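The gating that the claims describe — a first fully connected layer per sub-text feature, a second for the picture type feature, then an element-wise product decoded per width unit — can be shown with toy numpy tensors. All dimensions and weights are random illustrative values, and the CTC-style blank index is an assumption (the patent does not specify the decoder):

```python
import numpy as np

rng = np.random.default_rng(2)
T, D, K = 6, 8, 11          # width units, feature dim, output size (10 digits + 1 blank)
charset = "0123456789"

sub_text = rng.standard_normal((T, D))   # one sub-text feature per width unit
pic_type = sub_text.mean(axis=0)         # picture type feature (here: mean pooling)

W1 = rng.standard_normal((D, K))         # first fully connected layer (random stand-in)
W2 = rng.standard_normal((D, K))         # second fully connected layer (random stand-in)

first = sub_text @ W1                    # (T, K): one first vector per width unit
second = pic_type @ W2                   # (K,): same dimension as each first vector
third = first * second                   # element-wise product -> third vectors

# Decoder: greedy per-width-unit argmax; index K-1 is treated as a
# CTC-style blank and dropped (an assumption, not the patent's text).
ids = third.argmax(axis=1)
result = "".join(charset[i] for i in ids if i < len(charset))
```

Because `second` multiplies every width unit's scores, the picture type feature modulates each character decision rather than being a separate output.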
In this embodiment, a text recognition model is trained in advance; the text feature extraction structure in the model is called to extract the text features of the text picture to be recognized, the picture type feature extraction structure is called to extract its picture type features, and the recognition structure is then called to perform recognition based on the text features and the picture type features, obtaining the text recognition result. The picture type features thus become a factor influencing the recognition result, so the characters in the text can be accurately recognized even when text pictures are of different types, which improves the versatility of the text recognition model, i.e., the versatility of the OCR model. Because the model is more versatile, there is no need to train a separate model for each type of text picture; only one universal model needs to be trained, which improves training efficiency and also broadens the usable training data.
Further, based on the first embodiment described above, a second embodiment of the text recognition method according to the present invention is proposed, and in this embodiment, the step S30 includes:
step S301, calling the picture type feature extraction structure to extract the comprehensive features of the text picture to be recognized spanning multiple width units, and taking the comprehensive features as the picture type features of the text picture to be recognized.
In this embodiment, the text picture may be divided into a plurality of width units of a preset size; for example, the width occupied by one character may be taken as one width unit, so that a small picture containing one line of N characters is divided into N width units. The picture type feature extraction structure may be configured to extract a comprehensive feature of the text picture spanning a plurality of width units and use that comprehensive feature as the picture type feature of the text picture to be recognized. The comprehensive feature may be obtained by integrating the features extracted from the individual width units; those per-unit features may be text features or other features, such as background features and font features. Because the text features extracted by the text feature extraction structure describe a single character, they do not capture information integrated across a plurality of width units; if the text were identified from the text features alone, the influence of the whole text picture on the recognition of each single character would be ignored. This embodiment therefore uses the picture type feature extraction structure to extract a comprehensive feature spanning multiple width units, which can assist the recognition of each single character.
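The division into width units can be pictured with a minimal NumPy sketch. The fixed per-character pixel width `char_w` and the function name are illustrative assumptions — in a real model this division happens implicitly through strided convolutions rather than explicit slicing:

```python
import numpy as np

def split_into_width_units(img, char_w):
    """Split a single-line text image of shape (height, Width) into
    crops of width char_w, one crop per width unit.

    char_w is an assumed fixed per-character width, used here only to
    illustrate the 'one character = one width unit' convention.
    """
    height, width = img.shape
    n = width // char_w  # number of width units N
    return [img[:, i * char_w:(i + 1) * char_w] for i in range(n)]

img = np.zeros((32, 320))                      # a line picture, 32 px tall, 320 px wide
units = split_into_width_units(img, char_w=32) # 10 width units of 32 px each
```

Each element of `units` then corresponds to one sub-text feature's receptive region.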
Further, the text features include sub-text features corresponding to each width unit of the text picture to be recognized, and the step S301 includes:
step S3011, calling the picture type feature extraction structure to perform feature comprehensive processing on each sub-text feature of the text picture to be recognized in multiple width units, so as to obtain a comprehensive feature of the text picture to be recognized across multiple width units.
The text feature extraction structure may be configured to divide an input text picture into a plurality of width units and extract the text features of the picture in each width unit separately — the sub-text features — so that each width unit corresponds to one sub-text feature. The text features extracted by the text feature extraction structure can be represented by a matrix (w, h), where w is the width dimension and h is the feature dimension, so the matrix elements express the features of each width unit along the different feature dimensions. The comprehensive feature spanning a plurality of width units extracted by the picture type feature extraction structure may be obtained by comprehensively processing the sub-text features corresponding to those width units. In this case, the picture type feature extraction structure may be placed after the text feature extraction structure and take the sub-text features — for example, the matrix (w, h) above — as input; it may adopt one or more fully connected layers, and its output may be a vector or a matrix representing the extracted picture type feature of the text picture.
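One way to realize the fully-connected variant described above is sketched below in NumPy. The weight initialization, the choice to flatten the (w, h) matrix, and the output dimension h are all illustrative assumptions, not the patent's prescribed implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
w, h = 10, 64                          # width units x feature dimension
text_feats = rng.normal(size=(w, h))   # sub-text features, one row per width unit

# A single fully connected layer mapping the flattened (w*h) sub-text
# features to an h-dimensional picture type feature vector.
# Weights are random stand-ins; a trained model would learn them.
W_fc = rng.normal(size=(w * h, h)) * 0.01
b_fc = np.zeros(h)
pic_type_feat = text_feats.reshape(-1) @ W_fc + b_fc
```

The output `pic_type_feat` plays the role of the picture type feature that is later fed to the recognition structure.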
Further, the step S3011 includes:
step a, calling the picture type feature extraction structure to calculate an average value of each sub-text feature of the text picture to be recognized in multiple width units, and obtaining a comprehensive feature of the text picture to be recognized across the multiple width units.
The picture type feature extraction structure may be a structure that calculates an average value: it is called to average the sub-text features corresponding to the plurality of width units, and the average is taken as the comprehensive feature spanning those width units. For example, when the text features are represented by a matrix (w, h), each row of the matrix is one sub-text feature; averaging the sub-text features means averaging each column of the matrix, yielding a vector a = [a_1, a_2, ..., a_h]. The vector a represents the comprehensive feature spanning multiple width units and is used as the extracted picture type feature. The recognition structure of the text recognition model is then called to perform recognition based on the text features and the picture type features, obtaining the text recognition result of the text picture to be recognized.
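The column-wise averaging described above amounts to a mean over the width axis of the feature matrix. A minimal sketch, with a deterministic stand-in feature matrix for illustration:

```python
import numpy as np

w, h = 10, 64
# Stand-in sub-text features: one row per width unit, h feature dimensions.
text_feats = np.arange(w * h, dtype=float).reshape(w, h)

# Average over the width dimension (each column of the matrix),
# giving the cross-width-unit comprehensive feature a = [a_1, ..., a_h].
a = text_feats.mean(axis=0)
```

The resulting h-dimensional vector `a` is what the second fully connected layer of the recognition structure consumes.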
In this embodiment, the text feature extraction structure is called to extract the sub-text features corresponding to each width unit of the text picture, the picture type feature extraction structure then averages those sub-text features and the result is used as the picture type feature of the text picture, and the recognition structure is then called to recognize the text picture to be recognized based on the text features and the picture type features. During recognition, the model therefore relies not only on the text features of a single character but also on the feature formed by combining the text features of all the characters, making the recognition of each single character more accurate and avoiding the poor recognition that occurs when the background, font and the like change. Better recognition is thus achieved for different types of text pictures.
Further, based on the first and second embodiments, a third embodiment of the text recognition method according to the present invention is proposed, in which the recognition structure includes a first fully connected layer and a second fully connected layer, and step S40 includes:
step S401, inputting the sub-text features corresponding to a single width unit into the first full connection layer to obtain a first vector;
In this embodiment, the recognition structure may comprise a first fully connected layer and a second fully connected layer. The first fully connected layer is configured to take the text features — specifically, one sub-text feature — as input; the second fully connected layer is configured to take the picture type features as input. The outputs of the first and second fully connected layers may each be a vector whose number of elements is the total number of characters the text recognition model can recognize.
To perform recognition based on the text features and the picture type features, the recognition structure inputs the sub-text feature corresponding to a single width unit into the first fully connected layer, which outputs the first vector. Likewise, the sub-text features of the other width units are input into the first fully connected layer in turn, yielding a first vector for each width unit.
Step S402, inputting the picture type characteristics into the second full-connection layer to obtain a second vector, wherein the dimensions of the first vector and the second vector are the same;
and inputting the picture type characteristics into a second full-connection layer, and outputting to obtain a second vector. The first vector and the second vector have the same dimension, that is, the number of elements of the vectors is the same.
Step S403, multiplying corresponding elements of the first vector and the second vector to obtain a third vector;
and multiplying the corresponding elements of the first vector and the second vector to obtain a third vector. The corresponding elements are multiplied, that is, the first element of the first vector is multiplied by the first element of the second vector, the second element of the first vector is multiplied by the second element of the second vector, and so on, and the result is taken as the element of the third vector, it can be understood that the number of the elements of the third vector is the same as that of the first vector and the second vector. Similarly, the plurality of first vectors are respectively multiplied by the corresponding elements of the second vector, and then a plurality of third vectors respectively corresponding to the width units are obtained. Since the first vector is derived from the text feature and the second vector is derived from the picture type feature, the third vector corresponding to each width unit integrates the text feature and the picture type feature.
Step S404, determining a text recognition result of the text picture to be recognized according to the third vector corresponding to each width unit.
And determining a text recognition result of the text picture to be recognized according to the third vector corresponding to each width unit. Specifically, if one width unit corresponds to one character, each character can be recognized according to each third vector, and the recognition results of each character are combined to obtain the text recognition result of the text picture to be recognized.
Further, the identification structure further includes a decoder, and the step S404 includes:
step S4041, the third vectors corresponding to the width units are respectively input into the decoder to obtain target characters corresponding to the width units, and the target characters are used as the text recognition results of the text pictures to be recognized.
The recognition structure may further include a decoder, which may be a decoder used by common classification models, with a preset correspondence between the element values of the third vector and the characters the text recognition model can recognize. Specifically, the third vectors corresponding to the width units are respectively input into the decoder, which decodes each into a target character; the target characters, combined in order, give the text recognition result of the text picture to be recognized. Further, a softmax (activation function) operation may be applied to the third vector before it is passed to the decoder; softmax is used in multi-class classification to map the outputs of a set of neurons into the (0, 1) interval.
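A greedy softmax-then-argmax decoder of the kind described can be sketched as follows. The four-character table and the logit values are purely hypothetical; real models use far larger character sets and often more elaborate (e.g. CTC-style) decoding:

```python
import numpy as np

charset = ["0", "O", "1", "l"]     # hypothetical character table

def softmax(v):
    e = np.exp(v - v.max())        # subtract max for numerical stability
    return e / e.sum()

def decode(third_vectors, charset):
    """Greedy decode: softmax each third vector, pick the argmax character,
    and concatenate the per-width-unit results in order."""
    return "".join(charset[int(np.argmax(softmax(v)))] for v in third_vectors)

third_vectors = [np.array([3.0, 0.1, 0.2, 0.3]),
                 np.array([0.1, 0.2, 4.0, 0.3])]
text = decode(third_vectors, charset)   # -> "01"
```

Softmax does not change the argmax, so it matters for training losses and confidences rather than for this greedy readout.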
Further, in one embodiment, the text picture to be recognized may be a small picture (a text picture containing one row or one column of text characters) produced by a text detection model. The input text picture to be recognized is processed by a backbone network and a head network in the text recognition structure into a tensor T[batch_size, w, h], where batch_size is the batch size of the input text pictures. The value of w is computed as follows: assuming the text feature extraction structure reduces the width of an input picture from Width to Width/x_w, where Width is the width of the input text picture and x_w is a preset picture scaling ratio, then w = Width/x_w. The value of h is a preset feature dimension parameter. The first dimension of T indexes the batch; each sub-tensor in the batch is a matrix [w, h], the feature matrix extracted from one text picture to be recognized in the batch. Averaging each column of this matrix [w, h] yields a vector a = [a_1, a_2, ..., a_h]. The first fully connected layer is then applied to each row [bw_1, bw_2, ..., bw_h] of each matrix in the batch of T, and the second fully connected layer is applied to a. Both fully connected layers have input dimension h and output dimension c, the total number of characters the text recognition model can recognize. A fully connected layer essentially performs a matrix multiplication; denote the matrices of the two fully connected layers by D_1 and D_2, with elements d_1_i_j and d_2_i_j respectively, where 1 ≤ i ≤ c and 1 ≤ j ≤ h.
The outputs of the two fully connected layers (the first vector and the second vector) are then multiplied element-wise to obtain the final output, a c-dimensional vector (the third vector) denoted O, whose elements O_k are given by: O_k = (d_1_k_1·bw_1 + d_1_k_2·bw_2 + ... + d_1_k_h·bw_h) × (d_2_k_1·a_1 + d_2_k_2·a_2 + ... + d_2_k_h·a_h), where 1 ≤ k ≤ c. The second factor, d_2_k_1·a_1 + d_2_k_2·a_2 + ... + d_2_k_h·a_h, represents information such as the font and background of the text picture to be recognized. A softmax operation is then applied to the vector O, and a decoder is connected afterwards to decode it into the text recognition result of the text picture to be recognized.
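The O_k formula above is simply the element-wise product of two matrix-vector products, which can be verified directly with toy sizes (the random values and dimensions below are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
h, c = 4, 3                        # feature dimension and character count (toy sizes)
bw = rng.normal(size=h)            # sub-text feature of one width unit [bw_1..bw_h]
a = rng.normal(size=h)             # picture type feature [a_1..a_h] (column means)
D1 = rng.normal(size=(c, h))       # matrix of the first fully connected layer
D2 = rng.normal(size=(c, h))       # matrix of the second fully connected layer

# O_k = (sum_j D1[k,j]*bw[j]) * (sum_j D2[k,j]*a[j]):
# an element-wise product of the two fully-connected-layer outputs.
O = (D1 @ bw) * (D2 @ a)
```

Because `D2 @ a` is shared across all width units of one picture, the font/background factor rescales every character score consistently.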
In this embodiment, the reason existing OCR models lack universality across different types of text pictures is that the main difference between text pictures lies in background and font, and under different font and background combinations a situation often arises in which characters look the same but mean different things. For example, the character O represents the digit 0 under some backgrounds and fonts, and the letter O under others; a character such as θ represents the Greek letter theta in some fonts and backgrounds, while in other cases it may represent the digit 0. A single conventional OCR model therefore cannot recognize nearly identical characters as two different results under different backgrounds and fonts. In this embodiment, the text features and the picture type features of the text picture to be recognized are integrated through the first and second fully connected layers of the recognition structure, so the final recognition result takes the background and font of the text picture to be recognized into account; text pictures with different fonts and backgrounds can then receive different classification results even when the characters look almost the same, improving the universality of the text recognition model.
In addition, an embodiment of the present invention further provides a text recognition apparatus, and referring to fig. 3, the text recognition apparatus includes:
the acquiring module 10 is used for acquiring a text picture to be identified;
the text feature extraction module 20 is configured to invoke a text feature extraction structure in a text recognition model to extract text features of the text picture to be recognized, where the text recognition model is obtained through pre-training;
the picture type feature extraction module 30 is configured to invoke a picture type feature extraction structure in the text recognition model to extract picture type features of the text picture to be recognized;
and the recognition module 40 is used for calling a recognition structure in the text recognition model to perform recognition based on the text features and the picture type features, obtaining a text recognition result of the text picture to be recognized.
Further, the picture type feature extraction module 30 is further configured to:
and calling the picture type feature extraction structure to extract comprehensive features of the text picture to be recognized spanning multiple width units, and taking the comprehensive features as the picture type features of the text picture to be recognized.
Further, the text features comprise sub-text features respectively corresponding to each width unit of the text picture to be recognized,
the picture type feature extraction module 30 is further configured to:
and calling the picture type feature extraction structure to perform feature comprehensive processing on each sub-text feature of the text picture to be recognized in multiple width units to obtain comprehensive features of the text picture to be recognized across multiple width units.
Further, the identification structure comprises a first fully connected layer and a second fully connected layer, and the identification module 40 comprises:
a first input unit, configured to input the sub-text feature corresponding to a single width unit into the first full-link layer, so as to obtain a first vector;
a second input unit, configured to input the picture type feature into the second full-link layer to obtain a second vector, where dimensions of the first vector and the second vector are the same;
the calculating unit is used for multiplying corresponding elements of the first vector and the second vector to obtain a third vector;
and the determining unit is used for determining the text recognition result of the text picture to be recognized according to the third vector corresponding to each width unit.
Further, the identification structure further comprises a decoder, and the determining unit is further configured to:
and respectively inputting the third vectors corresponding to the width units into the decoder to obtain target characters corresponding to the width units, and taking the target characters as text recognition results of the text pictures to be recognized.
Further, the picture type feature extraction module 30 is further configured to:
and calling the picture type feature extraction structure to calculate an average value of each sub-text feature of the text picture to be recognized in multiple width units, so as to obtain the comprehensive feature of the text picture to be recognized across multiple width units.
Further, the obtaining module 10 includes:
the calling unit is used for calling a preset text detection model to detect the text position in the picture to be recognized;
and the cutting unit is used for cutting the picture to be recognized based on the text position to obtain the text picture to be recognized containing one line or one column of text characters.
The specific implementation of the text recognition apparatus of the present invention is basically the same as the embodiments of the text recognition method, and is not described herein again.
In addition, an embodiment of the present invention further provides a computer-readable storage medium, on which a text recognition program is stored; when executed by a processor, the text recognition program implements the steps of the text recognition method as described above.
The embodiments of the text recognition device and the computer-readable storage medium of the present invention can refer to the embodiments of the text recognition method of the present invention, and are not described herein again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A text recognition method, characterized in that the text recognition method comprises the steps of:
acquiring a text picture to be identified;
calling a text feature extraction structure in a text recognition model to extract text features of the text picture to be recognized, wherein the text recognition model is obtained by pre-training;
calling a picture type feature extraction structure in the text recognition model to extract picture type features of the text picture to be recognized;
and calling an identification structure in the text identification model to identify and obtain a text identification result of the text picture to be identified based on the text feature and the picture type feature.
2. The text recognition method of claim 1, wherein the step of invoking the picture type feature extraction structure in the text recognition model to extract the picture type features of the text picture to be recognized comprises:
and calling the picture type feature extraction structure to extract comprehensive features of the text picture to be recognized spanning multiple width units, and taking the comprehensive features as the picture type features of the text picture to be recognized.
3. The text recognition method according to claim 2, wherein the text features comprise sub-text features corresponding to the width units of the text picture to be recognized respectively,
the step of calling the picture type feature extraction structure to extract the comprehensive features of the text picture to be recognized across a plurality of width units comprises the following steps:
and calling the picture type feature extraction structure to perform feature comprehensive processing on each sub-text feature of the text picture to be recognized in multiple width units to obtain comprehensive features of the text picture to be recognized across multiple width units.
4. The text recognition method of claim 3, wherein the recognition structure comprises a first fully-connected layer and a second fully-connected layer, and the step of calling the recognition structure in the text recognition model to obtain the text recognition result of the text picture to be recognized based on the text feature and the picture type feature recognition comprises:
inputting the sub-text features corresponding to a single width unit into the first full-connection layer to obtain a first vector;
inputting the picture type features into the second full-connection layer to obtain a second vector, wherein the dimensions of the first vector and the second vector are the same;
multiplying the first vector and the second vector by corresponding elements to obtain a third vector;
and determining a text recognition result of the text picture to be recognized according to the third vector corresponding to each width unit.
5. The text recognition method of claim 4, wherein the recognition structure further comprises a decoder, and the step of determining the text recognition result of the text picture to be recognized according to the third vector corresponding to each width unit comprises:
and respectively inputting the third vectors corresponding to the width units into the decoder to obtain target characters corresponding to the width units, and taking the target characters as text recognition results of the text pictures to be recognized.
6. The method of claim 3, wherein the step of calling the picture type feature extraction structure to perform feature synthesis processing on each of the sub-text features of the text picture to be recognized in multiple width units to obtain a synthesized feature of the text picture to be recognized across multiple width units comprises:
and calling the picture type feature extraction structure to calculate an average value of each sub-text feature of the text picture to be recognized in multiple width units, so as to obtain the comprehensive feature of the text picture to be recognized across multiple width units.
7. The text recognition method of any one of claims 1 to 6, wherein the step of obtaining the text picture to be recognized comprises:
calling a preset text detection model to detect the text position in the picture to be recognized;
and cutting the picture to be recognized based on the text position to obtain the text picture to be recognized containing one line or one column of text characters.
8. A text recognition apparatus, characterized in that the text recognition apparatus comprises:
the acquisition module is used for acquiring a text picture to be identified;
the text feature extraction module is used for calling a text feature extraction structure in a text recognition model to extract the text features of the text picture to be recognized, wherein the text recognition model is obtained by pre-training;
the picture type feature extraction module is used for calling a picture type feature extraction structure in the text recognition model to extract the picture type features of the text picture to be recognized;
and the recognition module is used for calling a recognition structure in the text recognition model to recognize and obtain a text recognition result of the text picture to be recognized based on the text feature and the picture type feature.
9. A text recognition apparatus characterized by comprising: memory, processor and text recognition program stored on the memory and executable on the processor, which when executed by the processor implements the steps of the text recognition method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a text recognition program which, when executed by a processor, implements the steps of the text recognition method according to any one of claims 1 to 7.
CN202010127912.0A 2020-02-28 2020-02-28 Text recognition method, device, equipment and storage medium Pending CN111368841A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010127912.0A CN111368841A (en) 2020-02-28 2020-02-28 Text recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010127912.0A CN111368841A (en) 2020-02-28 2020-02-28 Text recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111368841A true CN111368841A (en) 2020-07-03

Family

ID=71206527

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010127912.0A Pending CN111368841A (en) 2020-02-28 2020-02-28 Text recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111368841A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112990175A * 2021-04-01 2021-06-18 深圳思谋信息科技有限公司 Method and device for recognizing handwritten Chinese characters, computer equipment and storage medium
CN113780098A * 2021-08-17 2021-12-10 北京百度网讯科技有限公司 Character recognition method, character recognition device, electronic equipment and storage medium
CN113780098B * 2021-08-17 2024-02-06 北京百度网讯科技有限公司 Character recognition method, character recognition device, electronic equipment and storage medium
WO2023065895A1 * 2021-10-18 2023-04-27 北京有竹居网络技术有限公司 Text recognition method and apparatus, readable medium, and electronic device

Similar Documents

Publication Publication Date Title
US10846553B2 (en) Recognizing typewritten and handwritten characters using end-to-end deep learning
US8494273B2 (en) Adaptive optical character recognition on a document with distorted characters
CN110705233B (en) Note generation method and device based on character recognition technology and computer equipment
CN111368841A (en) Text recognition method, device, equipment and storage medium
EP1231558A2 (en) A printing control interface system and method with handwriting discrimination capability
CN110503054B (en) Text image processing method and device
CN109409349B (en) Credit certificate authentication method, credit certificate authentication device, credit certificate authentication terminal and computer readable storage medium
CN111340037B (en) Text layout analysis method and device, computer equipment and storage medium
JP2002063215A (en) Method and system for displaying document, computer program and recording medium
US7796817B2 (en) Character recognition method, character recognition device, and computer product
EP3540644B1 (en) Image processing device, image processing method, and image processing program
US11418658B2 (en) Image processing apparatus, image processing system, image processing method, and storage medium
EP3525441A1 (en) Receipt processing apparatus, program, and report production method
CN112801084A (en) Image processing method and device, electronic equipment and storage medium
CN108090728B (en) Express information input method and system based on intelligent terminal
CN113673528A (en) Text processing method and device, electronic equipment and readable storage medium
CN115083024A (en) Signature identification method, device, medium and equipment based on region division
CN115188006A (en) Method and device for extracting text from image, storage medium and electronic equipment
CN111213157A (en) Express information input method and system based on intelligent terminal
CN113111882A (en) Card identification method and device, electronic equipment and storage medium
JP4945593B2 (en) Character string collation device, character string collation program, and character string collation method
CN114202761B (en) Information batch extraction method based on picture information clustering
JP3114446B2 (en) Character recognition device
JP7410532B2 (en) Character recognition device and character recognition program
EP4379678A1 (en) Image processing system, image processing method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination