CN116978021A - Text character recognition method, device and storage medium

Publication number: CN116978021A
Authority: CN (China)
Legal status: Pending
Application number: CN202211144562.4A
Original language: Chinese (zh)
Prior art keywords: text, image, character, feature
Inventors: 包志敏, 徐明亮, 曹浩宇, 王斌, 姜德强
Assignee (current and original): Tencent Technology Shenzhen Co Ltd

Classifications

    • G06V 30/18: Character recognition; extraction of features or characteristics of the image
    • G06V 30/1444: Character recognition; image acquisition; selective acquisition, locating or processing of specific regions, e.g. highlighted text, fiducial marks or predetermined fields
    • G06V 30/147: Character recognition; image acquisition; determination of region of interest
    • G06V 30/19147: Character recognition; recognition using electronic means; obtaining sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 30/19173: Character recognition; recognition using electronic means; classification techniques
    • G06V 10/82: Image or video recognition using pattern recognition or machine learning, using neural networks
    • G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods

Abstract

The application discloses a text character recognition method, device and storage medium, which can be applied to map or Internet-of-Vehicles scenarios that involve character recognition. Character features are extracted from an acquired image to be recognized; a search vector is configured according to preset bounding boxes; the search vector is then associated with the encoding vector corresponding to the character features to obtain a target text feature; and decoding is performed based on the target text feature, so that the text content information and character position information corresponding to the image to be recognized are obtained from the decoded feature information. A text character recognition process at character granularity is thus realized: bounding boxes are used to locate the text characters, and recognition is performed after the characters are aligned, which avoids mutual interference between characters and improves the accuracy of text character recognition.

Description

Text character recognition method, device and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and apparatus for recognizing text characters, and a storage medium.
Background
Character recognition is one of the core tasks in optical character recognition (Optical Character Recognition, OCR); it is defined as outputting the character content of each text-line image detected by text detection.
Generally, character recognition can be performed by training a neural network model, that is, by having the model learn the correspondence between image features and character content.
However, when recognition is performed this way, features of neighbouring characters can interfere with one another: adjacent characters may be merged and recognized as a new character, producing an incorrect recognition result and affecting the accuracy of text character recognition.
Disclosure of Invention
In view of this, the application provides a method for recognizing text characters, which can effectively improve the accuracy of text character recognition.
The first aspect of the present application provides a method for recognizing text characters, which can be applied to a system or program including a text character recognition function in a terminal device, and specifically includes:
acquiring an image to be recognized containing text characters, wherein the text characters are configured based on text lines;
performing image feature extraction on the image to be recognized to obtain image features;
determining a position code of the text characters in the image to be recognized to obtain position features;
combining the position features with the image features to obtain character features;
acquiring preset bounding boxes configured for the text lines, so as to configure a search vector according to the preset bounding boxes, wherein the number of the preset bounding boxes is greater than or equal to the number of characters corresponding to the text lines;
associating the search vector with an encoding vector corresponding to the character features to obtain a target text feature;
and decoding based on the target text feature, so as to recognize the text content information and character position information corresponding to the image to be recognized from the decoded feature information.
Optionally, in some possible implementations of the present application, extracting image features from the image to be recognized to obtain the image features includes:
inputting the image to be recognized into a feature extraction network to downsample the image to be recognized and obtain features with a preset height;
reducing the number of channels based on a preset convolution kernel, so as to reconstruct the features with the preset height and obtain the image features;
correspondingly, determining the position code of the text characters in the image to be recognized to obtain the position features includes:
acquiring the image feature dimension configured for the feature extraction network;
and determining the position code of the text characters in the image to be recognized based on the image feature dimension to obtain the position features, wherein the dimension of the position code is the same as the image feature dimension.
Optionally, in some possible implementations of the present application, associating the search vector with the encoding vector corresponding to the character features to obtain a target text feature includes:
performing a self-attention operation on the search vector to obtain a search feature encoding vector;
determining the bounding box vector corresponding to each preset bounding box based on the search feature encoding vector;
performing a cross-attention operation between each bounding box vector and the encoding vector corresponding to the character features to obtain the target text feature;
correspondingly, decoding based on the target text feature, so as to recognize the text content information and character position information corresponding to the image to be recognized from the decoded feature information, includes:
decoding based on each bounding box vector indicated by the target text feature to obtain the decoded and aligned feature information of each preset bounding box;
and recognizing the decoded and aligned feature information of each preset bounding box to obtain the text content information and character position information corresponding to the image to be recognized.
Optionally, in some possible implementations of the present application, the method further includes:
if no text content is retrieved in a preset bounding box, outputting the text content information as a blank item;
uploading the blank item so as to configure an extension category based on the blank item, wherein the extension category is used to indicate the recognition of content in the preset bounding box.
Optionally, in some possible implementations of the present application, the method further includes:
acquiring a position configuration content category corresponding to the character position information;
matching the text content information with the position configuration content category to obtain content detection information;
and adjusting the preset bounding box based on the content detection information.
Optionally, in some possible implementations of the present application, the training step of the text recognition model includes:
acquiring a training image, wherein the training image comprises a plurality of training text lines;
configuring preset bounding boxes based on the number of characters corresponding to the training text lines, wherein the number of the preset bounding boxes is larger than or equal to the number of characters corresponding to the training text lines;
extracting image features of the training image to obtain training image features;
determining the position codes of training characters in the training images to obtain training position features;
combining the training image features with the training position features to obtain training character features;
and carrying out association mapping on the training character features and the content corresponding to the training characters based on the preset bounding box so as to train a text recognition model, wherein a decoder in the trained text recognition model is used for decoding the target text features.
Optionally, in some possible implementations of the present application, the training step of the text recognition model further includes:
acquiring text content distributed in the preset bounding box;
if the text content distributed in the preset bounding box is a training blank item, screening out the preset bounding box corresponding to the training blank item;
and carrying out association mapping on the training character features and the content corresponding to the training characters based on the screened preset bounding box so as to train the text recognition model.
A second aspect of the present application provides a text character recognition apparatus, including:
an acquisition unit, configured to acquire an image to be recognized containing text characters, wherein the text characters are configured based on text lines;
the extraction unit is used for extracting image features of the image to be identified so as to obtain the image features;
the determining unit is used for determining the position codes of the text characters in the image to be recognized so as to obtain position characteristics;
the extraction unit is further used for combining the position features with the image features to obtain character features;
the acquiring unit is further configured to acquire a preset bounding box configured for the text line, so as to configure a search vector according to the preset bounding box, where the number of the preset bounding boxes is greater than or equal to the number of characters corresponding to the text line;
the identification unit is used for associating the search vector with the coding vector corresponding to the character feature so as to obtain a target text feature;
the recognition unit is further used for decoding based on the target text characteristics so as to recognize and obtain text content information and character position information corresponding to the image to be recognized according to the decoded characteristic information.
Optionally, in some possible implementations of the present application, the extracting unit is specifically configured to input the image to be identified into a feature extraction network, so as to downsample the image to be identified to obtain features with a preset height;
the extraction unit is specifically configured to reduce the number of channels based on a preset convolution kernel, so as to reconstruct the features of the preset height to obtain the image features;
correspondingly, the determining unit is specifically configured to obtain an image feature dimension of the feature extraction network configuration;
the determining unit is specifically configured to determine the position code of the text character in the image to be identified based on the image feature dimension, so as to obtain the position feature, where the dimension of the position code is the same as the image feature dimension.
Optionally, in some possible implementations of the present application, the identifying unit is specifically configured to perform a self-attention operation on the search vector to obtain a search feature encoding vector;
the identification unit is specifically configured to determine a bounding box vector corresponding to each preset bounding box based on the search feature encoding vector;
the recognition unit is specifically configured to perform a cross-attention operation between each bounding box vector and the encoding vector corresponding to the character features, so as to obtain the target text feature;
correspondingly, the identification unit is specifically configured to decode based on each bounding box vector indicated by the target text feature, so as to obtain feature information after each preset bounding box is decoded and aligned;
the recognition unit is specifically configured to recognize the feature information after the decoding and alignment of each preset bounding box, so as to obtain text content information and character position information corresponding to the image to be recognized.
Optionally, in some possible implementation manners of the present application, the identifying unit is specifically configured to output text content information as a blank item if text content is not retrieved in the preset bounding box;
the identification unit is specifically configured to upload the blank item, so as to configure an extension category based on the blank item, where the extension category is used to indicate identification of content in the preset bounding box.
Optionally, in some possible implementation manners of the present application, the identifying unit is specifically configured to obtain a location configuration content category corresponding to the character location information;
the identification unit is specifically configured to match the text content information with the location configuration content category to obtain content detection information;
the identification unit is specifically configured to adjust the preset bounding box based on the content detection information.
Optionally, in some possible implementations of the present application, the identifying unit is specifically configured to obtain a training image, where the training image includes a plurality of training text lines;
the recognition unit is specifically configured to configure a preset bounding box based on the number of characters corresponding to the training text line, where the number of preset bounding boxes is greater than or equal to the number of characters corresponding to the training text line;
the recognition unit is specifically configured to extract image features of the training image to obtain training image features;
the recognition unit is specifically configured to determine a position code of a training character in the training image, so as to obtain a training position feature;
the recognition unit is specifically configured to combine the training image feature with the training position feature to obtain a training character feature;
the recognition unit is specifically configured to perform association mapping on the training character feature and content corresponding to the training character based on the preset bounding box, so as to train a text recognition model, where a decoder in the trained text recognition model is used to decode the target text feature.
Optionally, in some possible implementations of the present application, the identifying unit is specifically configured to obtain text content allocated in the preset bounding box;
the recognition unit is specifically configured to screen out a preset bounding box corresponding to the training blank item if text content allocated in the preset bounding box is the training blank item;
the recognition unit is specifically configured to perform association mapping on the training character feature and content corresponding to the training character based on the screened preset bounding box, so as to train the text recognition model.
A third aspect of the present application provides a computer device, comprising: a memory, a processor, and a bus system; the memory is used to store program code; and the processor is configured to execute the text character recognition method of the first aspect or any optional implementation of the first aspect according to instructions in the program code.
A fourth aspect of the application provides a computer-readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the text character recognition method of the first aspect or any optional implementation of the first aspect.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, the processor executing the computer instructions, causing the computer device to perform the method of recognition of text characters provided in the above-described first aspect or various alternative implementations of the first aspect.
From the above technical solutions, the embodiment of the present application has the following advantages:
an image to be recognized containing text characters is acquired, the text characters being configured based on text lines; image feature extraction is then performed on the image to be recognized to obtain image features; the position code of the text characters in the image to be recognized is determined to obtain position features; the position features are combined with the image features to obtain character features; preset bounding boxes configured for the text lines are further acquired, so that a search vector is configured according to the preset bounding boxes, the number of preset bounding boxes being greater than or equal to the number of characters corresponding to the text lines; the search vector is associated with the encoding vector corresponding to the character features to obtain a target text feature; and decoding is performed based on the target text feature, so that the text content information and character position information corresponding to the image to be recognized are obtained from the decoded feature information. A text character recognition process at character granularity is thus realized: bounding boxes are used to locate the text characters, and recognition is performed after the characters are aligned, so that mutual interference between characters is avoided and the accuracy of text character recognition is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present application; for a person skilled in the art, other drawings can be obtained from them without inventive effort.
FIG. 1 is a network architecture diagram of the operation of a text character recognition system;
FIG. 2 is a schematic diagram of a text character recognition process according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for recognizing text characters according to an embodiment of the present application;
FIG. 4 is a schematic view of a scenario of a text character recognition method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a model structure of a method for recognizing text characters according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a model structure of another method for recognizing text characters according to an embodiment of the present application;
FIG. 7 is a schematic view of a scene of another method for recognizing text characters according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a text character recognition device according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a terminal device according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The embodiment of the application provides a text character recognition method and related devices, which can be applied to a system or program including a text character recognition function in a terminal device. An image to be recognized containing text characters is acquired, the text characters being configured based on text lines; image feature extraction is then performed on the image to be recognized to obtain image features; the position code of the text characters in the image to be recognized is determined to obtain position features; the position features are combined with the image features to obtain character features; preset bounding boxes configured for the text lines are further acquired, so that a search vector is configured according to the preset bounding boxes, the number of preset bounding boxes being greater than or equal to the number of characters corresponding to the text lines; the search vector is associated with the encoding vector corresponding to the character features to obtain a target text feature; and decoding is performed based on the target text feature, so that the text content information and character position information corresponding to the image to be recognized are obtained from the decoded feature information. A text character recognition process at character granularity is thus realized: bounding boxes are used to locate the text characters, and recognition is performed after the characters are aligned, so that mutual interference between characters is avoided and the accuracy of text character recognition is improved.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "includes" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that the text character recognition method provided in the present application may be applied to a system or program including a text character recognition function in a terminal device, for example a text character recognition application. Specifically, the text character recognition system may operate in the network architecture shown in fig. 1, which is a diagram of the network architecture in which the system runs. As shown in fig. 1, the system can handle text character recognition for multiple information sources: a text image is sent to the server through a trigger operation on the terminal side, and the characters in the text image are then recognized. It will be appreciated that various terminal devices are shown in fig. 1; the terminal devices may be computer devices. In an actual scenario, more or fewer terminal devices may participate in the text character recognition process, the specific number and types not being limited here. In addition, one server is shown in fig. 1, but in an actual scenario multiple servers may participate, the specific number of servers depending on the actual scenario.
In this embodiment, the server may be an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and basic cloud computing services such as big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a smart voice interaction device, a smart home appliance, a vehicle-mounted terminal, and the like. The terminals and servers may be directly or indirectly connected by wired or wireless communication, and the terminals and servers may be connected to form a blockchain network, which is not limited herein.
It will be appreciated that the above text character recognition system may run on a personal mobile terminal, for example as a text character recognition application; it may also run on a server, or on a third-party device, to provide text character recognition and obtain the recognition results for the information source. The specific text character recognition system may take the form of a program, run as a system component in the device, or operate as a cloud service program; the specific operation mode depends on the actual scenario and is not limited here.
Character recognition is one of the core tasks in optical character recognition (Optical Character Recognition, OCR); it is defined as outputting the character content of each text-line image detected by text detection.
Generally, character recognition can be performed by training a neural network model, that is, by having the model learn the correspondence between image features and character content.
However, when recognition is performed this way, features of neighbouring characters can interfere with one another: adjacent characters may be merged and recognized as a new character, producing an incorrect recognition result and affecting the accuracy of text character recognition.
In order to solve the above problems, the present application proposes a text character recognition method, applied in the text character recognition flow framework shown in fig. 2, a flow framework diagram provided in an embodiment of the present application. The server receives the corresponding image to be recognized through a recognition request sent by the terminal, extracts the character features and position codes in the image, and configures the corresponding bounding boxes as search vectors to perform targeted content alignment, thereby implementing a localized content recognition process.
It can be understood that the method provided by the application may be implemented as a program, i.e. as processing logic in a hardware system, or as a text character recognition device that realizes this processing logic in an integrated or external form. As one implementation, the text character recognition device acquires an image to be recognized containing text characters, the text characters being configured based on text lines; then performs image feature extraction on the image to be recognized to obtain image features; determines the position code of the text characters in the image to be recognized to obtain position features; combines the position features with the image features to obtain character features; further acquires preset bounding boxes configured for the text lines, so as to configure a search vector according to the preset bounding boxes, the number of preset bounding boxes being greater than or equal to the number of characters corresponding to the text lines; associates the search vector with the encoding vector corresponding to the character features to obtain a target text feature; and decodes based on the target text feature, so that the text content information and character position information corresponding to the image to be recognized are obtained from the decoded feature information. A text character recognition process at character granularity is thus realized: bounding boxes are used to locate the text characters, and recognition is performed after the characters are aligned, so that mutual interference between characters is avoided and the accuracy of text character recognition is improved.
The scheme provided by the embodiment of the application relates to an artificial intelligence computer vision technology, and is specifically described by the following embodiments:
With reference to fig. 3, fig. 3 is a flowchart of a text character recognition method provided by an embodiment of the present application. The method may be executed by a server or a terminal, and the embodiment of the present application includes at least the following steps:
301. Acquiring an image to be recognized containing text characters.
In this embodiment, the text characters are configured based on text lines, that is, the image to be recognized contains multiple lines of characters, and content recognition at character granularity is required for these lines.
In a possible character recognition scenario, as shown in fig. 4, a schematic diagram of a text character recognition method according to an embodiment of the present application, character positions can generally be predicted by adding a character detector. However, strict alignment between a pure character detector and the recognition result cannot be guaranteed; once missed or false detections occur, character positions no longer match character content, as shown in fig. 4. Moreover, character recognition tasks generally adopt the CRNN or seq2seq algorithms, which are character-sequence prediction frameworks and do not support character localization. As shown in fig. 4, an algorithm that completes both character recognition and localization therefore needs to be designed, solving accurate character position prediction while avoiding the alignment problem of matching pure character detection with character recognition.
302. Inputting the image to be recognized into a text recognition model and performing image feature extraction on the image to be recognized to obtain image features.
In this embodiment, the text recognition model includes a feature extraction network, a position encoder, a multi-head encoder and decoder, and an output network. Image feature extraction is performed by the feature extraction network, a convolutional neural network module: the convolutional neural network serves as the feature extraction network that produces the image features.
The convolutional neural network may be ResNet-50, ResNet-50-DC5, ResNet-101, etc., which is not limited here.
Specifically, during image feature extraction, the image to be recognized is first input into the feature extraction network and downsampled to features of a preset height; the number of channels is then reduced based on a preset convolution kernel, so that the features of the preset height are reconstructed into the image features. That is, a 1×1 convolution adjusts the channel count of the extracted picture features to obtain a feature vector of dimension d×H×W, which is reshaped into d×(H×W), added to the spatial positional encoding, and fed into a Transformer encoder structure.
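As a concrete illustration, this extraction step can be sketched in PyTorch roughly as follows; the module and variable names are illustrative assumptions rather than the patent's, and ResNet-50 is used only because it is one of the backbones named above. The positional encoding is added to the resulting sequence afterwards (see the sketch after step 304).

```python
import torch.nn as nn
import torchvision

class BackboneFeatures(nn.Module):
    """Minimal sketch of the feature extraction step (assumed realization)."""
    def __init__(self, d_model=256):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        # keep the convolutional stages; the last stage outputs 2048 channels
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        # 1x1 convolution reduces the channel count from 2048 to d_model
        self.reduce = nn.Conv2d(2048, d_model, kernel_size=1)

    def forward(self, images):                # images: [B, 3, H0, W0]
        f = self.backbone(images)             # [B, 2048, H0/32, W0/32]
        f = self.reduce(f)                    # [B, d, H, W]
        return f.flatten(2).permute(2, 0, 1)  # [H*W, B, d], ready for the encoder
```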
Correspondingly, for the acquisition of the position features, the image feature dimension of the feature extraction network configuration can be acquired; and then determining the position codes of the text characters in the image to be recognized based on the image feature dimensions to obtain position features, namely the dimensions of the position codes are identical with the dimensions of the image features.
In a possible scenario, a text recognition model is shown in fig. 5, and fig. 5 is a schematic diagram of a model structure of a text character recognition method according to an embodiment of the present application; the method comprises a feature extraction network, a position encoder, a multi-head decoder and an output network, wherein features are extracted through the feature extraction network. For example using Resnet to obtain feature vectors for pictures. The acquired features are then fed into a multi-head encoder, for example, a transform encoder comprising M layers. The search vector (Object vectors) and the multi-head encoder output are then fed into a multi-head decoder (N-layer transform decoder) for decoding. And respectively predicting characters and character positions of the decoded features through an output network (for example, two Multilayer Perceptron) to obtain text line recognition results and character positions.
303. Determining the position code of the text characters in the image to be recognized to obtain the position features.
In this embodiment, the position code of the text characters in the image to be recognized is determined by the position encoder in the text recognition model, so as to obtain the position information of the text sequence.
304. Combining the position features with the image features to obtain character features.
In this embodiment, combining the position features with the image features means adjusting the number of channels with a 1×1 convolution to obtain a feature vector of dimension d×H×W, reshaping it into d×(H×W), adding the spatial positional encoding to it, and feeding the result into the multi-head encoder structure.
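The patent does not fix the formula of the spatial positional encoding; a minimal sketch under the assumption that the standard sinusoidal encoding is used could look like this:

```python
import math
import torch

def sinusoidal_position_encoding(length: int, d_model: int) -> torch.Tensor:
    """Standard sine/cosine position encoding; one assumed realization of
    the spatial positional encoding added to the reshaped image features."""
    pos = torch.arange(length).unsqueeze(1)                       # [length, 1]
    div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(length, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe                                                     # [length, d_model]

# image_features: [H*W, batch, d]; adding the encoding yields the character features
# char_features = image_features + sinusoidal_position_encoding(HW, d).unsqueeze(1)
```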
305. Acquiring the preset bounding boxes configured for the text line, so as to configure the search vector according to the preset bounding boxes.
In this embodiment, the number of preset bounding boxes is greater than or equal to the number of characters corresponding to the text line, so each character in the text line can be assigned to a corresponding preset bounding box and recognized within that range.
Specifically, the configuration of the search vector is handled by the multi-head decoder, which has two main inputs: the position-encoded encoder output and the object queries vector. The object queries vector plays a role similar to the anchor boxes in a convolutional-neural-network-based object detection algorithm; its dimension is batch_size × N × 256, where N is a set hyperparameter whose value must be greater than the maximum number of detection targets per picture during model training, here the number of characters in the text line, for example N = 6.
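In DETR-style detectors the object queries are typically a learned embedding table; a minimal sketch under that assumption, using the N = 6 and 256-dimensional width given above:

```python
import torch.nn as nn

N_QUERIES = 6   # N: must be >= the maximum number of characters per text line
D_MODEL = 256

# One learned 256-dim search vector per preset bounding box (assumed realization)
object_queries = nn.Embedding(N_QUERIES, D_MODEL)

batch_size = 4
# Expand to the batch_size x N x 256 layout stated in the text
queries = object_queries.weight.unsqueeze(0).expand(batch_size, -1, -1)
print(queries.shape)  # torch.Size([4, 6, 256])
```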
In particular, the multi-head encoder may be a Transformer encoder structure, i.e. a multi-head self-attention model with an added MLP structure. In one example the encoding structure stacks 6 layers, the output of each layer being the input of the next; the output of the last layer enters the decoder structure as the output of the encoding structure.
It will be appreciated that a deformable Transformer may also be used as a component of the multi-head encoder; the specific form depends on the actual scenario.
306. Associating the search vector with the encoding vector corresponding to the character features to obtain the target text feature.
In this embodiment, the target text feature is the output of the multi-head decoder. First, a self-attention operation is performed on the search vector to obtain a search feature encoding vector; the bounding box vector corresponding to each preset bounding box is then determined based on the search feature encoding vector; and a cross-attention operation is performed between each bounding box vector and the encoding vector corresponding to the character features to obtain the target text feature. That is, after the search vectors (object queries) pass through the multi-head decoder's self-attention and then cross-attention with the encoder output, N target text features (decoder output embeddings) are obtained, and the final characters and their positions are predicted by a feed-forward neural network in the final stage. Decoding is performed on each bounding box vector indicated by the target text features to obtain the decoded and aligned feature information of each preset bounding box, which is then recognized to obtain the text content information and character position information corresponding to the image to be recognized.
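This association step maps naturally onto PyTorch's standard Transformer decoder, whose layers already combine self-attention over the queries with cross-attention against the encoder memory; the following is a sketch under that assumption, not the patent's exact module:

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers, n_queries, batch = 256, 8, 6, 6, 2

decoder_layer = nn.TransformerDecoderLayer(d_model, n_heads)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=n_layers)

memory = torch.randn(40, batch, d_model)          # encoder output: [H*W, batch, d]
queries = torch.randn(n_queries, batch, d_model)  # search vectors:  [N, batch, d]

# Each layer runs self-attention among the N queries, then cross-attention
# between the queries and the character features from the encoder.
target_text_features = decoder(queries, memory)   # [N, batch, d]
```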
In addition, if no text content is retrieved in a preset bounding box, the text content information is output as a blank item; the blank item is then uploaded so that an extension category can be configured based on it, the extension category being used to indicate the recognition of content in the preset bounding box, thereby extending the recognizable character types, for example to special characters. For the output network, the prediction head is a three-layer multilayer perceptron using the ReLU activation function. N characters and positions are predicted jointly; the portion beyond the actual number of characters is predicted as "no object", a newly added class that distinguishes these queries from all real character classes.
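A minimal sketch of such a prediction head, with the character-set size and module names assumed for illustration:

```python
import torch.nn as nn

NUM_CHARS = 5000        # size of the character set (assumed)
NO_OBJECT = NUM_CHARS   # index of the extra "no object" class

class PredictionHead(nn.Module):
    """Three-layer MLP with ReLU, predicting NUM_CHARS + 1 classes per query."""
    def __init__(self, d_model=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, NUM_CHARS + 1),  # +1 for the "no object" class
        )

    def forward(self, decoded):   # decoded: [N, batch, d]
        return self.mlp(decoded)  # character-class logits per preset bounding box
```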
Further, since different text positions may obey different content rules (for example, the content of the last bounding box of a text line may be punctuation), content verification can be performed based on position: the position-configured content category corresponding to the character position information is first acquired; the text content information is then matched against the position-configured content category to obtain content detection information; and the preset bounding boxes are adjusted based on the content detection information, thereby improving the accuracy of text recognition.
In a possible scenario with 6 preset bounding boxes, the corresponding recognition process is shown in fig. 6, a schematic diagram of a model structure of another text character recognition method according to an embodiment of the present application. The figure shows image features being extracted through a feature extraction network (e.g. a CNN): the original [3, H0, W0] image is downsampled to [batch_size, C0, 1, W0/32] features of height 1, where C0 = 2048; a 1×1 convolution reduces the number of channels, the H and W dimensions are merged, and the result is reshaped to [HW, batch_size, C], where C = 256. A position code (position embedding) of the same dimension is then added to the features by the position encoder, and the result is input to the multi-head encoder (e.g. an encoder module consisting of several Transformer layers) for feature encoding, with output [HW, batch_size, C].
It can be understood that, assuming the number of characters in the text line does not exceed N, N search vectors of dimension C, called object queries, are set; after self-attention encoding, they perform cross-attention with the encoder output through the multi-head decoder, with output [N, batch_size, C].
Further, through an output network (for example, two feed-forward neural networks), each of the N tokens outputs a corresponding character class (out of the total number of classes + 1) and a character position (pos_center, h, w). Tokens for which a character and position were retrieved output them; tokens that retrieved nothing output null. For instance, if the text line contains the 4 characters of "text recognition" and there are 6 preset bounding boxes, the 2 redundant preset bounding boxes output null.
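Reading out the per-token predictions can then be sketched as follows (function and variable names are hypothetical); tokens classified as the blank/"no object" class are simply dropped:

```python
import torch

def decode_predictions(class_logits, boxes, no_object_idx, charset):
    """class_logits: [N, num_classes + 1]; boxes: [N, 3] as (pos_center, h, w)."""
    classes = class_logits.argmax(dim=-1)
    results = []
    for cls, box in zip(classes.tolist(), boxes.tolist()):
        if cls == no_object_idx:   # blank item: this query retrieved no character
            continue
        results.append((charset[cls], box))
    # sort by horizontal centre so the characters read in text-line order
    results.sort(key=lambda r: r[1][0])
    return results
```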
307. Decoding based on the target text feature, so as to recognize the text content information and character position information corresponding to the image to be recognized from the decoded feature information.
In this embodiment, the image to be recognized containing text lines is encoded into feature vectors by an encoder consisting of a feature extraction network (CNN) and a multi-head encoder (Transformer). The decoder then obtains different decoded feature vectors by acting on the input features together with the different search vectors (object queries). Finally, the decoded feature vectors are fed into the MLP to obtain the coordinates and characters of the N different bounding boxes.
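Assembling the pieces, the end-to-end structure just described (CNN features, Transformer encoder, query-based decoder, and parallel heads for characters and boxes) can be sketched as below; every name is illustrative, and the input is assumed to come from the backbone sketch given earlier with the positional encoding already added:

```python
import torch.nn as nn

class TextCharRecognizer(nn.Module):
    """Illustrative end-to-end sketch; input is backbone output [H*W, B, d]."""
    def __init__(self, d=256, n_heads=8, n_layers=6, n_queries=6, n_classes=5000):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, n_heads), n_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d, n_heads), n_layers)
        self.queries = nn.Embedding(n_queries, d)
        self.char_head = nn.Linear(d, n_classes + 1)  # characters + "no object"
        self.box_head = nn.Linear(d, 3)               # (pos_center, h, w)

    def forward(self, feats):                    # feats: [H*W, B, d]
        memory = self.encoder(feats)
        q = self.queries.weight.unsqueeze(1).expand(-1, feats.size(1), -1)
        decoded = self.decoder(q, memory)         # [N, B, d]
        return self.char_head(decoded), self.box_head(decoded).sigmoid()
```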
Specifically, the recognition result is shown in fig. 7, a schematic view of a scene of another text character recognition method according to an embodiment of the present application. In other words, this embodiment adopts an end-to-end structure that infers directly on the original picture and outputs the character sequence and the corresponding positions: the model infers a character sequence and bounding boxes from the picture, and matches the inference results one to one, so the final predicted sequence is identical to the ground-truth sequence and no out-of-order problem can occur.
It can be understood that the model fixes the number of preset bounding boxes (num_queries); each query generates a character classification through its own feed-forward network and a bounding box through a fully connected network. Since the classification attribute and position information are generated per individual character, mismatches caused by missed or duplicated detections are avoided.
For model training, the model supervises both the sequence loss (cross-entropy loss) and the position loss (GIoU loss), and supervises the loss of each layer of the Transformer decoder's output, so that the model converges faster and errors do not accumulate.
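A sketch of these losses under stated assumptions: torchvision's generalized_box_iou_loss is used for the GIoU term (an assumed choice, and boxes are assumed converted to corner format), and the same losses are applied to every decoder layer's output as auxiliary supervision:

```python
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss  # torchvision >= 0.13 assumed

def training_loss(per_layer_logits, per_layer_boxes, tgt_classes, tgt_boxes):
    """Sequence (cross-entropy) plus position (L1 + GIoU) loss, summed over
    the output of every decoder layer so intermediate layers are supervised.
    Boxes are assumed already converted to (x1, y1, x2, y2) corner format."""
    total = 0.0
    for logits, boxes in zip(per_layer_logits, per_layer_boxes):
        total = total + F.cross_entropy(logits, tgt_classes)         # sequence loss
        total = total + F.l1_loss(boxes, tgt_boxes)                  # position: L1
        total = total + generalized_box_iou_loss(boxes, tgt_boxes,
                                                 reduction="mean")   # position: GIoU
    return total
```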
It can be appreciated that the training process uses pictures annotated with character positions and character content, and so acquires the ability to predict characters and their corresponding positions.
Specifically, during training a training image containing several training text lines is first obtained; preset bounding boxes are then configured based on the number of characters corresponding to the training text lines, the number of preset bounding boxes being greater than or equal to that character count; image features of the training image are extracted to obtain training image features; the position codes of the training characters in the training image are further determined to obtain training position features; and the training image features are combined with the training position features to obtain training character features.
For the trained mapping, the output of each decoder layer is fed into a feed-forward neural network (FNN) to predict position and character, and the final loss comprises the FNN prediction results of all decoder layers together with the loss against the ground-truth labels of the training set. There are two schemes for matching predictions to the ground truth. One is direct one-to-one mapping, forcing the network's predicted sequence to match the ground-truth sequence, i.e. the training character features are association-mapped to the content of the training characters based on the preset bounding boxes so as to train the text recognition model, whose decoder is used to decode the target text features.
Alternatively, one-to-one matching with the ground truth can be performed after the "no object" entries are screened out of the predicted sequence. Here the position loss consists of an IoU loss and an L1 loss, and the character recognition loss is a log-likelihood loss. That is: the text content assigned to each preset bounding box is acquired; if the text content assigned to a preset bounding box is a training blank item, that preset bounding box is screened out; the training character features are then association-mapped to the content of the training characters based on the screened preset bounding boxes so as to train the text recognition model, which improves the accuracy of matching content to preset bounding boxes.
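The screening of training blanks before the character loss is computed might look like this minimal sketch (names hypothetical):

```python
import torch
import torch.nn.functional as F

def char_loss_without_blanks(logits, targets, blank_idx):
    """logits: [N, num_classes + 1]; targets: [N]. Queries whose assigned
    text content is the training blank item are screened out before the
    log-likelihood (cross-entropy) character loss is computed."""
    keep = targets != blank_idx
    if keep.sum() == 0:
        return logits.new_zeros(())
    return F.cross_entropy(logits[keep], targets[keep])
```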
As can be seen from the above embodiments, an image to be recognized containing text characters is acquired, the text characters being configured based on text lines; image feature extraction is then performed on the image to be recognized to obtain image features; the position code of the text characters in the image to be recognized is determined to obtain position features; the position features are combined with the image features to obtain character features; preset bounding boxes configured for the text lines are further acquired, so that a search vector is configured according to the preset bounding boxes, the number of preset bounding boxes being greater than or equal to the number of characters corresponding to the text lines; the search vector is associated with the encoding vector corresponding to the character features to obtain a target text feature; and decoding is performed based on the target text feature, so that the text content information and character position information corresponding to the image to be recognized are obtained from the decoded feature information. A text character recognition process at character granularity is thus realized: bounding boxes are used to locate the text characters, and recognition is performed after the characters are aligned, so that mutual interference between characters is avoided and the accuracy of text character recognition is improved.
In order to better implement the above-described aspects of the embodiments of the present application, the following provides related apparatuses for implementing the above-described aspects. Referring to fig. 8, fig. 8 is a schematic structural diagram of a text character recognition device according to an embodiment of the present application, and the text character recognition device 800 includes:
an obtaining unit 801, configured to obtain an image to be recognized including text characters, where the text characters are configured based on text lines;
an extracting unit 802, configured to extract image features of the image to be identified, so as to obtain image features;
a determining unit 803, configured to determine a position code of the text character in the image to be identified, so as to obtain a position feature;
the extracting unit 802 is further configured to combine the position feature with the image feature to obtain a character feature;
the obtaining unit 801 is further configured to obtain a preset bounding box configured for the text line, so as to configure a search vector according to the preset bounding box, where the number of the preset bounding boxes is greater than or equal to the number of characters corresponding to the text line;
an identifying unit 804, configured to correlate the search vector with a coding vector corresponding to the character feature, so as to obtain a target text feature;
the identifying unit 804 is further configured to decode based on the target text feature, so as to identify and obtain text content information and character position information corresponding to the image to be identified according to the decoded feature information.
Optionally, in some possible implementations of the present application, the extracting unit 802 is specifically configured to input the image to be identified into a feature extraction network, so as to downsample the image to be identified to obtain features with a preset height;
the extracting unit 802 is specifically configured to reduce the number of channels based on a preset convolution kernel, so as to reconstruct the features of the preset height to obtain the image features;
correspondingly, the determining unit 803 is specifically configured to obtain an image feature dimension of the feature extraction network configuration;
the determining unit 803 is specifically configured to determine, based on the image feature dimension, the position code of the text character in the image to be identified, so as to obtain the position feature, where the dimension of the position code is the same as the image feature dimension.
Optionally, in some possible implementations of the present application, the identifying unit 804 is specifically configured to perform a self-attention operation on the search vector to obtain a search feature encoding vector;
the identifying unit 804 is specifically configured to determine a bounding box vector corresponding to each preset bounding box based on the search feature encoding vector;
the identifying unit 804 is specifically configured to perform a cross-attention operation between each bounding box vector and the encoding vector corresponding to the character features, so as to obtain the target text feature;
correspondingly, the identifying unit 804 is specifically configured to decode based on each bounding box vector indicated by the target text feature, so as to obtain feature information after each preset bounding box is decoded and aligned;
the identifying unit 804 is specifically configured to identify the feature information after decoding and aligning each preset bounding box, so as to obtain text content information and character position information corresponding to the image to be identified.
Optionally, in some possible implementations of the present application, the identifying unit 804 is specifically configured to output text content information as a blank item if text content is not retrieved in the preset bounding box;
the identifying unit 804 is specifically configured to upload the blank item, so as to configure an extension category based on the blank item, where the extension category is used to indicate identification of the content in the preset bounding box.
Optionally, in some possible implementation manners of the present application, the identifying unit 804 is specifically configured to obtain a location configuration content category corresponding to the character location information;
the identifying unit 804 is specifically configured to match the text content information with the location configuration content category to obtain content detection information;
the identifying unit 804 is specifically configured to adjust the preset bounding box based on the content detection information.
Optionally, in some possible implementations of the present application, the identifying unit 804 is specifically configured to obtain a training image, where the training image includes a plurality of training text lines;
the identifying unit 804 is specifically configured to configure a preset bounding box based on the number of characters corresponding to the training text line, where the number of preset bounding boxes is greater than or equal to the number of characters corresponding to the training text line;
the identifying unit 804 is specifically configured to extract image features of the training image to obtain training image features;
the identifying unit 804 is specifically configured to determine a position code of a training character in the training image, so as to obtain a training position feature;
the identifying unit 804 is specifically configured to combine the training image feature with the training position feature to obtain a training character feature;
the recognition unit 804 is specifically configured to perform association mapping on the training character feature and content corresponding to the training character based on the preset bounding box, so as to train a text recognition model, where a decoder in the trained text recognition model is used to decode the target text feature.
Optionally, in some possible implementations of the present application, the identifying unit 804 is specifically configured to obtain text content allocated in the preset bounding box;
the identifying unit 804 is specifically configured to screen out a preset bounding box corresponding to the training blank item if the text content allocated in the preset bounding box is the training blank item;
the identifying unit 804 is specifically configured to perform association mapping on the training character feature and the content corresponding to the training character based on the screened preset bounding boxes, so as to train the text recognition model.
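As an illustrative sketch of the training-side association mapping, the character loss below screens out preset bounding boxes that were assigned a training blank item; the cross-entropy objective and the given box-to-character assignment are assumptions made for the sketch, not the training procedure fixed by the present application.

import torch
import torch.nn.functional as F

BLANK = 0  # assumed class index for the training blank item

def character_loss(char_logits: torch.Tensor, assigned: torch.Tensor) -> torch.Tensor:
    """char_logits: (n_boxes, vocab + 1); assigned: (n_boxes,) target class per preset box."""
    keep = assigned != BLANK                   # screen out boxes holding blank items
    if keep.any():
        return F.cross_entropy(char_logits[keep], assigned[keep])
    return char_logits.sum() * 0.0             # all boxes blank: zero loss

# A training text line of 3 characters mapped into 25 preset bounding boxes.
logits = torch.randn(25, 5001, requires_grad=True)
assigned = torch.cat([torch.tensor([7, 19, 3]), torch.full((22,), BLANK)])
loss = character_loss(logits, assigned)
loss.backward()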
In summary, an image to be recognized containing text characters is acquired, where the text characters are configured based on text lines; image features are then extracted from the image to be recognized to obtain the image features; the position codes of the text characters in the image to be recognized are determined to obtain position features; the position features are combined with the image features to obtain character features; preset bounding boxes configured for the text lines are further acquired so that search vectors can be configured according to the preset bounding boxes, where the number of the preset bounding boxes is greater than or equal to the number of characters corresponding to the text lines; the search vectors are associated with the encoding vectors corresponding to the character features to obtain target text features; and decoding is performed based on the target text features, so that text content information and character position information corresponding to the image to be recognized are obtained from the decoded feature information. A text character recognition process at character granularity is thereby realized: the bounding boxes locate the text characters and recognition is performed after the characters are aligned, which avoids mutual interference among the characters and improves the accuracy of text character recognition.
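Putting the pieces together, the inference flow summarized above can be expressed as a single function; every module passed in stands in for the corresponding unit, so this is a shape-level sketch under assumed interfaces, not the concrete network of the present application.

import torch

def recognize_text_line(image, backbone, pos_code, decoder, char_head, box_head,
                        n_boxes: int = 25):
    """Shape-level sketch: image -> (text content logits, character positions)."""
    feats = backbone(image)                            # image features, (1, W, D)
    chars = feats + pos_code[: feats.size(1)]          # character features
    queries = torch.zeros(1, n_boxes, feats.size(-1))  # search vectors, one per preset box
    target = decoder(queries, chars)                   # associate queries with character features
    return char_head(target), box_head(target)         # content and position information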
The embodiment of the present application further provides a terminal device, as shown in fig. 9, which is a schematic structural diagram of another terminal device provided in an embodiment of the present application. For convenience of explanation, only the portion related to the embodiment of the present application is shown; for specific technical details that are not disclosed, please refer to the method portion of the embodiments of the present application. The terminal may be any terminal device, including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point-of-sale (POS) terminal, a vehicle-mounted computer, and the like; the following takes a mobile phone as an example.
Fig. 9 is a block diagram of part of the structure of a mobile phone related to the terminal provided by an embodiment of the present application. Referring to fig. 9, the mobile phone includes: a radio frequency (RF) circuit 910, a memory 920, an input unit 930, a display unit 940, a sensor 950, an audio circuit 960, a wireless fidelity (WiFi) module 970, a processor 980, and a power supply 990. Those skilled in the art will appreciate that the structure shown in fig. 9 does not limit the mobile phone, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The following describes the components of the mobile phone in detail with reference to fig. 9:
The RF circuit 910 may be used for receiving and transmitting signals during messaging or a call; in particular, downlink information received from a base station is delivered to the processor 980 for processing, and uplink data is sent to the base station. Typically, the RF circuit 910 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 910 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including, but not limited to, global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), long term evolution (LTE), email, short message service (SMS), and the like.
The memory 920 may be used to store software programs and modules, and the processor 980 performs various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 920. The memory 920 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required for at least one function (such as a sound playing function or an image playing function), and the like, and the data storage area may store data (such as audio data or a phonebook) created according to the use of the mobile phone. In addition, the memory 920 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The input unit 930 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 930 may include a touch panel 931 and other input devices 932. The touch panel 931, also referred to as a touch screen, may collect touch operations by a user on or near it (for example, operations performed on or near the touch panel 931 with a finger, a stylus, or any other suitable object or accessory, including contactless touch operations within a certain range of the touch panel 931) and drive the corresponding connection device according to a preset program. Optionally, the touch panel 931 may include two parts: a touch detection device and a touch controller. The touch detection device detects the position touched by the user, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends the coordinates to the processor 980; it can also receive and execute commands from the processor 980. In addition, the touch panel 931 may be implemented as a resistive, capacitive, infrared, or surface-acoustic-wave panel, among other types. Besides the touch panel 931, the input unit 930 may include other input devices 932, which may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys or an on/off key), a trackball, a mouse, and a joystick.
The display unit 940 may be used to display information input by the user or provided to the user, as well as the various menus of the mobile phone. The display unit 940 may include a display panel 941; optionally, the display panel 941 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. Further, the touch panel 931 may overlay the display panel 941; when the touch panel 931 detects a touch operation on or near it, it transfers the operation to the processor 980 to determine the type of the touch event, and the processor 980 then provides a corresponding visual output on the display panel 941 according to the type of the touch event. Although in fig. 9 the touch panel 931 and the display panel 941 are two separate components implementing the input and output functions of the mobile phone, in some embodiments the touch panel 931 and the display panel 941 may be integrated to implement the input and output functions of the mobile phone.
The mobile phone may also include at least one sensor 950, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor: the ambient light sensor may adjust the brightness of the display panel 941 according to the brightness of the ambient light, and the proximity sensor may turn off the display panel 941 and/or the backlight when the mobile phone is moved to the ear. As one kind of motion sensor, an accelerometer can detect the magnitude of acceleration in all directions (generally along three axes) and, when stationary, the magnitude and direction of gravity; it can be used for applications that recognize the posture of the mobile phone (such as landscape/portrait switching, related games, and magnetometer pose calibration), for vibration-recognition functions (such as a pedometer or tap detection), and the like. Other sensors that may also be configured on the mobile phone, such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, are not described in detail herein.
The audio circuit 960, a speaker 961, and a microphone 962 may provide an audio interface between the user and the mobile phone. The audio circuit 960 may transmit an electrical signal converted from received audio data to the speaker 961, which converts it into a sound signal for output; conversely, the microphone 962 converts collected sound signals into electrical signals, which the audio circuit 960 receives and converts into audio data; the audio data is then processed by the processor 980 and either transmitted via the RF circuit 910 to, for example, another mobile phone, or output to the memory 920 for further processing.
WiFi is a short-range wireless transmission technology. Through the WiFi module 970, the mobile phone can help the user send and receive emails, browse web pages, access streaming media, and the like, providing wireless broadband Internet access. Although fig. 9 shows the WiFi module 970, it is understood that it is not an essential component of the mobile phone and may be omitted as needed without changing the essence of the invention.
The processor 980 is the control center of the mobile phone; it connects the various parts of the entire phone using various interfaces and lines, and performs the phone's functions and processes its data by running or executing the software programs and/or modules stored in the memory 920 and invoking the data stored in the memory 920, thereby monitoring the mobile phone as a whole. Optionally, the processor 980 may include one or more processing units; optionally, the processor 980 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interfaces, application programs, and the like, and the modem processor mainly handles wireless communication. It will be appreciated that the modem processor need not be integrated into the processor 980.
The mobile phone further includes a power supply 990 (such as a battery) for powering the various components. Optionally, the power supply may be logically connected to the processor 980 through a power management system, so that functions such as charging, discharging, and power consumption management are implemented through the power management system.
Although not shown, the mobile phone may further include a camera, a Bluetooth module, and the like, which are not described herein.
In the embodiment of the present application, the processor 980 included in the terminal further has the function of executing each step of the text character recognition method described above.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application. The server 1000 may vary considerably in configuration or performance and may include one or more central processing units (CPUs) 1022 (e.g., one or more processors), a memory 1032, and one or more storage media 1030 (e.g., one or more mass storage devices) storing application programs 1042 or data 1044. The memory 1032 and the storage medium 1030 may be transitory or persistent storage. The programs stored on the storage medium 1030 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the central processing unit 1022 may be configured to communicate with the storage medium 1030 and execute, on the server 1000, the series of instruction operations in the storage medium 1030.
The server 1000 may also include one or more power supplies 1026, one or more wired or wireless network interfaces 1050, one or more input/output interfaces 1058, and/or one or more operating systems 1041, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The steps performed by the text character recognition apparatus in the above-described embodiments may be based on the server structure shown in fig. 10.
An embodiment of the present application further provides a computer-readable storage medium having text character recognition instructions stored therein which, when run on a computer, cause the computer to perform the steps performed by the text character recognition apparatus in the methods described in the embodiments of fig. 3 to fig. 7.
An embodiment of the present application also provides a computer program product comprising text character recognition instructions which, when run on a computer, cause the computer to perform the steps performed by the text character recognition apparatus in the methods described in the embodiments of fig. 3 to fig. 7.
An embodiment of the present application also provides a text character recognition system, which may comprise the text character recognition apparatus in the embodiment shown in fig. 8, the terminal device in the embodiment shown in fig. 9, or the server shown in fig. 10.
It will be clear to those skilled in the art that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division of the units is merely a logical functional division, and there may be other divisions in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections via some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a text character recognition apparatus, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or any other medium capable of storing program code.
The above embodiments are only intended to illustrate the technical solution of the present application, not to limit it. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (11)

1. A method for recognizing text characters, comprising:
acquiring an image to be recognized containing text characters, wherein the text characters are configured based on text lines;
inputting the image to be recognized into a text recognition model to extract image features of the image to be recognized so as to obtain image features;
determining a position code of the text characters in the image to be recognized so as to obtain a position feature;
combining the position feature with the image features to obtain a character feature;
acquiring preset bounding boxes configured for the text lines so as to configure a search vector according to the preset bounding boxes, wherein the number of the preset bounding boxes is greater than or equal to the number of characters corresponding to the text lines;
associating the search vector with an encoding vector corresponding to the character feature to obtain a target text feature; and
decoding based on the target text feature, so that text content information and character position information corresponding to the image to be recognized are obtained through recognition according to the decoded feature information.
2. The method according to claim 1, wherein the extracting image features of the image to be recognized so as to obtain image features comprises:
inputting the image to be recognized into a feature extraction network to downsample the image to be recognized so as to obtain features of a preset height;
reducing the number of channels based on a preset convolution kernel to reconstruct the features of the preset height so as to obtain the image features;
correspondingly, the determining a position code of the text characters in the image to be recognized so as to obtain a position feature comprises:
acquiring an image feature dimension configured by the feature extraction network; and
determining the position code of the text characters in the image to be recognized based on the image feature dimension to obtain the position feature, wherein the dimension of the position code is the same as the image feature dimension.
3. The method according to claim 1, wherein the associating the search vector with the encoding vector corresponding to the character feature to obtain a target text feature comprises:
performing a self-attention operation on the search vector to obtain a search feature encoding vector;
determining a bounding box vector corresponding to each preset bounding box based on the search feature encoding vector;
performing a mutual attention operation on each bounding box vector and the encoding vector corresponding to the character feature to obtain the target text feature;
correspondingly, the decoding based on the target text feature, so that text content information and character position information corresponding to the image to be recognized are obtained through recognition according to the decoded feature information, comprises:
decoding based on each bounding box vector indicated by the target text feature to obtain decoded and aligned feature information of each preset bounding box; and
recognizing the decoded and aligned feature information of each preset bounding box to obtain the text content information and the character position information corresponding to the image to be recognized.
4. The method according to claim 3, wherein the method further comprises:
if text content is not retrieved in the preset bounding box, outputting the text content information as a blank item; and
uploading the blank item so as to configure an extension category based on the blank item, wherein the extension category is used for indicating recognition of the content in the preset bounding box.
5. The method according to claim 3, wherein the method further comprises:
acquiring a position configuration content category corresponding to the character position information;
matching the text content information with the position configuration content category to obtain content detection information;
and adjusting the preset bounding box based on the content detection information.
6. The method according to any one of claims 1 to 5, wherein the text recognition model is trained by steps comprising:
acquiring a training image, wherein the training image comprises a plurality of training text lines;
configuring preset bounding boxes based on the number of characters corresponding to the training text lines, wherein the number of the preset bounding boxes is greater than or equal to the number of characters corresponding to the training text lines;
extracting image features of the training image to obtain training image features;
determining the position codes of training characters in the training images to obtain training position features;
Combining the training image features with the training position features to obtain training character features;
and carrying out association mapping on the training character features and the content corresponding to the training characters based on the preset bounding box so as to train the text recognition model, wherein a decoder in the trained text recognition model is used for decoding the target text features.
7. The method according to claim 6, wherein the training step of the text recognition model further comprises:
acquiring text content distributed in the preset bounding box;
if the text content distributed in the preset bounding box is a training blank item, screening out the preset bounding box corresponding to the training blank item;
and carrying out association mapping on the training character features and the content corresponding to the training characters based on the screened preset bounding box so as to train the text recognition model.
8. A text character recognition apparatus, comprising:
an acquisition unit, configured to acquire an image to be recognized containing text characters, wherein the text characters are configured based on text lines;
an extraction unit, configured to extract image features of the image to be recognized so as to obtain image features;
a determining unit, configured to determine a position code of the text characters in the image to be recognized so as to obtain a position feature;
the extraction unit is further configured to combine the position feature with the image features to obtain a character feature;
the acquisition unit is further configured to acquire preset bounding boxes configured for the text lines so as to configure a search vector according to the preset bounding boxes, wherein the number of the preset bounding boxes is greater than or equal to the number of characters corresponding to the text lines;
an identifying unit, configured to associate the search vector with an encoding vector corresponding to the character feature so as to obtain a target text feature; and
the identifying unit is further configured to decode based on the target text feature, so that text content information and character position information corresponding to the image to be recognized are obtained through recognition according to the decoded feature information.
9. A computer device, comprising a processor and a memory, wherein:
the memory is configured to store program code, and the processor is configured to perform the method for recognizing text characters according to any one of claims 1 to 7 according to instructions in the program code.
10. A computer program product comprising a computer program/instructions stored on a computer-readable storage medium, wherein the computer program/instructions, when executed by a processor, implement the steps of the method for recognizing text characters according to any one of claims 1 to 7.
11. A computer-readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the method for recognizing text characters according to any one of claims 1 to 7.
CN202211144562.4A 2022-09-20 2022-09-20 Text character recognition method, device and storage medium Pending CN116978021A (en)

Priority Applications (1)

Application Number: CN202211144562.4A | Priority Date: 2022-09-20 | Filing Date: 2022-09-20 | Title: Text character recognition method, device and storage medium

Publications (1)

Publication Number: CN116978021A | Publication Date: 2023-10-31

Family

ID=88475438

Family Applications (1)

Application Number: CN202211144562.4A | Status: Pending | Publication: CN116978021A (en)

Country Status (1)

Country: CN | Publication: CN116978021A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination