CN116189208A - Method, apparatus, device and medium for text recognition - Google Patents

Method, apparatus, device and medium for text recognition

Info

Publication number
CN116189208A
Authority
CN
China
Prior art keywords
text
corner
feature representation
image
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310118841.1A
Other languages
Chinese (zh)
Inventor
张家鑫
黄灿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Douyin Vision Co Ltd
Original Assignee
Douyin Vision Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Douyin Vision Co Ltd filed Critical Douyin Vision Co Ltd
Priority to CN202310118841.1A priority Critical patent/CN116189208A/en
Publication of CN116189208A publication Critical patent/CN116189208A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 - Document-oriented image-based pattern recognition
    • G06V30/41 - Analysis of document content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/18 - Extraction of features or characteristics of the image
    • G06V30/1801 - Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/24 - Character recognition characterised by the processing or recognition method

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present disclosure provide methods, apparatus, devices, and media for text recognition. A method for text recognition comprising: determining corner feature representation of the image by performing corner detection on the image to be identified, wherein the corner feature representation characterizes at least one corner in the image, and the image comprises a text sequence; extracting a text feature representation for the text sequence from the image; determining an attention weight for the text feature representation based on the corner feature representation; and generating a recognition result for the text sequence based on the text feature representation and the attention weight. Thereby, the accuracy of the recognition of the text sequence in the image can be ensured.

Description

Method, apparatus, device and medium for text recognition
Technical Field
Example embodiments of the present disclosure relate generally to the field of computers and, more particularly, relate to methods, apparatuses, devices, and computer-readable storage media for text recognition.
Background
With the development of machine learning technology, machine learning models have become available to perform tasks in a variety of application environments. Model-based visual tasks are used to process visual data, such as images, video, and the like. Examples of visual tasks include, but are not limited to, image classification, object detection, semantic segmentation, Optical Character Recognition (OCR), and the like. In models of visual tasks, the challenge is how to extract features from image data that can accurately characterize the target information, and how to generate the target information based on the features. The model for extracting the feature representation of the image is commonly referred to as an encoder, and the model for generating the target information based on the feature representation is commonly referred to as a decoder.
Disclosure of Invention
In a first aspect of the present disclosure, a method for text recognition is provided. The method comprises the following steps: determining corner feature representation of the image by performing corner detection on the image to be identified, wherein the corner feature representation characterizes at least one corner in the image, and the image comprises a text sequence; extracting a text feature representation for the text sequence from the image; determining an attention weight for the text feature representation based on the corner feature representation; and generating a recognition result for the text sequence based on the text feature representation and the attention weight.
In a second aspect of the present disclosure, an apparatus for text recognition is provided. The apparatus comprises: a corner detection module configured to determine a corner feature representation of an image by performing corner detection on the image to be identified, wherein the corner feature representation characterizes at least one corner in the image, and the image comprises a text sequence; a text feature extraction module configured to extract a text feature representation for the text sequence from the image; a weight determination module configured to determine an attention weight for the text feature representation based on the corner feature representation; and a result generation module configured to generate a recognition result for the text sequence based on the text feature representation and the attention weight.
In a third aspect of the present disclosure, an electronic device is provided. The device comprises at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit. The instructions, when executed by the at least one processing unit, cause the device to perform the method of the first aspect.
In a fourth aspect of the present disclosure, a computer-readable storage medium is provided. The medium has a computer program stored thereon which, when executed by a processor, implements the method of the first aspect.
It should be understood that what is described in this section of the disclosure is not intended to limit key features or essential features of the embodiments of the disclosure, nor is it intended to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, wherein like or similar reference numerals denote like or similar elements, in which:
FIG. 1 illustrates a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented;
FIG. 2 illustrates a schematic diagram of an example architecture of text recognition, according to some embodiments of the present disclosure;
FIG. 3 illustrates a schematic diagram of an example of implementing text recognition using an encoder-decoder model, according to some embodiments of the present disclosure;
FIG. 4 illustrates a schematic diagram of an example of implementing text recognition using a conformer model, according to some embodiments of the present disclosure;
FIG. 5 illustrates a flow chart of a process for text recognition according to some embodiments of the present disclosure;
FIG. 6 illustrates a block diagram of an apparatus for text recognition according to some embodiments of the present disclosure; and
fig. 7 illustrates a block diagram of an electronic device in which one or more embodiments of the disclosure may be implemented.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure have been illustrated in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather, these embodiments are provided so that this disclosure will be more thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
In describing embodiments of the present disclosure, the term "comprising" and its like should be taken to be open-ended, i.e., including, but not limited to. The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The term "some embodiments" should be understood as "at least some embodiments". Other explicit and implicit definitions are also possible below. As used herein, the term "model" may represent an associative relationship between individual data. For example, the above-described association relationship may be obtained based on various technical schemes currently known and/or to be developed in the future.
It will be appreciated that the data (including but not limited to the data itself, the acquisition or use of the data) involved in the present technical solution should comply with the corresponding legal regulations and the requirements of the relevant regulations.
It will be appreciated that, before the technical solutions disclosed in the embodiments of the present disclosure are used, the user should be informed, in an appropriate manner and in accordance with relevant laws and regulations, of the type, usage scope, usage scenario, etc. of the personal information involved in the present disclosure, and the user's authorization should be obtained.
For example, in response to receiving an active request from a user, a prompt is sent to the user to explicitly inform the user that the operation it has requested will require obtaining and using the user's personal information. Thus, the user can autonomously choose, according to the prompt information, whether to provide personal information to the software or hardware, such as an electronic device, application program, server, or storage medium, that executes the operations of the technical scheme of the present disclosure.
As an alternative but non-limiting embodiment, in response to receiving an active request from a user, the prompt may be sent to the user, for example, in a pop-up window in which the prompt may be presented in text. In addition, a selection control for the user to select "agree" or "disagree" to provide personal information to the electronic device may also be carried in the pop-up window.
It will be appreciated that the above-described notification and user authorization process is merely illustrative and not limiting of the embodiments of the present disclosure, and that other ways of satisfying relevant legal regulations may be applied to the embodiments of the present disclosure.
As used herein, the term "model" may learn the association between the respective inputs and outputs from training data so that, for a given input, a corresponding output may be generated after training is completed. The generation of the model may be based on machine learning techniques. Deep learning is a machine learning algorithm that processes inputs and provides corresponding outputs through the use of multiple layers of processing units. The neural network model is one example of a deep learning-based model. The "model" may also be referred to herein as a "machine learning model," "machine learning network," or "learning network," which terms are used interchangeably herein.
A "neural network" is a machine learning network based on deep learning. The neural network is capable of processing the input and providing a corresponding output, which generally includes an input layer and an output layer, and one or more hidden layers between the input layer and the output layer. Neural networks used in deep learning applications typically include many hidden layers, thereby increasing the depth of the network. The layers of the neural network are connected in sequence such that the output of the previous layer is provided as an input to the subsequent layer, wherein the input layer receives the input of the neural network and the output of the output layer is provided as the final output of the neural network. Each layer of the neural network includes one or more nodes (also referred to as processing nodes or neurons), each of which processes input from a previous layer.
Generally, machine learning may generally include three phases, namely a training phase, a testing phase, and an application phase (also referred to as an inference phase). In the training phase, a given model may be trained using a large amount of training data, iteratively updating parameter values until the model is able to obtain consistent inferences from the training data that meet the desired goal. By training, the model may be considered to be able to learn the association between input and output (also referred to as input to output mapping) from the training data. Parameter values of the trained model are determined. In the test phase, test inputs are applied to the trained model to test whether the model is capable of providing the correct outputs, thereby determining the performance of the model. In the application phase, the model may be used to process the actual input based on the trained parameter values, determining the corresponding output.
As briefly mentioned above, in models of visual tasks, the challenge is how to extract features from image data that can accurately characterize target information, and how to generate the target information based on the features. In the case where the visual task is text recognition (i.e., OCR) of an image, the target information is the sequence of text contained in the image. It has been proposed to train text recognition models to extract features from images for recognizing the text sequences contained therein. However, in some cases, the text sequence contained in the image data differs from a conventional text sequence (e.g., it contains artistic words), and a scheme that directly performs text recognition on the image data sometimes cannot accurately obtain the text sequence contained in the image data, resulting in poor robustness of text recognition.
The embodiment of the disclosure provides a scheme for text recognition. According to this scheme, corner feature representations of the image and text feature representations of the image for the text sequence are extracted from the image, respectively, and recognition results for the text sequence are generated based on the corner feature representations and the text feature representations. In this way, the robustness of text recognition can be improved, and recognition of various text sequences in an image can be more accurately realized.
Example Environment
FIG. 1 illustrates a schematic diagram of an example environment 100 in which embodiments of the present disclosure may be implemented. As shown in fig. 1, environment 100 may include an electronic device 120.
The image 110 includes a text sequence therein, and the electronic device 120 may perform text recognition on the image 110 to obtain a text recognition result 130, which indicates the text sequence in the image 110. The text sequence may contain text units in any font and in any language. In some cases, the text sequence may also contain an artistic word. An artistic word is a deformed font obtained by artistically processing ordinary characters, i.e., a font variation with pattern or decorative significance.
In some embodiments, the electronic device 120 may support or run applications that support text recognition functionality. Upon receiving the text recognition instruction, the electronic device 120 may perform text recognition on the image 110 to obtain a text recognition result 130, that is, a text sequence included in the image 110.
In some embodiments, the electronic device 120 may process the image 110 using image processing techniques to enable text recognition. The image processing techniques may include, for example, image preprocessing techniques, image segmentation techniques, image recognition techniques, and the like. In some embodiments, the electronic device 120 may also employ a machine learning model, such as a neural network, to perform text recognition on the image 110. The neural network may include, for example, a convolutional neural network, a feed-forward neural network, a recurrent neural network, and the like.
Electronic device 120 may include any computing system having computing capabilities, such as various computing devices/systems, terminal devices, servers, etc. The terminal device may relate to any type of mobile terminal, fixed terminal, or portable terminal, including mobile handsets, desktop computers, laptop computers, notebook computers, netbook computers, tablet computers, media computers, multimedia tablets, or any combination of the foregoing, including accessories and peripherals of these devices, or any combination thereof. Servers include, but are not limited to, mainframes, edge computing nodes, computing devices in cloud environments, and the like.
In some embodiments, the electronic device 120 may include an image acquisition component (e.g., a camera). The electronic device 120 may capture the image 110 through an image acquisition device. In some embodiments, the electronic device 120 may also receive images 110 sent by other electronic devices, or obtain images 110 cached by itself.
In some embodiments, the image 110 for text recognition may be part of a larger image. An image region containing the text sequence may be located from a larger image and segmented to obtain an image 110 for text recognition. Of course, the image 110 may itself be a complete image.
It should be understood that the components and arrangements in environment 100 shown in fig. 1 are merely examples, and that a computing system suitable for implementing the example implementations described in this disclosure may include one or more different components, other components, and/or different arrangements.
Some example embodiments of the present disclosure will be described below with continued reference to the accompanying drawings.
Example architecture
Fig. 2 illustrates a schematic diagram of an example architecture 200 of text recognition, according to some embodiments of the present disclosure. In general, architecture 200 includes corner detection module 210, first feature extraction module 220, attention module 230, and text recognition module 240. These elements and other elements not shown in architecture 200 may be implemented in electronic device 120.
The image 110 contains a sequence of text. In some embodiments, the text sequence contained in the image 110 may include at least one artistic word. Of course, the text sequence in image 110 may also include other types of text. Since different fonts, particularly artistic words, have their own characteristics, the electronic device 120 needs to determine invariant features (e.g., corner points) of the image 110 in order to ensure accuracy of text recognition for the image 110. Corner points are typically points in the image data where an attribute is particularly prominent, and may be isolated points with maximum or minimum intensity on some attribute, end points of line segments, etc. Corner points may also be referred to as key points, feature points, points of interest, etc.
The corner detection module 210 is configured to determine a corner feature representation 201 of the image 110 by performing corner detection on the image 110. The corner feature representation 201 characterizes at least one corner in the image 110. Herein, the feature representation has a multidimensional vector form for characterizing the corresponding information. The corner detection module 210 determines the corner of the image 110, and effectively reduces the data volume of subsequent calculation while maintaining important features of the image 110, thereby improving the speed and accuracy of calculation.
In some embodiments, the corner detection module 210 determines the corners of the image 110 by a corner detection algorithm. Corner detection algorithms can be roughly classified into corner detection based on a gray image, corner detection based on a binary image, and corner detection based on a contour curve. The corner detection algorithm may include, for example, the Harris corner detection algorithm, the Shi-Tomasi corner detection algorithm, the SIFT corner detection algorithm, the FAST corner detection algorithm, and the like. In some embodiments, the corner detection module 210 may also perform corner detection using a trained corner detection model. The specific type of model structure can be selected according to the actual application requirements.
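To make this step concrete, the following is a minimal sketch of how corner detection could be performed with OpenCV's Shi-Tomasi detector (one of the algorithms listed above). The function name detect_corners and all parameter values are illustrative assumptions, not details prescribed by this disclosure.

```python
# Minimal sketch (illustrative assumption, not the disclosure's prescribed
# method): detecting corners with OpenCV's Shi-Tomasi detector.
import cv2
import numpy as np

def detect_corners(image: np.ndarray, max_corners: int = 200) -> np.ndarray:
    """Return an (N, 2) array of (x, y) corner coordinates."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    corners = cv2.goodFeaturesToTrack(
        gray,
        maxCorners=max_corners,  # upper bound on the number of corners returned
        qualityLevel=0.01,       # minimum accepted corner quality (assumed value)
        minDistance=5,           # minimum spacing between corners (assumed value)
    )
    if corners is None:
        return np.empty((0, 2), dtype=np.float32)
    return corners.reshape(-1, 2)
```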
After the corner detection module 210 determines at least one corner of the image 110, a corner feature representation 201 for characterizing the at least one corner is determined. In some embodiments, the corner detection module 210 may first determine a corner map of the image 110, and then extract corners from the corner map to determine the corner feature representation of the image 110. Such an example embodiment is described below with reference to fig. 3. Fig. 3 illustrates a schematic diagram of an example of implementing text recognition using an encoder-decoder model, according to some embodiments of the present disclosure. As shown in fig. 3, the corner detection module 210 includes a corner map determination module 310 and a corner extraction module 320.
The corner map determination module 310 is configured to determine a corner map 301 corresponding to the image 110 by performing corner detection on the image 110. The corner map 301 identifies the location of at least one corner in the image.
In some embodiments, the corner map determination module 310 obtains at least one corner of the image 110 by performing corner detection on the image 110. The at least one corner may include, for example, a connection point of a graphic contour line in the image 110, a pixel corresponding to a local maximum of the gray gradient, an intersection of two or more edges, a point where the gradient value and the rate of change of the gradient direction are high, and the like. After the corner map determination module 310 obtains the at least one corner, the corner map 301 corresponding to the image 110 is determined, where the dimensions of the corner map 301 and the image 110 may be the same.
The corner map 301 is provided to a corner extraction module 320 such that the corner extraction module 320 extracts the corner feature representation 201 from the corner map 301. In some embodiments, the corner extraction module 320 may extract the corner feature representation 201 using a trained corner extraction model. The corner extraction model may comprise, for example, a Convolutional Neural Network (CNN).
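As an illustration of such a corner extraction model, the sketch below shows a small CNN that maps a single-channel corner map to a sequence of corner feature vectors. The class name, layer sizes, strides, and the flattening into a feature sequence are assumptions made here for clarity, not details given in this disclosure.

```python
# Hypothetical CNN-based corner extraction model (in the role of module 320):
# maps a corner map of shape (batch, 1, H, W) to a corner feature
# representation of shape (batch, H/4 * W/4, feature_dim).
import torch
import torch.nn as nn

class CornerExtractor(nn.Module):
    def __init__(self, feature_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, feature_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )

    def forward(self, corner_map: torch.Tensor) -> torch.Tensor:
        feats = self.conv(corner_map)            # (batch, feature_dim, H/4, W/4)
        return feats.flatten(2).transpose(1, 2)  # (batch, seq_len, feature_dim)
```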
With continued reference to fig. 2. The first feature extraction module 220 is configured to extract a text feature representation 202 for the text sequence from the image 110. The text feature representation 202 includes a plurality of feature elements for the text sequence. The plurality of feature elements may, for example, characterize semantics, syntax, etc. of different text units (e.g., different words or characters) in the text sequence. In some embodiments, the feature elements of the text sequence exist in the form of vectors, and the text feature representation 202 may include a plurality of vectors corresponding to the plurality of feature elements of the text sequence. For example, if the text sequence is "A and B", the vector corresponding to "A" is $(a_1, b_1, c_1, d_1)$, the vector corresponding to "and" is $(a_2, b_2, c_2, d_2)$, and the vector corresponding to "B" is $(a_3, b_3, c_3, d_3)$; the text feature representation 202 may contain these three vectors.
The corner feature representation 201 is provided to the attention module 230 together with the text feature representation 202. The attention module 230 determines the attention weight 203 for the text feature representation 202 based on the corner feature representation 201. In some embodiments, the attention weight 203 includes a weight value for each of the plurality of feature elements in the text feature representation 202. For example, if the text sequence is "A and B", the attention weight 203 may include weight values for "A", "and", and "B", respectively.
In some embodiments, the attention module 230 may employ an attention mechanism to determine the attention weight 203 for the text feature representation 202. The attention module 230 may include, for example, a cross-attention module. The attention weight 203 is provided to the text recognition module 240 along with the text feature representation 202 such that the text recognition module 240 generates the text recognition result 130 for the text sequence based on the attention weight 203 and the text feature representation 202. The text recognition module 240 may employ a text recognition algorithm for text recognition or perform text recognition using a trained text recognition model. By introducing corner information as an aid in the process of extracting text feature representation of an image, character invariant features in a text sequence can be better extracted, and accuracy of text recognition is further improved.
In some embodiments, the example architecture 200 of text recognition may be based on an encoder and decoder architecture, such as the model architecture shown in fig. 3. The attention module 230 may be provided in an encoder model and the electronic device 120 may determine the attention weight 203 based on the corner feature representation 201 using the attention module 230 comprised in the encoder model.
Referring back to fig. 3. As shown in fig. 3, the architecture 200 may include an encoder model 330 and a decoder model 350 in addition to the corner detection module 210. The encoder model 330 includes a first feature extraction module 220, an attention module 230, and a second feature extraction module 340 for extracting a target text feature representation of the image 110. The image 110 is provided to the corner detection module 210 and the encoder model 330. The first feature extraction module 220 in the encoder model 330 provides the extracted text feature representation 202 to the attention module 230 along with the corner feature representation 201 extracted by the corner extraction module 320.
In some embodiments, the attention module 230 in the encoder model 330 may include a transformer-based attention module. The processing of the transformer-based attention module 230 can be expressed as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where $Q$ represents the query features, $K$ the key features, $V$ the value features, and $d_k$ the number of columns (i.e., the feature dimension) of $Q$ and $K$. The above process can be understood as calculating an attention weight matrix using the query features $Q$ and the key features $K$, and weighting the value features $V$ with the attention weight matrix.
In some embodiments, the corner feature representation 201 determined by the corner detection module 210 is defined as the query feature of the transformer-based attention module 230, and the text feature representation 202 is defined as the key feature (key) and the value feature (value) of the transformer-based attention module 230. In some embodiments, after the transformer-based attention module 230 calculates the attention weight 203 using the query features Q and the key features K, the second feature extraction module 340 performs a weighted summation of the text feature representation 202 using the attention weight 203 to obtain a weighted text feature representation.
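The cross-attention step just described can be sketched as follows, with the corner feature representation acting as the query and the text feature representation as both key and value. The class name, the separate linear projections, and the feature dimension are assumptions for illustration, not details fixed by this disclosure.

```python
# Sketch of corner-guided cross-attention: softmax(Q K^T / sqrt(d_k)) V, where
# Q comes from the corner features and K, V from the text features.
import math
import torch
import torch.nn as nn

class CornerGuidedAttention(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # projects corner features to queries
        self.k_proj = nn.Linear(dim, dim)  # projects text features to keys
        self.v_proj = nn.Linear(dim, dim)  # projects text features to values

    def forward(self, corner_feats: torch.Tensor, text_feats: torch.Tensor):
        q = self.q_proj(corner_feats)  # (batch, Lc, dim)
        k = self.k_proj(text_feats)    # (batch, Lt, dim)
        v = self.v_proj(text_feats)    # (batch, Lt, dim)
        d_k = q.size(-1)
        # Attention weight matrix computed from Q and K ...
        weights = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(d_k), dim=-1)
        # ... then used to weight the value features V.
        return weights @ v, weights
```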
Further, the second feature extraction module 340 provides the weighted text feature representation to the decoder model 350, and the decoder model 350 may generate the text recognition result 130 based on the weighted text feature representation. The second feature extraction module 340 and the decoder model 350 in fig. 3 correspond to the text recognition module 240 in fig. 2.
In some embodiments, the attention module 230 may also be included in a conformer-based model, the conformer being a variant of the transformer. Such an example embodiment is described below with reference to fig. 4. Fig. 4 illustrates a schematic diagram of an example of implementing text recognition using a conformer model, according to some embodiments of the present disclosure. As shown in fig. 4, the electronic device 120 may include N encoder models 330 and N decoder models 350, where N is a positive integer greater than or equal to 1.
The first convolution module 405 in fig. 4 corresponds to the corner extraction module 320 in fig. 3; the first convolution module 405 may be understood as the corner extraction module 320 implemented with a CNN. The corner feature representation 201 determined by the first convolution module 405 is provided to the attention module 230 in the encoder model 330 after text position information corresponding to each text unit in the text sequence is added to it.
The text position information is a position number, e.g., from 0 to N-1, for each text unit (e.g., word) in the text sequence. For example, if the text sequence is "A and B", the text position information corresponding to "A" is 0, that corresponding to "and" is 1, and that corresponding to "B" is 2.
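A minimal sketch of attaching such position numbers is shown below. The use of a learned embedding table (rather than, say, sinusoidal encodings) and the sizes are assumptions for illustration.

```python
# Hypothetical sketch: each position 0..N-1 receives a learned embedding that
# is added to the corresponding feature vector.
import torch
import torch.nn as nn

class AddPositionInfo(nn.Module):
    def __init__(self, max_len: int = 128, dim: int = 256):
        super().__init__()
        self.pos_embed = nn.Embedding(max_len, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, seq_len, dim); positions are 0, 1, ..., seq_len-1
        positions = torch.arange(feats.size(1), device=feats.device)
        return feats + self.pos_embed(positions)
```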
The second convolution module 410, the feedforward module 415, and the multi-headed attention module 420 in fig. 4 correspond to the first feature extraction module 220 in fig. 2 and 3, wherein the second convolution module 410 may be independent of the encoder model 330 (e.g., fig. 4) or may be disposed in the encoder model 330 (e.g., fig. 3).
The second convolution module 410 determines the text feature representation 202 based on the image 110; text position information corresponding to each text unit in the text sequence is added to the text feature representation 202 before it is provided to the feed-forward module 415. The text feature representation 202 with the added text position information is provided to the feed-forward module 415-1 and is also processed by the other modules in the encoder model 330 (i.e., the multi-head attention module 420, the attention module 230, the convolution module 425, the feed-forward module 415-2, and the normalization layer 430).
In some embodiments, the feed-forward modules 415 (415-1 and 415-2) may comprise a feed-forward neural network (FFN) employing one or more fully connected layers. For example, in an FFN comprising two fully connected layers, the activation function of the first layer is a ReLU activation function, and the second layer does not use an activation function. The feed-forward module 415 applies a spatial transformation to the input data, can mine non-linear relationships among the features, and can enhance the expressive power of the features.
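Such a two-layer FFN might look like the sketch below; the 4x hidden expansion is a common convention assumed here, not a value stated in this disclosure.

```python
# Sketch of the two-layer FFN described above: ReLU after the first fully
# connected layer, no activation after the second.
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, dim: int = 256, hidden: int = 1024):  # hidden width assumed
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),  # first layer, followed by ReLU
            nn.ReLU(),
            nn.Linear(hidden, dim),  # second layer, no activation
        )

    def forward(self, x):
        return self.net(x)
```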
The multi-headed attention module 420 may use multiple sets of query features Q ', key features K', and value features V 'to derive a feature matrix for the final text feature representation 202, where Q', K ', V' are different projections of the text feature representation 202.
The attention module 230 may be an attention module employing a cross-attention mechanism, and the attention module 230 may combine two feature matrices of the same dimension (corner feature representation 201 and text feature representation 202) and use the corner feature representation 201 as a query feature Q input and the text feature representation 202 as a key feature K and a value feature V input.
The convolution module 425, the feed-forward module 415-2, and the normalization layer 430 correspond to the second feature extraction module 340 in fig. 3. The encoder model 330 weights the text feature representation 202 with the attention weights to obtain a weighted text feature representation, and the convolution module 425 included in the encoder model 330 may perform a convolution process on the weighted text feature representation to determine a target text feature representation for the text sequence. After the target text feature representation is processed by the feed-forward module 415-2 and the normalization layer 430 to make it more stable, the encoder model 330 provides the target text feature representation to the decoder model 350.
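The convolution step applied to the weighted text feature representation could be sketched as a 1-D convolution over the feature sequence; the kernel size and channel-preserving shape are assumptions for illustration.

```python
# Hypothetical sketch of the convolution step (in the role of module 425):
# a 1-D convolution over the weighted text feature sequence.
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    def __init__(self, dim: int = 256, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); Conv1d expects (batch, dim, seq_len)
        return self.conv(x.transpose(1, 2)).transpose(1, 2)
```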
Decoder model 350 predicts each text unit in the text sequence one at a time, with previous output 435 referring to the text units (e.g., first and second text units) that have been previously predicted and output when predicting the current text unit (e.g., third text unit). The previous output 435 is processed by the character embedding module 440, embedded with the corresponding position information, and then input to the decoder model 350.
Each module in the decoder model 350 is followed by an Add & Norm layer 450, which may add a residual block X to the output of that module and normalize the result. The purpose of adding the residual block X is mainly to prevent degradation of the neural network used by each module in the decoder model 350.
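A minimal sketch of such an Add & Norm layer, assuming layer normalization as the normalization step:

```python
# Sketch of an Add & Norm layer: residual addition followed by normalization.
import torch
import torch.nn as nn

class AddAndNorm(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, sublayer_out: torch.Tensor) -> torch.Tensor:
        # Add the residual (the sublayer input x) to the sublayer output, then
        # normalize; the residual path helps prevent network degradation.
        return self.norm(x + sublayer_out)
```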
The masked multi-head attention module 445 may mask the text units subsequent to the currently decoded text unit, preventing the currently decoded text unit from seeing the text information of the subsequent text units. The masked multi-head attention module 445 provides the masked previous output 435, together with the target text feature representation output by the encoder model 330, to the multi-head attention module 455. The multi-head attention module 455 generates candidate recognition results for the text sequence from the target text feature representation, and these are spatially transformed by the feed-forward module 460 to yield an output feature vector.
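The masking itself is commonly realized with a causal (upper-triangular) mask, as in the sketch below; the helper names are hypothetical.

```python
# Sketch of causal masking: position i may not attend to positions after i.
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True marks positions that must be hidden from the current text unit.
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

def apply_mask(scores: torch.Tensor) -> torch.Tensor:
    # Blocked positions get -inf so that softmax assigns them zero weight.
    mask = causal_mask(scores.size(-1)).to(scores.device)
    return scores.masked_fill(mask, float("-inf"))
```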
After passing through the linear layer 465 and the SOFTMAX layer 470, the text recognition result 130 is obtained. The linear layer 465 is a simple fully connected neural network that projects the output feature vector produced by the decoder model 350 into a much larger vector called log probabilities (logits). The SOFTMAX layer 470 then scales the numbers in this vector into probability values in the range 0 to 1 that sum to 1. The text unit with the highest probability is selected as the output of this time step, yielding the text recognition result 130. Thus, text recognition of the image 110 based on the corner feature representation 201 and the text feature representation 202 of the image 110 can be implemented using the conformer model shown in fig. 4.
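The final prediction step can be sketched as follows; the vocabulary size and variable names are assumed placeholders.

```python
# Sketch of the final step: project to vocabulary-sized logits (linear layer),
# convert to probabilities summing to 1 (softmax), emit the argmax text unit.
import torch
import torch.nn as nn

vocab_size, dim = 6000, 256          # assumed placeholder sizes
linear = nn.Linear(dim, vocab_size)  # in the role of linear layer 465

def predict_next(decoder_out: torch.Tensor) -> torch.Tensor:
    # decoder_out: (batch, dim) for the current time step
    logits = linear(decoder_out)           # log probabilities (logits)
    probs = torch.softmax(logits, dim=-1)  # values in 0-1, summing to 1
    return probs.argmax(dim=-1)            # highest-probability text unit
```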
Example procedure
Fig. 5 illustrates a flow chart of a process for text recognition according to some embodiments of the present disclosure. The process 500 may be implemented at the electronic device 120. For ease of discussion, the process 500 will be described with reference to the environment 100 of FIG. 1.
At block 510, the electronic device 120 determines a corner feature representation of the image by performing corner detection on the image to be identified, the corner feature representation characterizing at least one corner in the image, the image comprising a text sequence.
At block 520, the electronic device 120 extracts a text feature representation for the text sequence from the image.
At block 530, the electronic device 120 determines an attention weight for the text feature representation based on the corner feature representation.
At block 540, the electronic device 120 generates a recognition result for the text sequence based on the text feature representation and the attention weight.
In some embodiments, the text feature representation includes a plurality of feature elements for the text sequence, and the attention weight includes a weight value for each of the plurality of feature elements.
In some embodiments, determining the attention weight for the text feature representation includes: the attention weight is determined based on the corner feature representation using an attention module included in the encoder model.
In some embodiments, the attention module comprises a converter-based attention module, and wherein the corner feature representation is defined as a query feature of the converter-based attention module and the text feature representation is defined as a key feature and a value feature of the converter-based attention module.
In some embodiments, determining the corner feature representation includes: determining a corner map corresponding to the image by performing corner detection on the image, the corner map identifying a position of at least one corner in the image; and extracting the corner feature representation from the corner map.
In some embodiments, generating the recognition result for the text sequence includes: weighting the text feature representation with the attention weight to obtain a weighted text feature representation; determining a target text feature representation for the text sequence by performing a convolution process on the weighted text feature representation using a convolution module included in the encoder model; and generating, using the decoder model, recognition results for the text sequence from the target text feature representation.
In some embodiments, the text sequence includes at least one artistic word.
Example apparatus and apparatus
Fig. 6 illustrates a block diagram of an apparatus 600 for text recognition according to some embodiments of the present disclosure. The apparatus 600 may be implemented or included in the electronic device 120, for example. The various modules/components in apparatus 600 may be implemented in hardware, software, firmware, or any combination thereof.
As shown, the apparatus 600 comprises a corner detection module 610 configured to determine a corner feature representation of an image by performing corner detection on the image to be identified, the corner feature representation characterizing at least one corner in the image, the image comprising a text sequence. The apparatus 600 further comprises a text feature extraction module 620 configured to extract a text feature representation for the text sequence from the image. The apparatus 600 further comprises a weight determination module 630 configured to determine an attention weight for the text feature representation based on the corner feature representation. The apparatus 600 further comprises a result generation module 640 configured to generate a recognition result for the text sequence based on the text feature representation and the attention weight.
In some embodiments, the text feature representation includes a plurality of feature elements for the text sequence, and the attention weight includes a weight value for each of the plurality of feature elements.
In some embodiments, the weight determination module 630 is further configured to: the attention weight is determined based on the corner feature representation using an attention module included in the encoder model.
In some embodiments, the attention module comprises a converter-based attention module, and wherein the corner feature representation is defined as a query feature of the converter-based attention module and the text feature representation is defined as a key feature and a value feature of the converter-based attention module.
In some embodiments, the corner detection module 610 includes: a corner map determination module configured to determine a corner map corresponding to the image by performing corner detection on the image, the corner map identifying a position of at least one corner in the image; and a corner feature extraction module configured to extract the corner feature representation from the corner map.
In some embodiments, the result generation module 640 includes: a weighted feature acquisition module configured to weight the text feature representation with the attention weight to obtain a weighted text feature representation; a target feature determination module configured to determine a target text feature representation for the text sequence by performing a convolution process on the weighted text feature representation using a convolution module included in the encoder model; and a recognition result generation module configured to generate a recognition result for the text sequence from the target text feature representation using the decoder model.
In some embodiments, the text sequence includes at least one artistic word.
Fig. 7 illustrates a block diagram of an electronic device 700 in which one or more embodiments of the disclosure may be implemented. It should be understood that the electronic device 700 illustrated in fig. 7 is merely exemplary and should not be construed as limiting the functionality and scope of the embodiments described herein. The electronic device 700 shown in fig. 7 may be used to implement the electronic device 120 of fig. 1.
As shown in fig. 7, the electronic device 700 is in the form of a general-purpose electronic device. Components of electronic device 700 may include, but are not limited to, one or more processors or processing units 710, memory 720, storage 730, one or more communication units 740, one or more input devices 750, and one or more output devices 760. The processing unit 710 may be an actual or virtual processor and is capable of performing various processes according to programs stored in the memory 720. In a multiprocessor system, multiple processing units execute computer-executable instructions in parallel to improve the parallel processing capabilities of electronic device 700.
Electronic device 700 typically includes a number of computer storage media. Such media may be any available media that are accessible by electronic device 700, including, but not limited to, volatile and non-volatile media, removable and non-removable media. The memory 720 may be volatile memory (e.g., registers, cache, Random Access Memory (RAM)), non-volatile memory (e.g., Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory), or some combination thereof. Storage device 730 may be a removable or non-removable medium and may include machine-readable media such as flash drives, magnetic disks, or any other media that may be capable of storing information and/or data (e.g., training data for training) and that may be accessed within electronic device 700.
The electronic device 700 may further include additional removable/non-removable, volatile/nonvolatile storage media. Although not shown in fig. 7, a magnetic disk drive for reading from or writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk may be provided. In these cases, each drive may be connected to a bus (not shown) by one or more data medium interfaces. Memory 720 may include a computer program product 725 having one or more program modules configured to perform the various methods or acts of the various embodiments of the disclosure.
The communication unit 740 enables communication with other electronic devices through a communication medium. Additionally, the functionality of the components of the electronic device 700 may be implemented in a single computing cluster or in multiple computing machines capable of communicating over a communication connection. Thus, the electronic device 700 may operate in a networked environment using logical connections to one or more other servers, a network Personal Computer (PC), or another network node.
The input device 750 may be one or more input devices such as a mouse, keyboard, trackball, etc. The output device 760 may be one or more output devices such as a display, speakers, printer, etc. The electronic device 700 may also communicate with one or more external devices (not shown), such as storage devices, display devices, etc., through the communication unit 740, with one or more devices that enable a user to interact with the electronic device 700, or with any device (e.g., network card, modem, etc.) that enables the electronic device 700 to communicate with one or more other electronic devices, as desired. Such communication may be performed via an input/output (I/O) interface (not shown).
According to an exemplary embodiment of the present disclosure, a computer-readable storage medium having stored thereon computer-executable instructions, wherein the computer-executable instructions are executed by a processor to implement the method described above, is provided. According to an exemplary embodiment of the present disclosure, there is also provided a computer program product tangibly stored on a non-transitory computer-readable medium and comprising computer-executable instructions that are executed by a processor to implement the method described above.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus, devices, and computer program products implemented according to the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various implementations of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of implementations of the present disclosure has been provided for illustrative purposes, is not exhaustive, and is not limited to the implementations disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various implementations described. The terminology used herein was chosen in order to best explain the principles of each implementation, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the various embodiments disclosed herein.

Claims (16)

1. A method for text recognition, comprising:
determining corner feature representations of an image to be identified by performing corner detection on the image, wherein the corner feature representations represent at least one corner in the image, and the image comprises a text sequence;
extracting a text feature representation for the text sequence from the image;
determining an attention weight for the text feature representation based on the corner feature representation; and
a recognition result for the text sequence is generated based on the text feature representation and the attention weight.
2. The method of claim 1, wherein the text feature representation comprises a plurality of feature elements for the text sequence, the attention weight comprising a weight value for each of the plurality of feature elements.
3. The method of claim 1, wherein determining the attention weight for the text feature representation comprises:
the attention weight is determined based on the corner feature representation with an attention module included in the encoder model.
4. A method according to claim 3, wherein the attention module comprises a converter-based attention module, and wherein the corner feature representation is defined as a query feature of the converter-based attention module and the text feature representation is defined as a key feature and a value feature of the converter-based attention module.
5. The method of claim 1, wherein determining the corner feature representation comprises:
determining a corner map corresponding to the image by performing corner detection on the image, wherein the corner map identifies the position of at least one corner in the image; and
extracting the corner feature representation from the corner map.
6. The method of claim 1, wherein generating a recognition result for the text sequence comprises:
weighting the text feature representation with the attention weight to obtain a weighted text feature representation;
determining a target text feature representation for the text sequence by performing a convolution process on the weighted text feature representation using a convolution module included in the encoder model; and
generating a recognition result for the text sequence from the target text feature representation using a decoder model.
7. The method of claim 1, wherein the text sequence includes at least one artistic word.
8. An apparatus for text recognition, comprising:
a corner detection module configured to determine a corner feature representation of an image to be identified by performing corner detection on the image, the corner feature representation characterizing at least one corner in the image, the image comprising a text sequence;
a text feature extraction module configured to extract a text feature representation for the text sequence from the image;
a weight determination module configured to determine an attention weight for the text feature representation based on the corner feature representation; and
a result generation module configured to generate a recognition result for the text sequence based on the text feature representation and the attention weight.
9. The apparatus of claim 8, wherein the text feature representation comprises a plurality of feature elements for the text sequence, the attention weight comprising a weight value for each of the plurality of feature elements.
10. The apparatus of claim 8, wherein the weight determination module is further configured to:
the attention weight is determined based on the corner feature representation with an attention module included in the encoder model.
11. The apparatus of claim 10, wherein the attention module comprises a converter-based attention module, and wherein the corner feature representation is defined as a query feature of the converter-based attention module and the text feature representation is defined as a key feature and a value feature of the converter-based attention module.
12. The apparatus of claim 8, wherein the corner detection module comprises:
a corner map determining module configured to determine a corner map corresponding to the image by performing corner detection on the image, the corner map identifying a location of at least one corner in the image; and
a corner feature extraction module configured to extract the corner feature representation from the corner map.
13. The apparatus of claim 8, wherein the result generation module comprises:
a weighted feature acquisition module configured to weight the text feature representation with the attention weight to obtain a weighted text feature representation;
a target feature determination module configured to determine a target text feature representation for the text sequence by performing a convolution process on the weighted text feature representation using a convolution module included in the encoder model; and
a recognition result generation module configured to generate a recognition result for the text sequence from the target text feature representation using a decoder model.
14. The apparatus of claim 8, wherein the text sequence comprises at least one artistic word.
15. An electronic device, comprising:
at least one processing unit; and
at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, which, when executed by the at least one processing unit, cause the electronic device to perform the method of any one of claims 1 to 7.
16. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method according to any of claims 1 to 7.
CN202310118841.1A 2023-02-03 2023-02-03 Method, apparatus, device and medium for text recognition Pending CN116189208A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310118841.1A CN116189208A (en) 2023-02-03 2023-02-03 Method, apparatus, device and medium for text recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310118841.1A CN116189208A (en) 2023-02-03 2023-02-03 Method, apparatus, device and medium for text recognition

Publications (1)

Publication Number Publication Date
CN116189208A

Family

ID=86450175

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310118841.1A Pending CN116189208A (en) 2023-02-03 2023-02-03 Method, apparatus, device and medium for text recognition

Country Status (1)

Country Link
CN (1) CN116189208A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117292384A (en) * 2023-08-30 2023-12-26 北京瑞莱智慧科技有限公司 Character recognition method, related device and storage medium

Similar Documents

Publication Publication Date Title
WO2023051140A1 (en) Method for generating feature representation of image, device, apparatus, and medium
CN113837370B (en) Method and apparatus for training a model based on contrast learning
US11462039B2 (en) Method, device, and storage medium for obtaining document layout
US20210099310A1 (en) Image processing method, image matching method, device and storage medium
CN113435594B (en) Security detection model training method, device, equipment and storage medium
CN116308754B (en) Bank credit risk early warning system and method thereof
CN116188941A (en) Manifold regularized width learning method and system based on relaxation annotation
CN116189208A (en) Method, apparatus, device and medium for text recognition
Hu et al. Attention‐guided evolutionary attack with elastic‐net regularization on face recognition
Nie et al. Multi-label image recognition with attentive transformer-localizer module
CN117197292A (en) Method, apparatus, device and storage medium for generating image
US20240152749A1 (en) Continual learning neural network system training for classification type tasks
CN116433474A (en) Model training method, font migration device and medium
CN115511104A (en) Method, apparatus, device and medium for training a contrast learning model
CN115620315A (en) Handwritten text detection method, device, server and storage medium
CN113487027B (en) Sequence distance measurement method based on time sequence alignment prediction, storage medium and chip
CN115424267A (en) Rotating target detection method and device based on Gaussian distribution
CN115049546A (en) Sample data processing method and device, electronic equipment and storage medium
CN111767710B (en) Indonesia emotion classification method, device, equipment and medium
Chen et al. Robust Semi‐Supervised Manifold Learning Algorithm for Classification
US20240185578A1 (en) Image encoding learning and application
US20240144664A1 (en) Multimodal data processing
WO2022133876A1 (en) Dynamic conditional pooling for neural network processing
Kumar et al. Generative Adversarial Network for Hand-Writing Detection
US20210383226A1 (en) Cross-transformer neural network system for few-shot similarity determination and classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination