CN114863437A

CN114863437A - Text recognition method and device, electronic equipment and storage medium

Info

Publication number: CN114863437A
Application number: CN202210425713.7A
Authority: CN
Inventors: 张晓强; 黄聚; 钦夏孟; 章成全; 姚锟
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2022-04-21
Filing date: 2022-04-21
Publication date: 2022-08-05
Anticipated expiration: 2042-04-21
Also published as: CN114863437B

Abstract

The application discloses a text recognition method, a text recognition device, electronic equipment and a storage medium, relates to the technical field of artificial intelligence, specifically to the technical field of deep learning, image processing and computer vision, and can be applied to scenes such as OCR (optical character recognition). The specific scheme is as follows: acquiring a text image to be recognized; performing feature extraction on the text image to be recognized to acquire image features of the text image to be recognized; extracting each text instance in the text image to be recognized according to the image features and the preset text instance segmentation vectors, and determining attention features corresponding to the text instances; and decoding the attention characteristics corresponding to the text examples to generate the recognition results corresponding to the text examples. According to the method, the text examples are divided into vectors by the aid of the text examples, the text examples are corresponding and distinguished to obtain the attention characteristics at the example level, the recognition results at the example level are obtained according to the attention characteristics at the example level, complex manual post-processing is not needed, and accuracy of the text recognition results in natural scenes is improved.

Description

Text recognition method and device, electronic equipment and storage medium

Technical Field

The application relates to the technical field of artificial intelligence, in particular to the technical field of deep learning, image processing and computer vision, can be applied to scenes such as Optical Character Recognition (OCR) and the like, and particularly relates to a text Recognition method and device, electronic equipment and a storage medium.

Background

Natural scene text recognition refers to recognizing text images of natural scenes. The text in the natural scene has more complicated characteristics such as illumination, angle, shielding, font and background interference, and the text recognition effect is influenced.

Therefore, how to improve the following recognition effect in a natural scene is a problem to be solved urgently.

Disclosure of Invention

The application provides a text recognition method and device, electronic equipment and a storage medium. The specific scheme is as follows:

according to an aspect of the present application, there is provided a text recognition method including:

acquiring a text image to be recognized;

performing feature extraction on the text image to be recognized to acquire image features of the text image to be recognized;

extracting each text instance in the text image to be recognized according to the image features and the preset text instance segmentation vectors, and determining attention features corresponding to the text instances;

and decoding the attention characteristics corresponding to the text examples to generate the recognition results corresponding to the text examples.

According to another aspect of the present application, there is provided a text recognition apparatus including:

the first acquisition module is used for acquiring a text image to be recognized;

the second acquisition module is used for extracting the features of the text image to be recognized so as to acquire the image features of the text image to be recognized;

the determining module is used for extracting each text example in the text image to be recognized according to the image characteristics and the preset text example segmentation vector and determining the attention characteristics corresponding to each text example;

and the generating module is used for decoding the attention characteristics corresponding to the text examples to generate the recognition results corresponding to the text examples.

According to another aspect of the present application, there is provided an electronic device including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the above embodiments.

According to another aspect of the present application, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method according to the above-described embodiments.

According to another aspect of the present application, a computer program product is provided, comprising a computer program which, when being executed by a processor, carries out the steps of the method of the above-mentioned embodiment.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:

fig. 1 is a schematic flowchart of a text recognition method according to an embodiment of the present application;

fig. 2 is a schematic flowchart of a text recognition method according to another embodiment of the present application;

fig. 3 is a schematic flowchart of a text recognition method according to another embodiment of the present application;

fig. 4 is a schematic flowchart of a text recognition method according to another embodiment of the present application;

fig. 5 is a schematic flowchart of a text recognition method according to another embodiment of the present application;

FIG. 6 is a process diagram of text recognition provided in an embodiment of the present application;

fig. 7 is a schematic structural diagram of a text recognition apparatus according to an embodiment of the present application;

fig. 8 is a block diagram of an electronic device for implementing a text recognition method according to an embodiment of the present application.

Detailed Description

The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

A text recognition method, an apparatus, an electronic device, and a storage medium according to embodiments of the present application are described below with reference to the accompanying drawings.

Artificial intelligence is the subject of research on the use of computers to simulate certain mental processes and intelligent behaviors (such as learning, reasoning, thinking, planning, etc.) of humans, both in the hardware and software domain. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; the artificial intelligence software technology comprises a computer vision technology, a voice recognition technology, a natural language processing technology, deep learning, a big data processing technology, a knowledge map technology and the like.

Deep learning is a new research direction in the field of machine learning. Deep learning is the intrinsic law and expression level of the learning sample data, and the information obtained in the learning process is very helpful for the interpretation of data such as characters, images and sounds. The final aim of the method is to enable the machine to have the analysis and learning capability like a human, and to recognize data such as characters, images and sounds.

Image processing is a technique of analyzing an image with a computer to achieve a desired result, and is also called image processing. Image processing generally refers to the processing of digital images, which refers to a large two-dimensional array obtained by shooting with equipment such as industrial cameras, video cameras, scanners, etc., the elements of the array are called pixels, and the values are called gray values.

Computer vision is a science for researching how to make a machine "see", and means that a camera and a computer are used to replace human eyes to perform machine vision such as identification, tracking and measurement on a target, and further image processing is performed, so that the computer processing becomes an image more suitable for human eyes to observe or transmitted to an instrument to detect.

Fig. 1 is a schematic flowchart of a text recognition method according to an embodiment of the present application.

The text recognition method can be executed by the text recognition device in the embodiment of the application, and the device can be configured in electronic equipment to realize the correspondence and the distinction of the text examples in the text image to be recognized through the text example segmentation vector, decode the attention characteristics corresponding to the text examples, generate the recognition results corresponding to the text examples, does not need complex manual post-processing, and can effectively improve the accuracy of the recognition results in natural scenes.

The electronic device may be any device with computing capability, for example, a personal computer, a mobile terminal, a server, and the like, and the mobile terminal may be a hardware device with various operating systems, touch screens, and/or display screens, such as an in-vehicle device, a mobile phone, a tablet computer, a personal digital assistant, a wearable device, and the like.

As shown in fig. 1, the text recognition method includes:

step 101, acquiring a text image to be recognized.

In the application, the text image to be recognized may be an image containing characters, a text image obtained by real-time shooting in a natural scene, a text image shot in advance, or a frame image in a video.

And 102, performing feature extraction on the text image to be recognized to acquire image features of the text image to be recognized.

In the present application, the text image recognition model may be obtained through pre-training, and the recognition granularity may be determined as needed, for example, recognition may be performed according to a word level, or recognition may be performed according to a sentence level.

In the process of acquiring the image to be recognized, the text image to be recognized can be input to a feature coding module of the text image recognition model for image feature extraction, so as to acquire the image features of the text image to be recognized.

When implemented, the feature coding module may adopt a CNN (Convolutional Neural Networks), or a transform coder, or a hybrid structure of the two.

And 103, extracting each text instance in the text image to be recognized according to the image features and the preset text instance segmentation vectors, and determining attention features corresponding to each text instance.

In the application, the preset text instance segmentation vector can be regarded as a parameter in the text image recognition model, and the text instance segmentation vector can learn to obtain the characteristics of each text instance in the text image to be recognized.

The text instance segmentation vector may include a plurality of line vectors, and each line vector corresponds to one text instance. For example, the text instance segmentation vector includes N line vectors of D dimensions, where N and D are positive integers, N text instances may be extracted from the text image to be recognized, and one word may be used as one text instance, or one sentence may be used as one text instance, which corresponds to the recognition granularity of the text image recognition model.

In the application, each text instance in the text image to be recognized can be extracted from the text image to be recognized by using the text instance segmentation vector, and the attention feature corresponding to each text instance is determined.

Wherein each text instance has corresponding attention characteristics, for example, the text instance segmentation vector includes N line vectors, then N text instances may be extracted from the text image to be recognized, and the attention characteristics corresponding to each text instance may be determined.

In the application, the value of N in the text instance segmentation vector may be determined according to the application scenario, and may be the maximum value of the number of text instances in the application scenario.

And 104, decoding the attention characteristics corresponding to the text examples to generate recognition results corresponding to the text examples.

In the application, the attention feature corresponding to each text instance may be decoded to generate a recognition result corresponding to each text instance. Thus, according to the attention characteristics of the example level, the identification result of the example level can be obtained.

In the embodiment of the application, a text image to be identified is obtained; performing feature extraction on the text image to be recognized to acquire image features of the text image to be recognized; extracting each text instance in the text image to be recognized according to the image features and the preset text instance segmentation vectors, and determining attention features corresponding to the text instances; and decoding the attention characteristics corresponding to the text examples to generate the recognition results corresponding to the text examples. Therefore, the text examples are divided into vectors by the aid of the text examples, the text examples are corresponding and distinguished to obtain the attention characteristics at the example level, the recognition results at the example level are obtained according to the attention characteristics at the example level, complex manual post-processing is not needed, and accuracy of the text recognition results in natural scenes is effectively improved.

In an embodiment of the present application, one or more predictions may be performed by using the attention characteristics corresponding to each text instance, so as to obtain one or more prediction results at an instance level, which is described below with reference to fig. 2. Fig. 2 is a schematic flowchart of a text recognition method according to another embodiment of the present application.

As shown in fig. 2, the text recognition method includes:

step 201, acquiring a text image to be recognized.

Step 202, performing feature extraction on the text image to be recognized to obtain image features of the text image to be recognized.

And 203, extracting each text instance in the text image to be recognized according to the image features and the preset text instance segmentation vector, and determining the attention features corresponding to each text instance.

In the present application, steps 201 to 203 can refer to the related contents in the above embodiments, and therefore are not described herein again.

Step 204, the attention features corresponding to the text examples are respectively input to one or more of the detection network, the identification network and the classification network for decoding, so as to obtain the identification results corresponding to the text examples.

After obtaining the attention feature corresponding to each text instance, the attention feature corresponding to each text instance may be input to a detection network for decoding, so as to obtain a detection result corresponding to each text instance, for example, obtain a detection bounding box corresponding to each text instance.

Or the attention characteristics corresponding to each text instance can be input into the recognition network to obtain the text recognition result corresponding to each text instance.

Or, the attention features corresponding to the text instances may also be input to the classification network to obtain a classification result corresponding to each text instance, for example, whether the classification result corresponding to each text instance is a text or not.

Alternatively, the attention feature corresponding to each text instance may be input into multiple ones of the detection network, the recognition network, and the classification network, and decoded respectively to obtain a corresponding recognition result for each text instance.

For example, the attention features corresponding to the text instances are input to the detection network, the recognition network and the classification network in parallel for decoding, so that the detection result, the recognition result and the classification result corresponding to each text instance can be obtained. Therefore, the text examples in the text image to be recognized can be in one-to-one correspondence and distinguished through the text example segmentation vectors, and the classification result, the detection result and the text recognition result of the example level are predicted in parallel, so that the text recognition result does not need to depend on the detection result and complex manual post-processing, and the accuracy of the text recognition result in a natural scene is improved.

In the embodiment of the present application, when the attention features corresponding to the text instances are decoded to generate the recognition results corresponding to the text instances, the attention features corresponding to the text instances may be respectively input to one or more of the detection network, the recognition network, and the classification network for decoding, so as to obtain the recognition results corresponding to the text instances. Therefore, the attention characteristics corresponding to each text instance can be input into one or more of the detection network, the identification network and the classification network for decoding, diversified identification requirements are met, and if the attention characteristics are input into the multiple networks for decoding, the detection result, the text identification result and the classification result at the instance level can be predicted in parallel, so that the text identification result does not need to depend on the detection result, and the accuracy of the text identification result in a natural scene is improved.

In an embodiment of the present application, when determining attention features corresponding to text instances, the extracted image features may be used to decode to obtain text features at an instance level, and the image features and the text features at the instance level are fused to obtain the attention features at the instance level, which is described below with reference to fig. 3, where fig. 3 is a flowchart of a text recognition method provided in another embodiment of the present application.

As shown in fig. 3, the text recognition method includes:

step 301, acquiring a text image to be recognized.

Step 302, performing feature extraction on the text image to be recognized to obtain image features of the text image to be recognized.

In the present application, steps 301 to 302 can refer to the related contents in the above embodiments, and therefore are not described herein again.

Step 303, decoding the image features and the text example segmentation vectors to extract each text example in the text image to be recognized, and obtaining text features corresponding to each text example.

In the application, the image features and the text instance segmentation vectors can be input into a decoding module of a text image recognition network for decoding, so that each text instance in the text image to be recognized is extracted from the image features by using the text instance segmentation vectors, and the text features corresponding to each text instance are obtained.

In implementation, the decoding module may include a self-attention layer and a cross-attention layer, and may input the text instance segmentation vector to the self-attention layer for decoding to obtain an intermediate feature corresponding to the text instance segmentation vector, and input the intermediate feature and the image feature to the cross-attention layer for decoding to extract each text instance in the text image to be recognized, and obtain a text feature corresponding to each text instance. Therefore, the self-attention layer and the cross-attention layer are utilized to decode the image features and the text example segmentation vectors, and the accuracy of the text features is improved.

For example, if the text instance segmentation vector includes N line vectors, the text feature corresponding to each of the N text instances may be obtained.

When implemented, the decoding module may employ a transform decoder.

And step 304, fusing the image features and the text features corresponding to the text instances to determine the attention features corresponding to the text instances.

After the text features corresponding to the text instances are obtained, the text features corresponding to each text instance and the image features may be subjected to element-by-element dot multiplication to determine the attention features corresponding to each text instance.

Because the larger the scale of the image feature is, the stronger the representation capability of the contained geometric detail information is, if the image feature is a multi-scale image feature, the image feature with the largest scale and the text feature corresponding to each text instance may be subjected to dot multiplication to determine the attention feature corresponding to each text instance. Therefore, the characteristics are refined, and the accuracy of the identification result is improved.

Step 305, decoding the attention features corresponding to the text examples to generate recognition results corresponding to the text examples.

In the present application, step 305 may refer to the related contents in the above embodiments, and therefore, the details are not described herein again.

In the embodiment of the application, when each text instance in the text image to be recognized is extracted according to the image features and the preset text instance segmentation vectors and the attention features corresponding to the text instances are determined, the image features and the text instance segmentation vectors can be decoded to obtain the text instances and the text features corresponding to the text instances, and the image features and the text features corresponding to the text instances are fused to determine the attention features corresponding to the text instances. Therefore, the attention feature at the instance level can be obtained by fusing the image feature and the text feature at the instance level, and the identification result at the instance level is obtained based on the attention feature at the instance level, so that the accuracy of the identification result is improved.

In an embodiment of the present application, the image features obtained by feature extraction of the text image to be recognized may be multi-scale image features, and if large-scale image features are used for decoding, the amount of calculation is large, which affects the recognition efficiency, and based on this, the text recognition method shown in fig. 4 may be adopted. Fig. 4 is a flowchart illustrating a text recognition method according to another embodiment of the present application.

As shown in fig. 4, the text recognition method includes:

step 401, acquiring a text image to be recognized.

And 402, performing feature extraction on the text image to be recognized to obtain multi-scale image features of the text image to be recognized.

In the present application, steps 401 to 402 can refer to the related contents in the above embodiments, and therefore are not described herein again.

In step 403, image features with a scale smaller than a first threshold are obtained from the multi-scale image features.

In the present application, the extracted image features are multi-scale image features, that is, image features of different scales. The image features with small scale have strong semantic information representation capability, but the geometric information representation capability is weak; the image features with large scale have strong geometric detail information representation capability, but have weak semantic information representation capability.

In order to reduce the amount of calculation, in the present application, the scale of each image feature may be compared with the first threshold to obtain an image feature with a scale smaller than the first threshold.

It should be noted that the first threshold may be determined according to actual needs, for example, the first threshold may be a maximum scale in the multi-scale image feature, or may also be other values, which is not limited in this application.

Step 404, decoding the image features and the text instance segmentation vectors smaller than the first threshold value to extract each text instance in the text image to be recognized, and obtaining text features corresponding to each text instance.

In order to reduce the amount of calculation, in the present application, the text instance segmentation vector may be input to a self-attention layer in a decoding module for decoding, so as to obtain intermediate features corresponding to the text instance segmentation vector, and the intermediate features and the image features having a scale smaller than a first threshold are input to a cross-attention layer in the decoding module for decoding, so as to extract each text instance in the text image to be recognized, and obtain text features corresponding to each text instance.

And step 405, fusing the multi-scale image features and the text features corresponding to the text examples to determine the attention features corresponding to the text examples.

After the text features corresponding to each text instance are obtained, the image features of which the scale is larger than the first threshold in the multi-scale image features and the text features corresponding to each text instance can be fused, and the image features of which the scale is the largest in the multi-scale image features and the text features corresponding to each text instance can also be fused to obtain the attention features corresponding to each text instance.

And 406, decoding the attention features corresponding to the text instances to generate recognition results corresponding to the text instances.

In the present application, step 406 may refer to the related contents recorded in the above embodiments, and therefore, the description thereof is omitted here.

In the embodiment of the application, the image features may be multi-scale image features, when the text features corresponding to the text instances are obtained, the image features of which the scale is smaller than the first threshold in the multi-scale image features may be obtained, and the image features and the text instance segmentation vectors smaller than the first threshold are decoded to extract the text instances in the text image to be recognized, and the text features corresponding to the text instances are obtained. Therefore, when the image features are multi-scale image features, the image features corresponding to the text examples can be obtained by utilizing the image features with small scale in the multi-scale image features and the text example segmentation vectors, the calculation amount is reduced, and the recognition efficiency is improved.

In an embodiment of the present application, the image features obtained by feature extraction on the text image to be recognized may be multi-scale image features, and the text recognition method shown in fig. 5 may be adopted during recognition. Fig. 5 is a schematic flowchart of a text recognition method according to another embodiment of the present application.

As shown in fig. 5, the text recognition method includes:

step 501, obtaining a text image to be recognized.

Step 502, performing feature extraction on the text image to be recognized to obtain multi-scale image features of the text image to be recognized.

In the present application, steps 501 to 502 can refer to the related contents in the above embodiments, and therefore are not described herein again.

Step 503, decoding the multi-scale image features and the text example segmentation vectors to extract each text example in the text image to be recognized, and acquiring text features corresponding to each text example.

In the method and the device, the image features and the text instance segmentation vectors of all scales in the multi-scale image features can be decoded to extract each text instance in the text image to be recognized, and the text features corresponding to each text instance are obtained, so that the text features corresponding to the obtained text instances contain more information, and the recognition accuracy is improved. Or, decoding the image features and the text instance segmentation vectors of which the scale is greater than the second threshold in the multi-scale image features to extract each text instance in the text image to be recognized, and acquiring the text features corresponding to each text instance, so that the text features contain more geometric detail information.

The second threshold is greater than or equal to the first threshold, and the second threshold may be determined according to actual needs, which is not limited in the present application.

During decoding, a similar decoding method to that in the above embodiment may be adopted, and therefore, the description is omitted here.

And step 504, fusing the image features of which the scale is larger than the second threshold value in the multi-scale image features and the text features corresponding to the text examples to determine the attention features corresponding to the text examples.

In the application, the image features with the scale larger than the second threshold value in the multi-scale image features can be obtained, and the image features with the scale larger than the second threshold value and the text features corresponding to each text instance are subjected to element-by-element dot multiplication to determine the attention features corresponding to each text instance.

For example, the image feature with the scale larger than the second threshold in the multi-scale image features is the image feature with the maximum scale, and the image feature with the maximum scale and the text feature corresponding to each text instance can be fused to obtain the attention feature corresponding to each instance text.

And 505, decoding the attention characteristics corresponding to the text examples to generate recognition results corresponding to the text examples.

In the present application, step 505 can refer to the related contents in the above embodiments, and therefore, is not described herein again.

In this embodiment of the application, the image feature may be a multi-scale image feature, and when the image feature and the text feature corresponding to each text instance are fused to determine the attention feature corresponding to each text instance, the image feature of which the scale is greater than the second threshold in the multi-scale image feature and the text feature corresponding to each text instance may be fused to determine the attention feature corresponding to each text instance. Therefore, when the image features are multi-scale image features, the image features with larger scales in the multi-scale image features and the text features corresponding to the text examples can be fused, the features can be refined, the accuracy of the recognition result is improved, and the recognition effect is improved.

To further illustrate the above embodiments, the following description is made with reference to fig. 6, and fig. 6 is a schematic diagram of a text recognition process provided in an embodiment of the present application.

As shown in fig. 6, the text image to be recognized may be input to the feature encoding module to obtain the image features of the text image to be recognized. Wherein the image features are multi-scale image features; the feature coding module is based on CNN or Transformer coder or mixed network structure.

After obtaining the multi-scale image features, image features (which may be referred to as low-resolution image features) other than the maximum scale in the multi-scale image features and preset text instance segmentation vectors may be input to a decoding module for decoding, so as to obtain text features corresponding to each text instance. When the method is implemented, the decoding module may be a Transformer decoding module, the Transformer decoding module may include a self-attention layer and a cross-attention layer, the text instance segmentation vector may be input to the self-attention layer, the output from the attention layer and the low-resolution image feature may be input to the cross-attention layer, and the cross-attention layer outputs the text feature corresponding to each text instance.

Then, the text features of each text instance and the image features with the largest scale (which may be referred to as high-resolution image features) in the multi-scale image features may be subjected to dot multiplication to obtain the attention features of each text instance, and the attention features corresponding to each text instance are input into three task branches of the multi-task module for recognition, detection and classification to be decoded to obtain a text recognition result, a detection result and a classification result corresponding to each text instance.

The classification branch and the detection branch can both be formed by using simple convolution layers, wherein the detection branch can obtain a segmentation mask firstly and then obtain an outer enclosure frame through simple connected domain analysis, and the identification branch can encode the features into sequence vectors firstly and then send the sequence vectors into an identification head formed by a Transformer.

In the embodiment shown in fig. 6, the low-resolution image features and the text instance segmentation vectors in the extracted image features may be used for decoding to obtain text features corresponding to each text instance, the high-resolution image features and the text features corresponding to each text instance are subjected to dot multiplication to obtain attention features corresponding to each text instance, and the attention features corresponding to each text instance are respectively input to three task branches of recognition, detection, and classification to be decoded to obtain a text recognition result, a detection result, and a classification result corresponding to each instance.

The text recognition method can reduce the calculated amount, simultaneously can refine the features, improve the recognition accuracy, and can predict the text recognition result, the detection result and the classification result of the instance level in parallel by corresponding and distinguishing the text instances one by one through the text instance segmentation vectors, so that the recognition result does not depend on the detection result, complex manual post-processing is not needed, and the accuracy of the end-to-end recognition result of characters in any shape in a natural scene can be effectively improved.

In addition, in the training phase, the prediction result and the labeling result can be matched, and category loss, mask loss and text loss can be calculated, wherein the mask loss can comprise cross entropy loss and semantic segmentation loss of the two classes, and the text loss and the classification loss can use the cross entropy loss. Therefore, in the training stage, the tasks can be jointly optimized and trained, so that the accuracy of the model can be improved, and the accuracy of the recognition result can be improved.

In order to implement the foregoing embodiments, an embodiment of the present application further provides a text recognition apparatus. Fig. 7 is a schematic structural diagram of a text recognition apparatus according to an embodiment of the present application.

As shown in fig. 7, the text recognition apparatus 700 includes:

a first obtaining module 710, configured to obtain a text image to be recognized;

the second obtaining module 720 is configured to perform feature extraction on the text image to be recognized to obtain image features of the text image to be recognized;

the determining module 730 is configured to extract each text instance in the text image to be recognized according to the image features and the preset text instance segmentation vectors, and determine attention features corresponding to each text instance;

the generating module 740 is configured to decode the attention feature corresponding to each text instance to generate a recognition result corresponding to each text instance.

In a possible implementation manner of the embodiment of the present application, the generating module 740 is configured to:

and respectively inputting the attention characteristics corresponding to each text example into one or more of a detection network, an identification network and a classification network for decoding to generate an identification result corresponding to each text example, wherein the identification result comprises one or more of the detection result, the text identification result and the classification result.

In a possible implementation manner of this embodiment of the present application, the determining module 730 includes:

the acquiring unit is used for decoding the image characteristics and the text example segmentation vectors so as to extract each text example in the text image to be recognized and acquire the text characteristics corresponding to each text example;

and the fusion unit is used for fusing the image features and the text features corresponding to the text instances so as to determine the attention features corresponding to the text instances.

In a possible implementation manner of the embodiment of the present application, the obtaining unit is configured to:

inputting the text example segmentation vectors into a self-attention layer in a decoding module for decoding so as to obtain intermediate features corresponding to the text example segmentation vectors;

and inputting the intermediate features and the image features into a cross attention layer in a decoding module for decoding so as to extract each text example in the text image to be recognized and obtain the text features corresponding to each text example.

In a possible implementation manner of the embodiment of the present application, the image feature is a multi-scale image feature, and the obtaining unit is configured to:

acquiring image features of which the scale is smaller than a first threshold in the multi-scale image features;

and decoding the image features and the text example segmentation vectors which are smaller than the first threshold value so as to extract each text example in the text image to be recognized and acquire the text features corresponding to each text example.

In a possible implementation manner of the embodiment of the present application, the image feature is a multi-scale image feature, and the fusion unit is configured to:

and fusing image features with the scale larger than a second threshold value in the multi-scale image features and text features corresponding to the text instances to determine attention features corresponding to the text instances, wherein the second threshold value is larger than the first threshold value.

It should be noted that the explanation of the foregoing text recognition method embodiment is also applicable to the text recognition apparatus of this embodiment, and therefore will not be described herein again.

There is also provided, in accordance with an embodiment of the present application, an electronic device, a readable storage medium, and a computer program product.

FIG. 8 shows a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.

As shown in fig. 8, the device 800 includes a computing unit 801 that can perform various appropriate actions and processes in accordance with a computer program stored in a ROM (Read-Only Memory) 802 or a computer program loaded from a storage unit 808 into a RAM (Random Access Memory) 803. In the RAM803, various programs and data required for the operation of the device 800 can also be stored. The calculation unit 801, the ROM 802, and the RAM803 are connected to each other by a bus 804. An I/O (Input/Output) interface 805 is also connected to the bus 804.

A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

Computing unit 801 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing Unit 801 include, but are not limited to, a CPU (Central Processing Unit), a GPU (graphics Processing Unit), various dedicated AI (Artificial Intelligence) computing chips, various computing Units running machine learning model algorithms, a DSP (Digital Signal Processor), and any suitable Processor, controller, microcontroller, and the like. The calculation unit 801 executes the respective methods and processes described above, such as a text recognition method. For example, in some embodiments, the text recognition method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto device 800 via ROM 802 and/or communications unit 809. When loaded into RAM803 and executed by the computing unit 801, a computer program may perform one or more steps of the text recognition method described above. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the text recognition method in any other suitable manner (e.g., by means of firmware).

Various implementations of the systems and techniques described here above may be realized in digital electronic circuitry, Integrated circuitry, FPGAs (Field Programmable Gate arrays), ASICs (Application-Specific Integrated circuits), ASSPs (Application Specific Standard products), SOCs (System On Chip, System On a Chip), CPLDs (Complex Programmable Logic devices), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present application may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this application, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a RAM, a ROM, an EPROM (Electrically Programmable Read-Only-Memory) or flash Memory, an optical fiber, a CD-ROM (Compact Disc Read-Only-Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a Display device (e.g., a CRT (Cathode Ray Tube) or LCD (Liquid Crystal Display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: LAN (Local Area Network), WAN (Wide Area Network), internet, and blockchain Network.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The Server may be a cloud Server, which is also called a cloud computing Server or a cloud host, and is a host product in a cloud computing service system, so as to solve the defects of high management difficulty and weak service expansibility in a conventional physical host and a VPS (Virtual Private Server). The server may also be a server of a distributed system, or a server incorporating a blockchain.

According to an embodiment of the present application, there is also provided a computer program product, which when executed by an instruction processor in the computer program product, performs the text recognition method proposed by the above-mentioned embodiment of the present application.

It should be understood that various forms of the flows shown above, reordering, adding or deleting steps, may be used. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present application can be achieved, and the present invention is not limited herein.

The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A text recognition method, comprising:

acquiring a text image to be recognized;

performing feature extraction on the text image to be recognized to obtain image features of the text image to be recognized;

extracting each text instance in the text image to be recognized according to the image features and preset text instance segmentation vectors, and determining attention features corresponding to the text instances;

2. The method as claimed in claim 1, wherein said decoding the attention feature corresponding to each text instance to generate the recognition result corresponding to each text instance comprises:

and respectively inputting the attention characteristics corresponding to the text examples into one or more of a detection network, an identification network and a classification network for decoding to generate identification results corresponding to the text examples, wherein the identification results comprise one or more of the detection results, the text identification results and the classification results.

3. The method of claim 1, wherein the extracting each text instance in the text image to be recognized according to the image feature and a preset text instance segmentation vector and determining the attention feature corresponding to each text instance comprises:

decoding the image features and the text example segmentation vectors to extract each text example in the text image to be recognized and obtain text features corresponding to each text example;

and fusing the image features and the text features corresponding to the text instances to determine the attention features corresponding to the text instances.

4. The method of claim 3, wherein the decoding the image features and the text instance segmentation vectors to extract the text instances in the text image to be recognized and obtain the text features corresponding to the text instances comprises:

inputting the text instance segmentation vectors into a self-attention layer in a decoding module for decoding so as to obtain intermediate features corresponding to the text instance segmentation vectors;

and inputting the intermediate features and the image features into a cross attention layer in the decoding module for decoding so as to extract each text instance in the text image to be recognized and acquire text features corresponding to each text instance.

5. The method of claim 3, wherein the image features are multi-scale image features, and the decoding the image features and the text instance segmentation vectors to extract the text instances in the text image to be recognized and obtain the text features corresponding to the text instances comprises:

and decoding the image features smaller than the first threshold value and the text example segmentation vectors to extract each text example in the text image to be recognized and acquire text features corresponding to each text example.

6. The method of claim 3, wherein the image feature is a multi-scale image feature, and the fusing the image feature and the text feature corresponding to each text instance to determine the attention feature corresponding to each text instance comprises:

and fusing image features of which the scale is larger than a second threshold value in the multi-scale image features and the text features corresponding to the text instances to determine the attention features corresponding to the text instances, wherein the second threshold value is larger than the first threshold value.

7. A text recognition apparatus comprising:

the determining module is used for extracting each text example in the text image to be recognized according to the image characteristics and preset text example segmentation vectors and determining attention characteristics corresponding to each text example;

and the generating module is used for decoding the attention characteristics corresponding to the text examples to generate the identification results corresponding to the text examples.

8. The apparatus of claim 7, wherein the generating means is configured to:

9. The apparatus of claim 7, wherein the means for determining comprises:

the obtaining unit is used for decoding the image features and the text example segmentation vectors so as to extract each text example in the text image to be recognized and obtain text features corresponding to each text example;

and the fusion unit is used for fusing the image characteristics and the text characteristics corresponding to the text examples to determine the attention characteristics corresponding to the text examples.

10. The apparatus of claim 9, wherein the obtaining unit is configured to:

and inputting the intermediate features and the image features into a cross attention layer in the decoding module for decoding so as to extract each text instance in the text image to be recognized and obtain text features corresponding to each text instance.

11. The apparatus of claim 9, wherein the image feature is a multi-scale image feature, the obtaining unit to:

12. The apparatus of claim 9, wherein the image feature is a multi-scale image feature, and the fusion unit is configured to:

13. An electronic device, comprising:

at least one processor; and

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.

14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6.

15. A computer program product comprising a computer program which, when executed by a processor, carries out the steps of the method of any one of claims 1 to 6.