CN116343190B - Natural scene character recognition method, system, equipment and storage medium - Google Patents

Natural scene character recognition method, system, equipment and storage medium

Info

Publication number
CN116343190B
CN116343190B (application CN202310623773.4A)
Authority
CN
China
Prior art keywords
time step
character
vector
time
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310623773.4A
Other languages
Chinese (zh)
Other versions
CN116343190A (en)
Inventor
Zhang Yongdong (张勇东)
Wang Yuxin (王裕鑫)
Xie Hongtao (谢洪涛)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202310623773.4A priority Critical patent/CN116343190B/en
Publication of CN116343190A publication Critical patent/CN116343190A/en
Application granted granted Critical
Publication of CN116343190B publication Critical patent/CN116343190B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/63Scene text, e.g. street names
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19127Extracting features by transforming the feature space, e.g. multidimensional scaling; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173Classification techniques
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a natural scene character recognition method, system, device, and storage medium, which are mutually corresponding schemes. An image is encoded into a vector space so that local and global multi-granularity semantics are captured; a global vector is obtained through aggregation; and channel attention maps for different time steps are generated in parallel to decode the character information of each time step. Because a vector-to-sequence decoding mode is adopted, recognition speed is improved. Moreover, different characters share some feature expressions in the channel space (e.g., channels where all attention maps are strongly activated), while some channel weights carrying discriminative features retain certain differences; this ensures that the global vector can generate robust character feature expressions even under low-quality attention maps (e.g., a lack of attention on shared channel features does not affect the expression of discriminative channel features). The scheme provided by the invention can therefore recognize characters in natural scenes quickly and accurately.

Description

Natural scene character recognition method, system, equipment and storage medium
Technical Field
The present invention relates to the field of natural scene text recognition technologies, and in particular, to a natural scene text recognition method, system, device, and storage medium.
Background
Natural scene character recognition is a general-purpose character recognition technology that has become a hot research direction in computer vision and document analysis in recent years, and it is widely applied in fields such as automatic driving, license plate recognition, and assistance for visually impaired people. The goal of this task is to convert the text content in an image into editable text.
Since characters in natural scenes have characteristics such as low resolution, complex backgrounds, and susceptibility to noise interference, traditional character recognition technology cannot be directly applied to natural scenes. Character recognition in natural scenes therefore has great research significance.
With the development of deep learning in computer vision in recent years, scene character recognition methods have achieved good results. In the recognition process, as shown in fig. 1, the input image is first encoded into a sequence signal by a CNN (convolutional neural network); the sequence is then decoded into character information through an alignment structure realized by a sequence-to-sequence decoder, which may be an attention-based decoder or a CTC (Connectionist Temporal Classification) based decoder; the characters at the top of fig. 1 are examples. However, the sequence-to-sequence alignment structure is complex in design and cannot effectively balance the speed and robustness of the recognition process, so the speed and accuracy of scene character recognition still need improvement.
Disclosure of Invention
The invention aims to provide a natural scene character recognition method, a system, equipment and a storage medium, which can rapidly and accurately recognize characters of a natural scene.
The object of the invention is achieved through the following technical solutions:
a natural scene text recognition method comprises the following steps:
step 1, converting a natural scene image to be recognized into sequence information, and extracting multi-granularity visual feature vectors through a multi-layer Transformer module;
step 2, aggregating the multi-granularity visual feature vectors to obtain a global vector;
and step 3, generating a channel attention map for each time step in parallel by using the global vector, obtaining a character feature vector for each time step by combining with the global vector, and predicting the character of each time step by using the character feature vector of each time step.
A natural scene text recognition system, comprising:
the encoder is used for converting the natural scene image to be recognized into sequence information and extracting multi-granularity visual feature vectors through the multi-layer Transformer module;
the feature aggregation module is used for aggregating the multi-granularity visual feature vectors to obtain a global vector;
and the vector-to-sequence decoder is used for generating a channel attention map for each time step in parallel by using the global vector, obtaining a character feature vector for each time step by combining with the global vector, and predicting the character of each time step by using the character feature vector of each time step.
A processing apparatus, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the aforementioned methods.
A readable storage medium storing a computer program which, when executed by a processor, implements the method described above.
According to the technical scheme provided by the invention, images are encoded into a vector space so that local and global multi-granularity semantics are captured; a global vector is obtained through aggregation; and channel attention maps for different time steps are then generated in parallel to decode the character information of different time steps. Meanwhile, the invention simplifies the sequence-to-sequence decoding mode by adopting a vector-to-sequence decoding mode, thereby improving recognition speed.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a conventional scene text recognition method described in the background;
FIG. 2 is a schematic diagram of a natural scene text recognition method according to an embodiment of the present invention;
FIG. 3 is a flowchart of a natural scene text recognition method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a natural scene text recognition method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a visual channel attention map according to an embodiment of the present invention;
FIG. 6 is a diagram comparing the present invention with the conventional scene text recognition method according to the embodiment of the present invention;
FIG. 7 is a schematic diagram of a natural scene text recognition system according to an embodiment of the present invention;
fig. 8 is a schematic diagram of a processing apparatus according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
The terms that may be used herein will first be described as follows:
the terms "comprises," "comprising," "includes," "including," "has," "having" or other similar referents are to be construed to cover a non-exclusive inclusion. For example: including a particular feature (e.g., a starting material, component, ingredient, carrier, formulation, material, dimension, part, means, mechanism, apparatus, step, procedure, method, reaction condition, processing condition, parameter, algorithm, signal, data, product or article of manufacture, etc.), should be construed as including not only a particular feature but also other features known in the art that are not explicitly recited.
The following describes the natural scene text recognition method, system, device, and storage medium in detail. Details not described in the embodiments of the present invention belong to the prior art known to those skilled in the art. Where specific conditions are not noted in the examples, they follow conditions conventional in the art or suggested by the manufacturer; reagents or apparatus whose manufacturer is not noted are conventional commercially available products.
Example 1
The embodiment of the present invention provides a natural scene text recognition method; fig. 2 illustrates its main principle. Unlike the existing scene text recognition method illustrated in fig. 1, the encoder of the present invention is an image-to-vector encoder, in which ViT is a Vision Transformer network responsible for encoding the input image into multi-granularity visual feature vectors, each of which carries local and global multi-granularity semantics; the ViT network named here is only an example. A global vector (not shown in fig. 2) is then obtained by aggregation, and the global vector is decoded in parallel by a vector-to-sequence decoder to obtain the character information of different time steps; the characters at the top of fig. 2 are examples. In the vector-to-sequence decoder, the present invention uses a channel attention approach to decode the character information of each time step from the global vector. Experimental results show that the method provided by the invention achieves state-of-the-art performance on scene character recognition tasks. The method is described in detail below: fig. 3 shows its main flow, fig. 4 shows the relevant framework, and the method mainly includes the following steps:
step 1, converting a natural scene image to be identified into sequence information, and extracting multi-granularity visual feature vectors through a multi-layer transducer module.
In the embodiment of the present invention, step 1 is implemented by an encoder, which may be the ViT network mentioned above.
As shown in fig. 4, the encoder mainly includes an embedding layer and a multi-layer Transformer module. The embedding layer is responsible for converting the natural scene image to be recognized into sequence information; for example, a natural scene image of size 128×32 is converted into sequence information of length 32. The multi-layer Transformer module is responsible for extracting multi-granularity visual feature vectors from the sequence information, where N in fig. 4 is the number of Transformer layers; for example, N=12 may be set.
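To make the encoder structure concrete, below is a minimal PyTorch sketch of an image-to-vector encoder consistent with this description: a 128×32 image is embedded into a length-32 token sequence and passed through N=12 Transformer layers. The patch size, channel width, head count, and all module names are illustrative assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn

class ImageToVectorEncoder(nn.Module):
    """Embedding layer + N-layer Transformer (a sketch).
    A 128x32 input image becomes a 32-token sequence."""
    def __init__(self, in_ch=3, dim=384, num_layers=12, num_heads=6):
        super().__init__()
        # Patch embedding: 16x8 patches on a 32x128 image give
        # (32/16) * (128/8) = 32 tokens (the patch size is an assumption).
        self.patch_embed = nn.Conv2d(in_ch, dim, kernel_size=(16, 8), stride=(16, 8))
        self.pos_embed = nn.Parameter(torch.zeros(1, 32, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, images):            # images: (B, 3, 32, 128)
        x = self.patch_embed(images)      # (B, dim, 2, 16)
        x = x.flatten(2).transpose(1, 2)  # (B, 32, dim) sequence information
        return self.transformer(x + self.pos_embed)  # multi-granularity features
```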
Step 2, aggregating the multi-granularity visual feature vectors to obtain a global vector.
In the embodiment of the invention, step 2 can be realized by a feature aggregation module. A global average pooling operation may be adopted for aggregation: the mean of all visual feature vectors is computed to obtain the global vector. The global vector is then fed into the vector-to-sequence decoder to decode the characters at different time steps.
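As a sketch under the same assumptions as the encoder above, the aggregation step reduces to a mean over the token axis of the encoder output:

```python
import torch

def aggregate(features: torch.Tensor) -> torch.Tensor:
    """Global average pooling: the mean of all visual feature
    vectors along the sequence axis yields the global vector V."""
    # features: (B, L, C) -> V: (B, C)
    return features.mean(dim=1)
```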
Step 3, generating a channel attention map for each time step in parallel by using the global vector, obtaining a character feature vector for each time step by combining with the global vector, and predicting the character of each time step by using the character feature vector of each time step.
In the embodiment of the invention, step 3 can be realized by a vector-to-sequence decoder, and the main process is as follows:
(1) Generating a channel attention map for each time step in parallel by using the global vector: corresponding time embedding information is generated for each time step; the time embedding information of each time step is introduced into the global vector through the first fully connected layer; and the channel attention map of each time step is then obtained sequentially through the second fully connected layer, the activation function, and the normalization layer.
The manner in which the channel attention map for a single time step is generated is expressed as:

$A_t = \mathrm{Softmax}\big(\sigma\big(\mathrm{FC}_2(\mathrm{FC}_1(V) + E_t)\big)\big)$

wherein $\mathrm{FC}_1$ represents the first fully connected layer, $\mathrm{FC}_2$ represents the second fully connected layer, $E_t$ represents the time embedding information corresponding to time step t, $A_t$ represents the channel attention map of time step t, V represents the global vector, $\sigma$ represents the activation function, and $\mathrm{Softmax}$ is the normalized exponential function, i.e., the normalization operation performed by the normalization layer.
(2) Combining the channel attention map of each time step with the global vector to obtain the character feature vector of each time step; specifically, the channel attention map of each time step may be element-wise multiplied with the global vector:

$F_t = A_t \odot V$

wherein $F_t$ is the character feature vector of time step t and $\odot$ denotes element-wise (point-wise) multiplication.
(3) Classifying the character feature vector of each time step through a fully connected classification layer to predict the character of each time step; the prediction is expressed as:

$y_t = \mathrm{FC}_{\mathrm{cls}}(F_t)$

wherein $y_t$ is the category to which the predicted character of time step t belongs and $\mathrm{FC}_{\mathrm{cls}}$ is the fully connected classification layer.
As shown in fig. 4, the parallel channel attention map calculation module in the vector-to-sequence decoder is mainly responsible for executing parts (1) and (2), and the fully connected classification layer is responsible for part (3). In fig. 4, 1×C, M×C, and M×K each denote a dimension: 1×C is the dimension of the global vector, C is the number of channels, M is the maximum number of characters (i.e., the maximum time step), and K is the number of character categories.
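Parts (1) to (3) can be sketched as one small PyTorch module. The composition below follows the order described above (first fully connected layer plus time embedding, then second fully connected layer, activation, and channel-wise softmax); the choice of ReLU as the activation and all hyperparameter values are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class VectorToSequenceDecoder(nn.Module):
    """Parallel channel-attention decoder (a sketch of parts (1)-(3)).
    M: maximum time steps (characters), C: channels, K: character classes."""
    def __init__(self, C=384, M=25, K=97):
        super().__init__()
        self.time_embed = nn.Parameter(torch.zeros(M, C))  # E_t, one per time step
        self.fc1 = nn.Linear(C, C)   # first fully connected layer
        self.fc2 = nn.Linear(C, C)   # second fully connected layer
        self.act = nn.ReLU()         # activation function (an assumption)
        self.cls = nn.Linear(C, K)   # fully connected classification layer

    def forward(self, V):            # V: (B, C) global vector
        h = self.fc1(V).unsqueeze(1) + self.time_embed    # (B, M, C)
        A = torch.softmax(self.act(self.fc2(h)), dim=-1)  # channel attention maps A_t
        F = A * V.unsqueeze(1)       # element-wise product: character features F_t
        return self.cls(F)           # (B, M, K) per-time-step class logits
```

Because all M time steps are computed in a single pass rather than step by step, this is where the speed advantage of vector-to-sequence decoding comes from.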
Fig. 5 shows an example of visualized channel attention maps. The vector-to-sequence decoder has the property of feature multiplexing: different character classes share some feature expressions in the channel space (e.g., channels where all attention maps are strongly activated), while some channel weights carrying discriminative features still differ to a certain extent. Feature multiplexing thus ensures that the global vector can generate robust character feature vectors even under low-quality attention maps (e.g., a lack of attention on shared channel features does not affect the expression of discriminative channel features).
In the above scheme provided by the embodiment of the present invention, the internal parameters of the encoder and the vector-to-sequence decoder need to be optimized in advance with a loss function, which is expressed as:

$L = \sum_{t=1}^{M} \mathrm{CE}(y_t, \hat{y}_t)$

wherein $y_t$ is the category to which the predicted character of time step t belongs, $\hat{y}_t$ is the real label of the character of time step t (examples of real labels are provided on the right side of fig. 4), M is the total number of time steps, equal to the maximum number of characters (e.g., M=25 may be set), $\mathrm{CE}$ denotes the cross-entropy between prediction and label, and L is the loss function.
Illustratively, stochastic gradient descent (SGD) may be employed for end-to-end training. At the beginning of training, the learning rate is set to 0.001; it is reduced to 0.0001 after 10 epochs (rounds), and training is completed after 20 epochs in total.
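The training schedule can be sketched as follows, under the stated settings (SGD, learning rate 0.001 reduced to 0.0001 after 10 epochs, 20 epochs in total); `encoder`, `aggregate`, `decoder`, and `train_loader` are the hypothetical components sketched above, and the momentum value and exact loss reduction are assumptions.

```python
import torch

# encoder / aggregate / decoder / train_loader are the hypothetical
# components sketched above; labels: (B, M) character-class indices.
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.SGD(params, lr=1e-3, momentum=0.9)  # momentum assumed
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(20):
    if epoch == 10:                  # reduce the learning rate after 10 epochs
        for group in optimizer.param_groups:
            group["lr"] = 1e-4
    for images, labels in train_loader:
        logits = decoder(aggregate(encoder(images)))       # (B, M, K)
        # per-step cross-entropy (the loss L above), averaged over B and M
        loss = criterion(logits.flatten(0, 1), labels.flatten())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```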
For example, existing datasets may be employed for training, such as the ST dataset together with the 90K dataset. The ST (SynthText) dataset is a synthetic dataset containing 800,000 synthetic images; the 90K (Synth90K) dataset is another synthetic dataset containing 9 million images. The ST dataset is used for training together with the 90K dataset.
Compared with various existing scene character recognition methods, the method provided by the embodiment of the invention achieves better recognition accuracy. To demonstrate the performance of the invention more intuitively, experiments are described below.
1. The performance of the present invention in a scene word recognition task is described in connection with a dataset.
The datasets employed in this section include: IIIT5K, IC13, SVT, IC15, SVTP, and CT.
ICDAR2013 (IC13): the dataset contains 1095 test images; images containing fewer than 3 characters or containing non-alphanumeric characters are discarded in the experiment.
ICDAR2015 (IC15): the dataset provides 500 scene images. After filtering out some extremely distorted images, 1811 cropped text image patches are retained.
IIIT5K-Words (IIIT5K): the dataset contains 3000 images collected from websites, all of which are used in the experiment.
Street View Text (SVT): the dataset consists of 647 text image patches cropped from 250 Google Street View images according to word-level labels.
Street View Text-Perspective (SVTP): the dataset contains 639 images, also cropped from Google Street View images, many of which are severely distorted.
CUTE80 (CT): the dataset is used to evaluate the performance of the model in recognizing curved text. It contains 288 cropped text image patches.
The scheme provided by the invention achieves recognition accuracies of 95.1%, 98.4%, 96.0%, 87.0%, 90.5%, and 89.9% on the IIIT5K, IC13, SVT, IC15, SVTP, and CT datasets, respectively.
2. Comparison with the recognition results of existing methods.
FIG. 6 shows the results of comparing the present invention with existing scene text recognition methods. The four columns in fig. 6 are: the first column shows 6 natural scene images to be recognized; the second column shows the recognition results of the invention; the third column shows the results of an attention-based decoder (i.e., the method of fig. 1 with an attention-based decoder); and the fourth column shows the results of a CTC-based decoder (i.e., the method of fig. 1 with a CTC-based decoder). As can be seen from fig. 6, the present invention accurately recognizes every character in each image.
From the description of the above embodiments, it will be apparent to those skilled in the art that the above embodiments may be implemented in software, or by software plus a necessary general hardware platform. With this understanding, the technical solutions of the foregoing embodiments may be embodied in a software product, which may be stored in a nonvolatile storage medium (such as a CD-ROM, USB flash drive, or removable hard disk) and includes several instructions for causing a computer device (such as a personal computer, server, or network device) to perform the methods of the embodiments of the present invention.
Example two
The invention also provides a natural scene text recognition system, which is mainly realized based on the method provided in the previous embodiment; as shown in fig. 7, it mainly comprises:
the encoder is used for converting the natural scene image to be recognized into sequence information and extracting multi-granularity visual feature vectors through the multi-layer Transformer module;
the feature aggregation module is used for aggregating the multi-granularity visual feature vectors to obtain a global vector;
and the vector-to-sequence decoder is used for generating a channel attention map for each time step in parallel by using the global vector, obtaining a character feature vector for each time step by combining with the global vector, and predicting the character of each time step by using the character feature vector of each time step.
In the embodiment of the invention, generating the channel attention map of each time step in parallel by using the global vector comprises: generating corresponding time embedding information for each time step, introducing the time embedding information of each time step into the global vector through the first fully connected layer, and sequentially obtaining the channel attention map of each time step through the second fully connected layer, the activation function, and the normalization layer.
In the embodiment of the present invention, the manner of generating the channel attention map of a single time step is expressed as:

$A_t = \mathrm{Softmax}\big(\sigma\big(\mathrm{FC}_2(\mathrm{FC}_1(V) + E_t)\big)\big)$

wherein $\mathrm{FC}_1$ represents the first fully connected layer, $\mathrm{FC}_2$ represents the second fully connected layer, $E_t$ represents the time embedding information corresponding to time step t, $A_t$ represents the channel attention map of time step t, V represents the global vector, $\sigma$ represents the activation function, and $\mathrm{Softmax}$ is the normalized exponential function, i.e., the normalization operation performed by the normalization layer.
In the embodiment of the invention, the internal parameters of the encoder and the vector-to-sequence decoder are optimized in advance with a loss function expressed as:

$L = \sum_{t=1}^{M} \mathrm{CE}(y_t, \hat{y}_t)$

wherein $y_t$ is the category to which the predicted character of time step t belongs, $\hat{y}_t$ is the real label of the character of time step t, M is the total number of time steps, equal to the maximum number of characters, $\mathrm{CE}$ denotes the cross-entropy between prediction and label, and L is the loss function.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the system is divided into different functional modules to perform all or part of the functions described above.
Example III
The present invention also provides a processing apparatus, as shown in fig. 8, which mainly includes: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods provided by the foregoing embodiments.
The processing device further comprises at least one input device and at least one output device; in the processing device, the processor, the memory, the input device, and the output device are connected through a bus.
In the embodiment of the invention, the specific types of the memory, the input device and the output device are not limited; for example:
the input device can be a touch screen, an image acquisition device, a physical key or a mouse and the like;
the output device may be a display terminal;
the memory may be random access memory (RAM) or non-volatile memory, such as disk memory.
Example IV
The invention also provides a readable storage medium storing a computer program which, when executed by a processor, implements the method provided by the foregoing embodiments.
The readable storage medium according to the embodiment of the present invention may be provided as a computer-readable storage medium in the aforementioned processing apparatus, for example, as the memory in the processing apparatus. The readable storage medium may be any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (8)

1. A natural scene text recognition method is characterized by comprising the following steps:
step 1, converting a natural scene image to be recognized into sequence information, and extracting multi-granularity visual feature vectors through a multi-layer Transformer module;
step 2, aggregating the multi-granularity visual feature vectors to obtain a global vector;
step 3, generating a channel attention map for each time step in parallel by using the global vector, obtaining a character feature vector for each time step by combining with the global vector, and predicting the character of each time step by using the character feature vector of each time step;
wherein generating a channel attention map for each time step in parallel by using the global vector comprises: generating corresponding time embedding information for each time step, introducing the time embedding information of each time step into the global vector through a first fully connected layer, and sequentially obtaining the channel attention map of each time step through a second fully connected layer, an activation function, and a normalization layer.
2. The method of claim 1, wherein the channel attention map of a single time step is generated as:

$A_t = \mathrm{Softmax}\big(\sigma\big(\mathrm{FC}_2(\mathrm{FC}_1(V) + E_t)\big)\big)$

wherein $\mathrm{FC}_1$ denotes the first fully connected layer, $\mathrm{FC}_2$ denotes the second fully connected layer, $E_t$ denotes the time embedding information corresponding to time step t, $A_t$ denotes the channel attention map of time step t, V denotes the global vector, $\sigma$ denotes the activation function, and $\mathrm{Softmax}$ denotes the normalized exponential function, i.e., the normalization operation performed by the normalization layer.
3. The method of claim 1, wherein step 1 is implemented by an encoder and step 3 is implemented by a vector-to-sequence decoder, and the internal parameters of the encoder and the vector-to-sequence decoder are optimized in advance with a loss function expressed as:

$L = \sum_{t=1}^{M} \mathrm{CE}(y_t, \hat{y}_t)$

wherein $y_t$ is the category to which the predicted character of time step t belongs, $\hat{y}_t$ is the real label of the character of time step t, M is the total number of time steps, equal to the maximum number of characters, $\mathrm{CE}$ denotes the cross-entropy between prediction and label, and L is the loss function.
4. A natural scene text recognition system, comprising:
the encoder is used for converting a natural scene image to be recognized into sequence information and extracting multi-granularity visual feature vectors through a multi-layer Transformer module;
the feature aggregation module is used for aggregating the multi-granularity visual feature vectors to obtain a global vector;
the vector-to-sequence decoder is used for generating a channel attention map for each time step in parallel by using the global vector, obtaining a character feature vector for each time step by combining with the global vector, and predicting the character of each time step by using the character feature vector of each time step;
wherein generating a channel attention map for each time step in parallel by using the global vector comprises: generating corresponding time embedding information for each time step, introducing the time embedding information of each time step into the global vector through a first fully connected layer, and sequentially obtaining the channel attention map of each time step through a second fully connected layer, an activation function, and a normalization layer.
5. The natural scene text recognition system of claim 4, wherein the channel attention map of a single time step is generated as:

$A_t = \mathrm{Softmax}\big(\sigma\big(\mathrm{FC}_2(\mathrm{FC}_1(V) + E_t)\big)\big)$

wherein $\mathrm{FC}_1$ denotes the first fully connected layer, $\mathrm{FC}_2$ denotes the second fully connected layer, $E_t$ denotes the time embedding information corresponding to time step t, $A_t$ denotes the channel attention map of time step t, V denotes the global vector, $\sigma$ denotes the activation function, and $\mathrm{Softmax}$ denotes the normalized exponential function, i.e., the normalization operation performed by the normalization layer.
6. The natural scene text recognition system of claim 4, wherein the internal parameters of the encoder and the vector-to-sequence decoder are each optimized in advance with a loss function expressed as:

$L = \sum_{t=1}^{M} \mathrm{CE}(y_t, \hat{y}_t)$

wherein $y_t$ is the category to which the predicted character of time step t belongs, $\hat{y}_t$ is the real label of the character of time step t, M is the total number of time steps, equal to the maximum number of characters, $\mathrm{CE}$ denotes the cross-entropy between prediction and label, and L is the loss function.
7. A processing apparatus, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-3.
8. A readable storage medium storing a computer program, which when executed by a processor implements the method of any one of claims 1-3.
CN202310623773.4A 2023-05-30 2023-05-30 Natural scene character recognition method, system, equipment and storage medium Active CN116343190B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310623773.4A CN116343190B (en) 2023-05-30 2023-05-30 Natural scene character recognition method, system, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310623773.4A CN116343190B (en) 2023-05-30 2023-05-30 Natural scene character recognition method, system, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116343190A CN116343190A (en) 2023-06-27
CN116343190B true CN116343190B (en) 2023-08-29

Family

ID=86879119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310623773.4A Active CN116343190B (en) 2023-05-30 2023-05-30 Natural scene character recognition method, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116343190B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117037136B (en) * 2023-10-10 2024-02-23 中国科学技术大学 Scene text recognition method, system, equipment and storage medium
CN117912005B (en) * 2024-03-19 2024-07-05 中国科学技术大学 Text recognition method, system, device and medium using single mark decoding

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541501A (en) * 2020-12-18 2021-03-23 北京中科研究院 Scene character recognition method based on visual language modeling network
CN114399757A (en) * 2022-01-13 2022-04-26 福州大学 Natural scene text recognition method and system for multi-path parallel position correlation network
CN115116066A (en) * 2022-06-17 2022-09-27 复旦大学 Scene text recognition method based on character distance perception
CN115471851A (en) * 2022-10-11 2022-12-13 小语智能信息科技(云南)有限公司 Burma language image text recognition method and device fused with double attention mechanism

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10565305B2 (en) * 2016-11-18 2020-02-18 Salesforce.Com, Inc. Adaptive attention model for image captioning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541501A (en) * 2020-12-18 2021-03-23 北京中科研究院 Scene character recognition method based on visual language modeling network
CN114399757A (en) * 2022-01-13 2022-04-26 福州大学 Natural scene text recognition method and system for multi-path parallel position correlation network
CN115116066A (en) * 2022-06-17 2022-09-27 复旦大学 Scene text recognition method based on character distance perception
CN115471851A (en) * 2022-10-11 2022-12-13 小语智能信息科技(云南)有限公司 Burma language image text recognition method and device fused with double attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Chongsheng et al. Transformer-based character detection algorithm for low-quality scenes. Journal of Beijing University of Posts and Telecommunications, 2022, Vol. 45, No. 2, pp. 124-130. *

Also Published As

Publication number Publication date
CN116343190A (en) 2023-06-27

Similar Documents

Publication Publication Date Title
CN108804530B (en) Subtitling areas of an image
CN107291822B (en) Problem classification model training method, classification method and device based on deep learning
CN116343190B (en) Natural scene character recognition method, system, equipment and storage medium
CN111985239B (en) Entity identification method, entity identification device, electronic equipment and storage medium
CN111476023B (en) Method and device for identifying entity relationship
CN112541501B (en) Scene character recognition method based on visual language modeling network
CN111738169B (en) Handwriting formula recognition method based on end-to-end network model
CN112633431B (en) Tibetan-Chinese bilingual scene character recognition method based on CRNN and CTC
CN113449801B (en) Image character behavior description generation method based on multi-level image context coding and decoding
CN111914731B (en) Multi-mode LSTM video motion prediction method based on self-attention mechanism
CN114663915A (en) Image human-object interaction positioning method and system based on Transformer model
CN116311214B (en) License plate recognition method and device
CN113392265A (en) Multimedia processing method, device and equipment
CN114529903A (en) Text refinement network
CN115512195A (en) Image description method based on multi-interaction information fusion
Sah et al. Understanding temporal structure for video captioning
CN117235605B (en) Sensitive information classification method and device based on multi-mode attention fusion
CN117315249A (en) Image segmentation model training and segmentation method, system, equipment and medium
Chen et al. Audio captioning with meshed-memory transformer
CN114020871A (en) Multi-modal social media emotion analysis method based on feature fusion
Liu et al. Attention-based convolutional LSTM for describing video
CN117912005B (en) Text recognition method, system, device and medium using single mark decoding
CN110852206A (en) Scene recognition method and device combining global features and local features
CN118155231B (en) Document identification method, device, equipment, medium and product
CN116977436B (en) Burmese text image recognition method and device based on Burmese character cluster characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant