CN116343190B - Natural scene character recognition method, system, equipment and storage medium - Google Patents

Natural scene character recognition method, system, equipment and storage medium

Info

Publication number
CN116343190B
CN116343190B (application CN202310623773.4A)
Authority
CN
China
Prior art keywords
time step
character
vector
time
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310623773.4A
Other languages
Chinese (zh)
Other versions
CN116343190A (en)
Inventor
Zhang Yongdong (张勇东)
Wang Yuxin (王裕鑫)
Xie Hongtao (谢洪涛)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202310623773.4A priority Critical patent/CN116343190B/en
Publication of CN116343190A publication Critical patent/CN116343190A/en
Application granted granted Critical
Publication of CN116343190B publication Critical patent/CN116343190B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/63Scene text, e.g. street names
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19127Extracting features by transforming the feature space, e.g. multidimensional scaling; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173Classification techniques
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a natural scene character recognition method, system, device, and storage medium, which are mutually corresponding schemes. An image is encoded into a vector space so that local and global multi-granularity semantics are captured; a global vector is obtained through aggregation; and channel attention maps for different time steps are generated in parallel to decode the character information of each time step. Because a vector-to-sequence decoding mode is adopted, recognition speed is improved. Moreover, different characters share some feature expressions in the channel space (e.g., channels where all attention maps are strongly activated), while some channel weights carrying discriminative features retain certain differences; this ensures that the global vector can generate robust character feature expressions even under low-quality attention maps (e.g., a lack of attention on shared channel features does not affect the expression of discriminative channel features). The scheme provided by the invention can therefore recognize characters in natural scenes quickly and accurately.

Description

Natural scene character recognition method, system, equipment and storage medium
Technical Field
The present invention relates to the field of natural scene text recognition technologies, and in particular, to a natural scene text recognition method, system, device, and storage medium.
Background
Natural scene character recognition is a general-purpose character recognition technology that has become a hot research direction in computer vision and document analysis in recent years, and it is widely applied in fields such as automatic driving, license plate recognition, and assistance for visually impaired people. The goal of this task is to convert the text content in an image into editable text.
Since characters in natural scenes have characteristics such as low resolution, complex backgrounds, and susceptibility to noise interference, traditional character recognition technology cannot be directly applied to natural scenes. Character recognition in natural scenes therefore has great research significance.
With the development of deep learning in computer vision in recent years, scene character recognition methods have achieved good results. In the recognition process, as shown in fig. 1, the input image is first encoded into a sequence signal by a CNN (convolutional neural network); the sequence is then decoded into character information through an alignment structure realized by a sequence-to-sequence decoder, which may be an attention-based decoder or a CTC (Connectionist Temporal Classification) based decoder; the characters at the top of fig. 1 are examples. However, the sequence-to-sequence alignment structure is complex in design and cannot effectively balance the speed and robustness of the recognition process, so the speed and accuracy of scene character recognition still need improvement.
Disclosure of Invention
The invention aims to provide a natural scene character recognition method, a system, equipment and a storage medium, which can rapidly and accurately recognize characters of a natural scene.
The object of the invention is achieved through the following technical solutions:
a natural scene text recognition method comprises the following steps:
step 1, converting a natural scene image to be recognized into sequence information, and extracting multi-granularity visual feature vectors through a multi-layer Transformer module;
step 2, aggregating the multi-granularity visual feature vectors to obtain a global vector;
and step 3, generating a channel attention map for each time step in parallel by using the global vector, obtaining a character feature vector for each time step by combining with the global vector, and predicting the character of each time step by using the character feature vector of each time step.
A natural scene text recognition system, comprising:
the encoder is used for converting the natural scene image to be recognized into sequence information and extracting multi-granularity visual feature vectors through the multi-layer Transformer module;
the feature aggregation module is used for aggregating the multi-granularity visual feature vectors to obtain a global vector;
and the vector-to-sequence decoder is used for generating a channel attention map for each time step in parallel by using the global vector, obtaining a character feature vector for each time step by combining with the global vector, and predicting the character of each time step by using the character feature vector of each time step.
A processing apparatus, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the aforementioned methods.
A readable storage medium storing a computer program which, when executed by a processor, implements the method described above.
According to the technical scheme provided by the invention, images are encoded into a vector space so that local and global multi-granularity semantics are captured; a global vector is obtained through aggregation; and channel attention maps for different time steps are then generated in parallel to decode the character information of different time steps. Meanwhile, the invention simplifies the sequence-to-sequence decoding mode by adopting a vector-to-sequence decoding mode, thereby improving recognition speed.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a conventional scene text recognition method described in the background;
FIG. 2 is a schematic diagram of a natural scene text recognition method according to an embodiment of the present invention;
FIG. 3 is a flowchart of a natural scene text recognition method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a natural scene text recognition method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a visual channel attention map according to an embodiment of the present invention;
FIG. 6 is a diagram comparing the present invention with the conventional scene text recognition method according to the embodiment of the present invention;
FIG. 7 is a schematic diagram of a natural scene text recognition system according to an embodiment of the present invention;
fig. 8 is a schematic diagram of a processing apparatus according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
The terms that may be used herein will first be described as follows:
the terms "comprises," "comprising," "includes," "including," "has," "having" or other similar referents are to be construed to cover a non-exclusive inclusion. For example: including a particular feature (e.g., a starting material, component, ingredient, carrier, formulation, material, dimension, part, means, mechanism, apparatus, step, procedure, method, reaction condition, processing condition, parameter, algorithm, signal, data, product or article of manufacture, etc.), should be construed as including not only a particular feature but also other features known in the art that are not explicitly recited.
The following describes the natural scene text recognition method, system, device, and storage medium in detail. Details not described in the embodiments of the present invention belong to the prior art known to those skilled in the art. Where specific conditions are not noted in the examples, they follow conditions conventional in the art or suggested by the manufacturer; reagents or apparatus whose manufacturer is not noted are conventional commercially available products.
Example 1
The embodiment of the present invention provides a natural scene text recognition method; fig. 2 illustrates its main principle. Unlike the existing scene text recognition method illustrated in fig. 1, the encoder of the present invention is an image-to-vector encoder, in which ViT is a Vision Transformer network responsible for encoding the input image into multi-granularity visual feature vectors, each of which carries local and global multi-granularity semantics; the ViT network named here is only an example. A global vector (not shown in fig. 2) is then obtained by aggregation, and the global vector is decoded in parallel by a vector-to-sequence decoder to obtain the character information of different time steps; the characters at the top of fig. 2 are examples. In the vector-to-sequence decoder, the present invention uses a channel attention approach to decode the character information of each time step from the global vector. Experimental results show that the method provided by the invention achieves state-of-the-art performance on scene character recognition tasks. The method is described in detail below: fig. 3 shows its main flow, fig. 4 shows the relevant framework, and the method mainly includes the following steps:
step 1, converting a natural scene image to be identified into sequence information, and extracting multi-granularity visual feature vectors through a multi-layer transducer module.
In the embodiment of the present invention, step 1 is implemented by an encoder, which may be the ViT network mentioned above.
As shown in fig. 4, the encoder mainly includes an embedding layer and a multi-layer Transformer module. The embedding layer is responsible for converting the natural scene image to be recognized into sequence information; for example, a natural scene image of size 128×32 is converted into sequence information of length 32. The multi-layer Transformer module is responsible for extracting multi-granularity visual feature vectors from the sequence information, where N in fig. 4 is the number of Transformer layers; for example, N=12 may be set.
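To make the encoder structure concrete, below is a minimal PyTorch sketch of an image-to-vector encoder consistent with this description: a 128×32 image is embedded into a length-32 token sequence and passed through N=12 Transformer layers. The patch size, channel width, head count, and all module names are illustrative assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn

class ImageToVectorEncoder(nn.Module):
    """Embedding layer + N-layer Transformer (a sketch).
    A 128x32 input image becomes a 32-token sequence."""
    def __init__(self, in_ch=3, dim=384, num_layers=12, num_heads=6):
        super().__init__()
        # Patch embedding: 16x8 patches on a 32x128 image give
        # (32/16) * (128/8) = 32 tokens (the patch size is an assumption).
        self.patch_embed = nn.Conv2d(in_ch, dim, kernel_size=(16, 8), stride=(16, 8))
        self.pos_embed = nn.Parameter(torch.zeros(1, 32, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, images):            # images: (B, 3, 32, 128)
        x = self.patch_embed(images)      # (B, dim, 2, 16)
        x = x.flatten(2).transpose(1, 2)  # (B, 32, dim) sequence information
        return self.transformer(x + self.pos_embed)  # multi-granularity features
```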
Step 2, aggregating the multi-granularity visual feature vectors to obtain a global vector.
In the embodiment of the invention, step 2 can be realized by a feature aggregation module. A global average pooling operation may be adopted for aggregation: the mean of all visual feature vectors is computed to obtain the global vector. The global vector is then fed into the vector-to-sequence decoder to decode the characters at different time steps.
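As a sketch under the same assumptions as the encoder above, the aggregation step reduces to a mean over the token axis of the encoder output:

```python
import torch

def aggregate(features: torch.Tensor) -> torch.Tensor:
    """Global average pooling: the mean of all visual feature
    vectors along the sequence axis yields the global vector V."""
    # features: (B, L, C) -> V: (B, C)
    return features.mean(dim=1)
```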
Step 3, generating a channel attention map for each time step in parallel by using the global vector, obtaining a character feature vector for each time step by combining with the global vector, and predicting the character of each time step by using the character feature vector of each time step.
In the embodiment of the invention, step 3 can be realized by a vector-to-sequence decoder, and the main process is as follows:
(1) Generating a channel attention map for each time step in parallel by using the global vector: corresponding time embedding information is generated for each time step; the time embedding information of each time step is introduced into the global vector through the first fully connected layer; and the channel attention map of each time step is then obtained sequentially through the second fully connected layer, the activation function, and the normalization layer.
The manner in which the channel attention map for a single time step is generated is expressed as:

$A_t = \mathrm{Softmax}\big(\sigma\big(\mathrm{FC}_2(\mathrm{FC}_1(V) + E_t)\big)\big)$

wherein $\mathrm{FC}_1$ represents the first fully connected layer, $\mathrm{FC}_2$ represents the second fully connected layer, $E_t$ represents the time embedding information corresponding to time step t, $A_t$ represents the channel attention map of time step t, V represents the global vector, $\sigma$ represents the activation function, and $\mathrm{Softmax}$ is the normalized exponential function, i.e., the normalization operation performed by the normalization layer.
(2) Combining the channel attention map of each time step with the global vector to obtain the character feature vector of each time step; specifically, the channel attention map of each time step may be element-wise multiplied with the global vector:

$F_t = A_t \odot V$

wherein $F_t$ is the character feature vector of time step t and $\odot$ denotes element-wise (point-wise) multiplication.
(3) Classifying the character feature vector of each time step through a fully connected classification layer to predict the character of each time step; the prediction is expressed as:

$y_t = \mathrm{FC}_{\mathrm{cls}}(F_t)$

wherein $y_t$ is the category to which the predicted character of time step t belongs and $\mathrm{FC}_{\mathrm{cls}}$ is the fully connected classification layer.
As shown in fig. 4, the parallel channel attention map calculation module in the vector-to-sequence decoder is mainly responsible for executing parts (1) and (2), and the fully connected classification layer is responsible for part (3). In fig. 4, 1×C, M×C, and M×K each denote a dimension: 1×C is the dimension of the global vector, C is the number of channels, M is the maximum number of characters (i.e., the maximum time step), and K is the number of character categories.
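Parts (1) to (3) can be sketched as one small PyTorch module. The composition below follows the order described above (first fully connected layer plus time embedding, then second fully connected layer, activation, and channel-wise softmax); the choice of ReLU as the activation and all hyperparameter values are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class VectorToSequenceDecoder(nn.Module):
    """Parallel channel-attention decoder (a sketch of parts (1)-(3)).
    M: maximum time steps (characters), C: channels, K: character classes."""
    def __init__(self, C=384, M=25, K=97):
        super().__init__()
        self.time_embed = nn.Parameter(torch.zeros(M, C))  # E_t, one per time step
        self.fc1 = nn.Linear(C, C)   # first fully connected layer
        self.fc2 = nn.Linear(C, C)   # second fully connected layer
        self.act = nn.ReLU()         # activation function (an assumption)
        self.cls = nn.Linear(C, K)   # fully connected classification layer

    def forward(self, V):            # V: (B, C) global vector
        h = self.fc1(V).unsqueeze(1) + self.time_embed    # (B, M, C)
        A = torch.softmax(self.act(self.fc2(h)), dim=-1)  # channel attention maps A_t
        F = A * V.unsqueeze(1)       # element-wise product: character features F_t
        return self.cls(F)           # (B, M, K) per-time-step class logits
```

Because all M time steps are computed in a single pass rather than step by step, this is where the speed advantage of vector-to-sequence decoding comes from.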
Fig. 5 shows an example of visualized channel attention maps. The vector-to-sequence decoder has the property of feature multiplexing: different character classes share some feature expressions in the channel space (e.g., channels where all attention maps are strongly activated), while some channel weights carrying discriminative features still differ to a certain extent. Feature multiplexing thus ensures that the global vector can generate robust character feature vectors even under low-quality attention maps (e.g., a lack of attention on shared channel features does not affect the expression of discriminative channel features).
In the above scheme provided by the embodiment of the present invention, the internal parameters of the encoder and the vector-to-sequence decoder need to be optimized in advance with a loss function, which is expressed as:

$L = \sum_{t=1}^{M} \mathrm{CE}(y_t, \hat{y}_t)$

wherein $y_t$ is the category to which the predicted character of time step t belongs, $\hat{y}_t$ is the real label of the character of time step t (examples of real labels are provided on the right side of fig. 4), M is the total number of time steps, equal to the maximum number of characters (e.g., M=25 may be set), $\mathrm{CE}$ denotes the cross-entropy between prediction and label, and L is the loss function.
Illustratively, stochastic gradient descent (SGD) may be employed for end-to-end training. At the beginning of training, the learning rate is set to 0.001; it is reduced to 0.0001 after 10 epochs (rounds), and training is completed after 20 epochs in total.
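The training schedule can be sketched as follows, under the stated settings (SGD, learning rate 0.001 reduced to 0.0001 after 10 epochs, 20 epochs in total); `encoder`, `aggregate`, `decoder`, and `train_loader` are the hypothetical components sketched above, and the momentum value and exact loss reduction are assumptions.

```python
import torch

# encoder / aggregate / decoder / train_loader are the hypothetical
# components sketched above; labels: (B, M) character-class indices.
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.SGD(params, lr=1e-3, momentum=0.9)  # momentum assumed
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(20):
    if epoch == 10:                  # reduce the learning rate after 10 epochs
        for group in optimizer.param_groups:
            group["lr"] = 1e-4
    for images, labels in train_loader:
        logits = decoder(aggregate(encoder(images)))       # (B, M, K)
        # per-step cross-entropy (the loss L above), averaged over B and M
        loss = criterion(logits.flatten(0, 1), labels.flatten())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```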
For example, existing datasets may be employed for training, such as the ST dataset together with the 90K dataset. The ST (SynthText) dataset is a synthetic dataset containing 800,000 synthetic images; the 90K (Synth90K) dataset is another synthetic dataset containing 9 million images. The ST dataset is used for training together with the 90K dataset.
Compared with various existing scene character recognition methods, the method provided by the embodiment of the invention achieves better recognition accuracy. To demonstrate the performance of the invention more intuitively, experiments are described below.
1. The performance of the present invention in a scene word recognition task is described in connection with a dataset.
The datasets employed in this section include: IIIT5K, IC13, SVT, IC15, SVTP, and CT.
ICDAR2013 (IC13): the dataset contains 1095 test images; images containing fewer than 3 characters or containing non-alphanumeric characters are discarded in the experiment.
ICDAR2015 (IC15): the dataset provides 500 scene images. After filtering out some extremely distorted images, 1811 cropped text image patches are retained.
IIIT5K-Words (IIIT5K): the dataset contains 3000 images collected from websites, all of which are used in the experiment.
Street View Text (SVT): the dataset consists of 647 text image patches cropped from 250 Google Street View images according to word-level labels.
Street View Text-Perspective (SVTP): the dataset contains 639 images, also cropped from Google Street View images, many of which are severely distorted.
CUTE80 (CT): the dataset is used to evaluate the performance of the model in recognizing curved text. It contains 288 cropped text image patches.
The scheme provided by the invention achieves recognition accuracies of 95.1%, 98.4%, 96.0%, 87.0%, 90.5%, and 89.9% on the IIIT5K, IC13, SVT, IC15, SVTP, and CT datasets, respectively.
2. Comparison with the recognition results of existing methods.
FIG. 6 shows the results of comparing the present invention with existing scene text recognition methods. The four columns in fig. 6 are: the first column shows 6 natural scene images to be recognized; the second column shows the recognition results of the invention; the third column shows the results of an attention-based decoder (i.e., the method of fig. 1 with an attention-based decoder); and the fourth column shows the results of a CTC-based decoder (i.e., the method of fig. 1 with a CTC-based decoder). As can be seen from fig. 6, the present invention accurately recognizes every character in each image.
From the description of the above embodiments, it will be apparent to those skilled in the art that the above embodiments may be implemented in software, or by software plus a necessary general hardware platform. With this understanding, the technical solutions of the foregoing embodiments may be embodied in a software product, which may be stored in a nonvolatile storage medium (such as a CD-ROM, USB flash drive, or removable hard disk) and includes several instructions for causing a computer device (such as a personal computer, server, or network device) to perform the methods of the embodiments of the present invention.
Example two
The invention also provides a natural scene text recognition system, which is mainly realized based on the method provided in the previous embodiment; as shown in fig. 7, it mainly comprises:
the encoder is used for converting the natural scene image to be recognized into sequence information and extracting multi-granularity visual feature vectors through the multi-layer Transformer module;
the feature aggregation module is used for aggregating the multi-granularity visual feature vectors to obtain a global vector;
and the vector-to-sequence decoder is used for generating a channel attention map for each time step in parallel by using the global vector, obtaining a character feature vector for each time step by combining with the global vector, and predicting the character of each time step by using the character feature vector of each time step.
In the embodiment of the invention, generating the channel attention map of each time step in parallel by using the global vector comprises: generating corresponding time embedding information for each time step, introducing the time embedding information of each time step into the global vector through the first fully connected layer, and sequentially obtaining the channel attention map of each time step through the second fully connected layer, the activation function, and the normalization layer.
In the embodiment of the present invention, the manner of generating the channel attention map of a single time step is expressed as:

$A_t = \mathrm{Softmax}\big(\sigma\big(\mathrm{FC}_2(\mathrm{FC}_1(V) + E_t)\big)\big)$

wherein $\mathrm{FC}_1$ represents the first fully connected layer, $\mathrm{FC}_2$ represents the second fully connected layer, $E_t$ represents the time embedding information corresponding to time step t, $A_t$ represents the channel attention map of time step t, V represents the global vector, $\sigma$ represents the activation function, and $\mathrm{Softmax}$ is the normalized exponential function, i.e., the normalization operation performed by the normalization layer.
In the embodiment of the invention, the internal parameters of the encoder and the vector-to-sequence decoder are optimized in advance with a loss function expressed as:

$L = \sum_{t=1}^{M} \mathrm{CE}(y_t, \hat{y}_t)$

wherein $y_t$ is the category to which the predicted character of time step t belongs, $\hat{y}_t$ is the real label of the character of time step t, M is the total number of time steps, equal to the maximum number of characters, $\mathrm{CE}$ denotes the cross-entropy between prediction and label, and L is the loss function.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the system is divided into different functional modules to perform all or part of the functions described above.
Example III
The present invention also provides a processing apparatus, as shown in fig. 8, which mainly includes: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods provided by the foregoing embodiments.
The processing device further comprises at least one input device and at least one output device; in the processing device, the processor, the memory, the input device, and the output device are connected through a bus.
In the embodiment of the invention, the specific types of the memory, the input device and the output device are not limited; for example:
the input device can be a touch screen, an image acquisition device, a physical key or a mouse and the like;
the output device may be a display terminal;
the memory may be random access memory (RAM) or non-volatile memory, such as disk memory.
Example IV
The invention also provides a readable storage medium storing a computer program which, when executed by a processor, implements the method provided by the foregoing embodiments.
The readable storage medium according to the embodiment of the present invention may be provided as a computer-readable storage medium in the aforementioned processing apparatus, for example, as the memory in the processing apparatus. The readable storage medium may be any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (8)

1. A natural scene text recognition method is characterized by comprising the following steps:
step 1, converting a natural scene image to be recognized into sequence information, and extracting multi-granularity visual feature vectors through a multi-layer Transformer module;
step 2, aggregating the multi-granularity visual feature vectors to obtain a global vector;
step 3, generating a channel attention map for each time step in parallel by using the global vector, obtaining a character feature vector for each time step by combining with the global vector, and predicting the character of each time step by using the character feature vector of each time step;
wherein generating a channel attention map for each time step in parallel by using the global vector comprises: generating corresponding time embedding information for each time step, introducing the time embedding information of each time step into the global vector through a first fully connected layer, and sequentially obtaining the channel attention map of each time step through a second fully connected layer, an activation function, and a normalization layer.
2. The method of claim 1, wherein the channel attention map of a single time step is generated as:

$A_t = \mathrm{Softmax}\big(\sigma\big(\mathrm{FC}_2(\mathrm{FC}_1(V) + E_t)\big)\big)$

wherein $\mathrm{FC}_1$ denotes the first fully connected layer, $\mathrm{FC}_2$ denotes the second fully connected layer, $E_t$ denotes the time embedding information corresponding to time step t, $A_t$ denotes the channel attention map of time step t, V denotes the global vector, $\sigma$ denotes the activation function, and $\mathrm{Softmax}$ denotes the normalized exponential function, i.e., the normalization operation performed by the normalization layer.
3. The method of claim 1, wherein step 1 is implemented by an encoder and step 3 is implemented by a vector-to-sequence decoder, and the internal parameters of the encoder and the vector-to-sequence decoder are optimized in advance with a loss function expressed as:

$L = \sum_{t=1}^{M} \mathrm{CE}(y_t, \hat{y}_t)$

wherein $y_t$ is the category to which the predicted character of time step t belongs, $\hat{y}_t$ is the real label of the character of time step t, M is the total number of time steps, equal to the maximum number of characters, $\mathrm{CE}$ denotes the cross-entropy between prediction and label, and L is the loss function.
4. A natural scene text recognition system, comprising:
the encoder is used for converting a natural scene image to be recognized into sequence information and extracting multi-granularity visual feature vectors through a multi-layer Transformer module;
the feature aggregation module is used for aggregating the multi-granularity visual feature vectors to obtain a global vector;
the vector-to-sequence decoder is used for generating a channel attention map for each time step in parallel by using the global vector, obtaining a character feature vector for each time step by combining with the global vector, and predicting the character of each time step by using the character feature vector of each time step;
wherein generating a channel attention map for each time step in parallel by using the global vector comprises: generating corresponding time embedding information for each time step, introducing the time embedding information of each time step into the global vector through a first fully connected layer, and sequentially obtaining the channel attention map of each time step through a second fully connected layer, an activation function, and a normalization layer.
5. The natural scene text recognition system of claim 4, wherein the channel attention map of a single time step is generated as:

$A_t = \mathrm{Softmax}\big(\sigma\big(\mathrm{FC}_2(\mathrm{FC}_1(V) + E_t)\big)\big)$

wherein $\mathrm{FC}_1$ denotes the first fully connected layer, $\mathrm{FC}_2$ denotes the second fully connected layer, $E_t$ denotes the time embedding information corresponding to time step t, $A_t$ denotes the channel attention map of time step t, V denotes the global vector, $\sigma$ denotes the activation function, and $\mathrm{Softmax}$ denotes the normalized exponential function, i.e., the normalization operation performed by the normalization layer.
6. The natural scene text recognition system of claim 4, wherein the internal parameters of the encoder and the vector-to-sequence decoder are each optimized in advance with a loss function expressed as:

$L = \sum_{t=1}^{M} \mathrm{CE}(y_t, \hat{y}_t)$

wherein $y_t$ is the category to which the predicted character of time step t belongs, $\hat{y}_t$ is the real label of the character of time step t, M is the total number of time steps, equal to the maximum number of characters, $\mathrm{CE}$ denotes the cross-entropy between prediction and label, and L is the loss function.
7. A processing apparatus, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-3.
8. A readable storage medium storing a computer program, which when executed by a processor implements the method of any one of claims 1-3.
CN202310623773.4A 2023-05-30 2023-05-30 Natural scene character recognition method, system, equipment and storage medium Active CN116343190B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310623773.4A CN116343190B (en) 2023-05-30 2023-05-30 Natural scene character recognition method, system, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310623773.4A CN116343190B (en) 2023-05-30 2023-05-30 Natural scene character recognition method, system, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116343190A CN116343190A (en) 2023-06-27
CN116343190B true CN116343190B (en) 2023-08-29

Family

ID=86879119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310623773.4A Active CN116343190B (en) 2023-05-30 2023-05-30 Natural scene character recognition method, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116343190B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117037136B (en) * 2023-10-10 2024-02-23 中国科学技术大学 Scene text recognition method, system, equipment and storage medium
CN117912005B (en) * 2024-03-19 2024-07-05 中国科学技术大学 Text recognition method, system, device and medium using single mark decoding

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541501A (en) * 2020-12-18 2021-03-23 北京中科研究院 Scene character recognition method based on visual language modeling network
CN114399757A (en) * 2022-01-13 2022-04-26 福州大学 Natural scene text recognition method and system for multi-path parallel position correlation network
CN115116066A (en) * 2022-06-17 2022-09-27 复旦大学 Scene text recognition method based on character distance perception
CN115471851A (en) * 2022-10-11 2022-12-13 小语智能信息科技(云南)有限公司 Burma language image text recognition method and device fused with double attention mechanism

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10565305B2 (en) * 2016-11-18 2020-02-18 Salesforce.Com, Inc. Adaptive attention model for image captioning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541501A (en) * 2020-12-18 2021-03-23 北京中科研究院 Scene character recognition method based on visual language modeling network
CN114399757A (en) * 2022-01-13 2022-04-26 福州大学 Natural scene text recognition method and system for multi-path parallel position correlation network
CN115116066A (en) * 2022-06-17 2022-09-27 复旦大学 Scene text recognition method based on character distance perception
CN115471851A (en) * 2022-10-11 2022-12-13 小语智能信息科技(云南)有限公司 Burma language image text recognition method and device fused with double attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Chongsheng et al. Transformer-based character detection algorithm for low-quality scenes. Journal of Beijing University of Posts and Telecommunications, 2022, Vol. 45, No. 2, pp. 124-130. *

Also Published As

Publication number Publication date
CN116343190A (en) 2023-06-27

Similar Documents

Publication Publication Date Title
CN108804530B (en) Subtitling areas of an image
CN107291822B (en) Problem classification model training method, classification method and device based on deep learning
CN116343190B (en) Natural scene character recognition method, system, equipment and storage medium
CN111985239B (en) Entity identification method, entity identification device, electronic equipment and storage medium
CN111476023B (en) Method and device for identifying entity relationship
CN112541501B (en) Scene character recognition method based on visual language modeling network
CN111738169B (en) Handwriting formula recognition method based on end-to-end network model
CN112633431B (en) Tibetan-Chinese bilingual scene character recognition method based on CRNN and CTC
CN113449801B (en) Image character behavior description generation method based on multi-level image context coding and decoding
CN111914731B (en) Multi-mode LSTM video motion prediction method based on self-attention mechanism
CN114663915A (en) Image human-object interaction positioning method and system based on Transformer model
CN116311214B (en) License plate recognition method and device
CN113392265A (en) Multimedia processing method, device and equipment
CN114529903A (en) Text refinement network
CN115512195A (en) Image description method based on multi-interaction information fusion
Sah et al. Understanding temporal structure for video captioning
CN117235605B (en) Sensitive information classification method and device based on multi-mode attention fusion
CN117315249A (en) Image segmentation model training and segmentation method, system, equipment and medium
Chen et al. Audio captioning with meshed-memory transformer
CN114020871A (en) Multi-modal social media emotion analysis method based on feature fusion
Liu et al. Attention-based convolutional LSTM for describing video
CN117912005B (en) Text recognition method, system, device and medium using single mark decoding
CN110852206A (en) Scene recognition method and device combining global features and local features
CN118155231B (en) Document identification method, device, equipment, medium and product
CN116977436B (en) Burmese text image recognition method and device based on Burmese character cluster characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant