CN113408525A - Multilayer ternary pivot and bidirectional long-short term memory fused text recognition method


Info

Publication number
CN113408525A
Authority
CN
China
Prior art keywords
text
image
inputting
convolution
multilayer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110672336.2A
Other languages
Chinese (zh)
Other versions
CN113408525B (en)
Inventor
纪禄平
李�真
陈香
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Chonghu Information Technology Co ltd
Original Assignee
Chengdu Chonghu Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Chonghu Information Technology Co ltd filed Critical Chengdu Chonghu Information Technology Co ltd
Priority to CN202110672336.2A priority Critical patent/CN113408525B/en
Publication of CN113408525A publication Critical patent/CN113408525A/en
Application granted granted Critical
Publication of CN113408525B publication Critical patent/CN113408525B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Character Discrimination (AREA)

Abstract

The invention relates to the technical field of text recognition, in particular to a text recognition method fusing multilayer ternary pivot elements with bidirectional long short-term memory, which comprises the following steps: firstly, inputting a scene image; secondly, obtaining image feature output through a multilayer-fusion TPCANet model; thirdly, inputting the image features into a BLSTM network to predict confidence; fourthly, inputting the result into a fully connected layer to predict the coordinates of the most probable text box; fifthly, cutting the target text box out of the original image according to the text box coordinates; sixthly, inputting the cut text box into the multilayer-fusion TPCANet model to extract and output features containing more text information and spatial information; seventhly, inputting the features into the BLSTM network again to predict the character probabilities of the feature sequence; and eighthly, inputting the probability sequence into a CTC network, which predicts the maximum-probability sequence to realize transcription, thereby outputting the required text sequence. The invention has better recognition capability.

Description

Multilayer ternary pivot and bidirectional long-short term memory fused text recognition method
Technical Field
The invention relates to the technical field of text recognition, and in particular to a text recognition method fusing multilayer ternary pivot elements with bidirectional long short-term memory.
Background
The appearance of writing has been of great practical and historical significance for the development and transmission of human civilization: the exchange of ideas, the development of culture, and the recording of history, ancient and modern, all depend on it. Characters are not only carriers of information but also an important means by which humans understand the world; they can convey information on their own and can complement other visual elements to convey higher-level meaning. With vigorous economic and social development, text elements in natural scenes, such as bus stop boards, road signboards and shopping-mall billboards, are visible everywhere. These textual cues reveal a large amount of contextual information awaiting exploration and use.
Before deep learning was applied to text detection and recognition, document detection already had a corresponding solution: OCR (optical character recognition). In the past, hardware devices lagged behind and the practical demand for natural-scene text detection and recognition was low; OCR, the most advanced document recognition technology of its time, provided convenient services. Although OCR was limited to document recognition, had a low recognition rate, and required extensive manual assistance, the technology matured over a long period and now permeates many aspects of study and daily life.
In recent years, computer software and hardware have developed rapidly, and intelligent applications such as smart homes, intelligent driving, robot guidance and photo-translation systems have been deployed one after another; by understanding information in the natural environment, these applications can provide very convenient services. The need to obtain text information from natural scenes has therefore become more pressing.
The difficulty of text recognition in natural-scene images is much greater than in scanned document images. Unlike scanned text, which is regular and contrasts consistently with its background, text in natural scenes is extremely rich in presentation. Scene text may mix multiple languages, and the characters themselves can vary in size, font style, color, brightness, contrast and so on. Text lines may also appear in irregular horizontal, vertical, curved, rotated or twisted patterns. In particular, the backgrounds of natural-scene images are complex and diverse: text may appear on a flat, curved or corrugated surface; complex interfering textures may appear near a text region, or a non-text region may have character-like texture; and the text region itself may be deformed by perspective or affine transformation, occlusion, blur, and the like.
Unlike in the past, with the rise of artificial intelligence, the development of deep neural network learning theory and the rapid iteration of computing hardware, natural-scene text detection and recognition has caught a wave of its own. The achievements of deep learning in image recognition lay a solid foundation for solving the natural-scene text detection and recognition problem, and this topic, at the intersection of computer vision, natural language processing and other disciplines, has become an important research hotspot. Because deep learning has strong fitting ability, deep neural networks hold more promise than traditional OCR for solving text detection and recognition in complex natural scenes.
Disclosure of Invention
It is an object of the present invention to provide a text recognition method fusing multilayer ternary pivot elements with bidirectional long short-term memory that overcomes some or all of the deficiencies of the prior art.
The invention provides a text recognition method fusing multilayer ternary pivot elements with bidirectional long short-term memory, which comprises the following steps:
firstly, inputting a scene image containing text information into a scene text model;
secondly, obtaining image feature output through a multilayer-fusion ternary principal component network (TPCANet) model;
thirdly, inputting the image features into a bidirectional long short-term memory (BLSTM) network to predict the confidence of the k anchor boxes corresponding to each pixel;
fourthly, inputting the result into a fully connected layer to predict the coordinates of the most probable text box;
fifthly, cutting the target text box out of the original image according to the text box coordinates;
sixthly, inputting the cut text box into the multilayer-fusion ternary principal component network (TPCANet) model to extract and output features containing more text information and spatial information;
seventhly, inputting the features into the BLSTM network again to predict the character probabilities of the feature sequence;
and eighthly, inputting the probability sequence into a CTC network, which predicts the maximum-probability sequence to realize transcription, thereby outputting the required text sequence.
Preferably, the algorithm of the multilayer-fusion TPCANet model proceeds as follows: let the data set contain N training samples of size m×n, let the filter size always be $k_1 \times k_2$, and let the ternary neighborhood radius be r;
Step 1, inputting an image data set I containing text characters;
Step 2, for each input image sample $I_i$, sampling blocks with neighborhood radius r and applying the ternary operation;
Step 3, de-meaning the ternarized image samples obtained in Step 2, and concatenating all de-meaned image samples into one matrix;
Step 4, performing principal component analysis on the matrix generated in Step 3 to obtain the $L_1$ convolution kernels of the first stage, $W_{l_1}^{1}$, $l_1 = 1, 2, \dots, L_1$;
Step 5, convolving the original image $I_i$ with the $L_1$ first-stage convolution kernels to obtain the corresponding $L_1$ feature images $I_i^{l_1} = I_i * W_{l_1}^{1}$;
Step 6, de-meaning the feature images generated in the first stage over the whole image data set, and concatenating all de-meaned feature images into one matrix;
Step 7, performing principal component analysis on the matrix generated in Step 6 to obtain the $L_2$ convolution kernels of the second stage, $W_{l_2}^{2}$, $l_2 = 1, 2, \dots, L_2$;
Step 8, convolving the $l_1$-th first-stage feature image $I_i^{l_1}$ of the i-th original image $I_i$ with the $L_2$ second-stage convolution kernels to obtain the corresponding $L_2$ feature images $O_i^{l_1 l_2} = I_i^{l_1} * W_{l_2}^{2}$; the i-th original image thus generates $L_1 \times L_2$ feature images in the second stage, $\{ O_i^{l_1 l_2} \}$, where $l_1 = 1, 2, \dots, L_1$ and $l_2 = 1, 2, \dots, L_2$;
Step 9, performing weighted fusion of the first-stage convolution results of the i-th image sample obtained in Step 5 and the second-stage convolution results obtained in Step 8 according to the weighted-fusion formulas (given as equation images in the source).
Preferably, in the second step, feature extraction is performed first, followed by sliding sampling over multi-size windows; a 3×3 spatial sample block is used for sliding sampling over the convolution result carrying spatial information.
Preferably, in step three, the BLSTM encodes the recurrent context from two directions: the convolution features of each sliding window are cyclically and sequentially input into two LSTM networks from the two directions, updating the recurrent state $H_t$ in the internal feature hidden layer;
$H_t = \varphi(H_{t-1}, X_t)$, $t = 1, 2, \dots, W$;
where $H_t$ is the recurrent internal state, computed jointly from the current input $X_t$ and the preceding state $H_{t-1}$; $X_t \in \mathbb{R}^{3 \times 3 \times C}$ is the convolution feature of the t-th sliding sample window.
The text recognition method fusing multilayer ternary pivot elements with bidirectional long short-term memory has a good feature extraction effect and performs well on natural-scene text recognition.
Drawings
FIG. 1 is a flowchart of the text recognition method fusing multilayer ternary pivot elements with bidirectional long short-term memory in embodiment 1;
FIG. 2 is a schematic diagram of the bidirectional LSTM in example 1;
FIG. 3 is a schematic representation of the CTC transcription process in example 1;
fig. 4 is a schematic diagram illustrating recognition of a natural scene text recognition application in embodiment 1.
Detailed Description
For a further understanding of the invention, reference should be made to the following detailed description taken in conjunction with the accompanying drawings and examples. It is to be understood that the examples are illustrative of the invention and not limiting.
Example 1
As shown in FIG. 1, the present embodiment provides a text recognition method fusing multilayer ternary pivot elements with bidirectional long short-term memory, which comprises the following steps:
firstly, inputting a scene image containing text information into a scene text model;
secondly, obtaining image feature output through a multilayer-fusion ternary principal component network (TPCANet) model;
thirdly, inputting the image features into a bidirectional long short-term memory (BLSTM) network to predict the confidence of the k anchor boxes corresponding to each pixel;
fourthly, inputting the result into a fully connected layer to predict the coordinates of the most probable text box;
fifthly, cutting the target text box out of the original image according to the text box coordinates;
sixthly, inputting the cut text box into the multilayer-fusion ternary principal component network (TPCANet) model to extract and output features containing more text information and spatial information;
seventhly, inputting the features into the BLSTM network again to predict the character probabilities of the feature sequence;
and eighthly, inputting the probability sequence into a CTC network, which predicts the maximum-probability sequence to realize transcription, thereby outputting the required text sequence. An illustrative sketch of the complete pipeline is given below.
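As an illustration only, the following is a minimal, runnable sketch of how the eight steps chain together. Every helper here is a hypothetical stand-in producing dummy outputs (random features, a fixed box); it shows the data flow of the pipeline, not the patented models themselves.

```python
# Hypothetical stand-ins for the pipeline components; names and shapes are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def tpcanet_features(img):
    # Stand-in for the multilayer-fusion TPCANet feature extractor (steps 2 and 6).
    return rng.standard_normal((img.shape[0] // 4, img.shape[1] // 4, 8))

def blstm_anchor_confidence(feat, k=10):
    # Stand-in for step 3: confidence of k vertical anchors at each position.
    return rng.random(feat.shape[:2] + (k,))

def predict_text_box(conf):
    # Stand-in for step 4: fully connected regression of the best box (x0, y0, x1, y1).
    return (8, 8, 120, 40)

def crop_text_box(img, box):
    x0, y0, x1, y1 = box               # step 5: cut the box out of the original image
    return img[y0:y1, x0:x1]

def blstm_char_probabilities(feat, n_classes=37):
    # Stand-in for step 7: per-timestep character probabilities (incl. CTC blank).
    logits = rng.standard_normal((feat.shape[1], n_classes))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def ctc_greedy_decode(probs, blank=0):
    # Step 8: best-path CTC decoding (collapse repeats, drop blanks).
    best = probs.argmax(axis=1)
    out, prev = [], blank
    for c in best:
        if c != blank and c != prev:
            out.append(int(c))
        prev = c
    return out

image = rng.random((256, 256, 3))
box = predict_text_box(blstm_anchor_confidence(tpcanet_features(image)))
crop = crop_text_box(image, box)
labels = ctc_greedy_decode(blstm_char_probabilities(tpcanet_features(crop)))
print(labels)  # indices of the recognized characters
```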
The algorithm of the multilayer-fusion TPCANet model proceeds as follows: let the data set contain N training samples of size m×n, let the filter size always be $k_1 \times k_2$, and let the ternary neighborhood radius be r;
Step 1, inputting an image data set I containing text characters;
Step 2, for each input image sample $I_i$, sampling blocks with neighborhood radius r and applying the ternary operation;
Step 3, de-meaning the ternarized image samples obtained in Step 2, and concatenating all de-meaned image samples into one matrix;
Step 4, performing principal component analysis on the matrix generated in Step 3 to obtain the $L_1$ convolution kernels of the first stage, $W_{l_1}^{1}$, $l_1 = 1, 2, \dots, L_1$;
Step 5, convolving the original image $I_i$ with the $L_1$ first-stage convolution kernels to obtain the corresponding $L_1$ feature images $I_i^{l_1} = I_i * W_{l_1}^{1}$;
Step 6, de-meaning the feature images generated in the first stage over the whole image data set, and concatenating all de-meaned feature images into one matrix;
Step 7, performing principal component analysis on the matrix generated in Step 6 to obtain the $L_2$ convolution kernels of the second stage, $W_{l_2}^{2}$, $l_2 = 1, 2, \dots, L_2$;
Step 8, convolving the $l_1$-th first-stage feature image $I_i^{l_1}$ of the i-th original image $I_i$ with the $L_2$ second-stage convolution kernels to obtain the corresponding $L_2$ feature images $O_i^{l_1 l_2} = I_i^{l_1} * W_{l_2}^{2}$; the i-th original image thus generates $L_1 \times L_2$ feature images in the second stage, $\{ O_i^{l_1 l_2} \}$, where $l_1 = 1, 2, \dots, L_1$ and $l_2 = 1, 2, \dots, L_2$;
Step 9, performing weighted fusion of the first-stage convolution results of the i-th image sample obtained in Step 5 and the second-stage convolution results obtained in Step 8 according to the weighted-fusion formulas (given as equation images in the source). A sketch of this two-stage filter learning is given below.
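A hedged sketch of the two-stage PCA filter learning follows. The ternary neighborhood operation of Step 2 is specific to the patent and is replaced here by plain patch extraction, and the fusion weights of Step 9 are illustrative placeholders; only the de-mean, covariance-eigenvector and two-stage convolution structure of Steps 3 to 8 is shown as described.

```python
import numpy as np
from scipy.signal import convolve2d

def extract_patches(img, k1, k2):
    """All k1 x k2 blocks of a 2-D image, vectorized (one row per block)."""
    H, W = img.shape
    rows = [img[i:i + k1, j:j + k2].ravel()
            for i in range(H - k1 + 1) for j in range(W - k2 + 1)]
    return np.asarray(rows)

def pca_filters(patches, L, k1, k2):
    """PCA convolution kernels (Steps 3-4 and 6-7): de-mean each patch, then
    take the L leading eigenvectors of the patch covariance matrix."""
    patches = patches - patches.mean(axis=1, keepdims=True)
    _, vecs = np.linalg.eigh(patches.T @ patches)   # eigenvalues ascending
    return [vecs[:, -(j + 1)].reshape(k1, k2) for j in range(L)]

def tpcanet_two_stage(images, L1=4, L2=4, k1=3, k2=3):
    # Stage 1 (Steps 3-5): learn L1 kernels and convolve each image with them.
    P1 = np.vstack([extract_patches(im, k1, k2) for im in images])
    W1 = pca_filters(P1, L1, k1, k2)
    stage1 = [[convolve2d(im, w, mode="same") for w in W1] for im in images]

    # Stage 2 (Steps 6-8): learn L2 kernels over the first-stage feature maps.
    P2 = np.vstack([extract_patches(f, k1, k2) for maps in stage1 for f in maps])
    W2 = pca_filters(P2, L2, k1, k2)
    stage2 = [[convolve2d(f, w, mode="same") for f in maps for w in W2]
              for maps in stage1]                   # L1*L2 maps per image

    # Step 9: weighted fusion of the two stages (0.5/0.5 weights are placeholders).
    return [0.5 * np.mean(m1, axis=0) + 0.5 * np.mean(m2, axis=0)
            for m1, m2 in zip(stage1, stage2)]

imgs = [np.random.default_rng(i).random((32, 32)) for i in range(3)]
print(tpcanet_two_stage(imgs)[0].shape)   # (32, 32)
```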
In the second step, feature extraction is performed first, followed by sliding sampling over multi-size windows; a 3×3 spatial sample block is used for sliding sampling over the convolution result carrying spatial information. The sliding-window approach employs multi-scale windows to detect objects of different sizes and employs a vertical anchor mechanism to predict the location and text/non-text score of each fixed-width text proposal.
These sequential sliding-sample results are cyclically input into the BLSTM, which predicts the confidence of each text-line slice (text proposal box). Thanks to the unique bidirectional recurrent connection mechanism of the BLSTM, the detector can exploit the context information of a whole text line. The BLSTM encodes the recurrent context from two directions; colloquially speaking, the convolution features of each sliding window are cyclically and sequentially input into two LSTM networks from the two directions, updating the recurrent state $H_t$ in its internal feature hidden layer:
$H_t = \varphi(H_{t-1}, X_t)$, $t = 1, 2, \dots, W$;
where $H_t$ is the recurrent internal state, computed jointly from the current input $X_t$ and the preceding state $H_{t-1}$; $X_t \in \mathbb{R}^{3 \times 3 \times C}$ is the convolution feature of the t-th sliding sample window. The process is shown in figure 2.
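A minimal sketch of the recurrence $H_t = \varphi(H_{t-1}, X_t)$ over the sliding-window features, run in both directions as in a BLSTM. For brevity a plain tanh recurrent cell stands in for the LSTM cell and the weights are random; the point is the bidirectional state update and concatenation, not a trained detector.

```python
import numpy as np

def rnn_pass(X, Wx, Wh, b):
    """One directional pass over the T sliding-window features X, shape (T, D)."""
    H = np.zeros(Wh.shape[0])
    states = []
    for x in X:                              # t = 1, 2, ..., W
        H = np.tanh(Wx @ x + Wh @ H + b)     # H_t = phi(H_{t-1}, X_t)
        states.append(H)
    return np.asarray(states)

def blstm_encode(X, d_hidden=64, seed=0):
    """Encode the window sequence from both directions and concatenate the states."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    params = [(rng.standard_normal((d_hidden, D)) * 0.1,
               rng.standard_normal((d_hidden, d_hidden)) * 0.1,
               np.zeros(d_hidden)) for _ in range(2)]
    fwd = rnn_pass(X, *params[0])
    bwd = rnn_pass(X[::-1], *params[1])[::-1]    # backward direction, re-aligned
    return np.concatenate([fwd, bwd], axis=1)    # (T, 2 * d_hidden)

# X_t in R^{3x3xC}, flattened: e.g. C = 8 channels gives D = 72
windows = np.random.default_rng(1).standard_normal((20, 72))
print(blstm_encode(windows).shape)   # (20, 128)
```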
CTC (connectionist temporal classification) is a technique for summarizing continuous features between characters; it solves the problem of input and output data labels that do not correspond one-to-one, and is widely used in text-line recognition and speech recognition, where input and output cannot be aligned. The core of CTC is a loss function that measures how much the input sequence, after passing through the neural network, differs from the true output. Mathematically, the loss computation marginalizes over the overall probability, solves for the label sequence with the highest probability, and then outputs the corresponding text sequence, i.e., it maximizes the posterior probability P(Y|X) for a given input X. FIG. 3 shows the CTC transcription of the input feature sequence: $X_0$ to $X_{14}$ are the enumerated features, which contain the positional relationships between features and are input into the CTC model in order. CTC solves for the maximum-probability sequence of the feature sequence. The blocks in the figure represent the probability that the current feature is a certain character, with darker blocks representing higher probability.
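Since the CTC loss is the negative log of the total alignment probability, the following sketch computes P(Y|X) with the standard CTC forward (alpha) recursion over the blank-extended label sequence. This is the textbook CTC computation, not code from the patent; probabilities are kept in linear space for clarity, so very long sequences would need log-space arithmetic.

```python
import numpy as np

def ctc_log_likelihood(probs, labels, blank=0):
    """log P(Y|X) for CTC: sum over all alignments via the forward recursion.

    probs : (T, C) per-timestep character distributions from the BLSTM.
    labels: target label indices, e.g. [3, 1, 1, 2].
    """
    ext = [blank]
    for y in labels:                 # interleave blanks: -, y1, -, y2, -, ...
        ext += [y, blank]
    T, S = probs.shape[0], len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, blank]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s >= 1:
                a += alpha[t - 1, s - 1]
            # the skip transition is allowed unless the symbol is blank or repeats
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    p = alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)
    return np.log(p)

rng = np.random.default_rng(0)
logits = rng.standard_normal((12, 5))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(ctc_log_likelihood(probs, [3, 1, 1, 2]))   # CTC loss = negative of this value
```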
The convolution kernels of the multilayer-fusion TPCANet feature extraction model proposed in this embodiment are obtained, following the principle of principal component analysis, by solving for the eigenvectors of the covariance matrix. Unlike other deep convolutional neural networks, no backpropagation is involved, so the model can be integrated into existing two-stage natural-scene text recognition models.
Experimental testing and analysis
Introduction of data set:
In the experiments, the natural-scene text recognition model based on multilayer-fusion TPCANet proposed herein is tested on the ICDAR 2003, ICDAR 2015 and SVT data sets. These data sets provide not only rich natural-scene pictures containing text information, but also the locations of the text regions in each image and the corresponding text. Taking the SVT data set as an example: it is drawn from Google Street View images and contains both high-quality photos and a large number of low-quality photos, and the images come with a train file and a test file in XML format that hold the coordinates of each image's text regions and the corresponding text sequences.
Data augmentation and data set partitioning:
First the data set is augmented: each image is augmented into two more, so the total number of samples grows to 3 times the original size. Because the experiments use several data sets of inconsistent size, the training and test sets are selected by an automatic split with repeated sampling (bootstrap). That is, each time a sample is drawn from the data set as an element of the training set and then put back; repeating this action M times yields a training set of size M, in which some samples appear repeatedly while others are never drawn, and the never-drawn samples serve as the test set. Split this way, the test set is roughly 1/e of the total data set:
$\lim_{M \to \infty} \left(1 - \frac{1}{M}\right)^{M} = \frac{1}{e} \approx 0.368$
where M is the data set size and $(1 - 1/M)$ is the probability that a given sample is not drawn in a single draw.
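A small sketch of this bootstrap split: M draws with replacement form the training set, and the never-drawn samples (about 1/e, i.e. roughly 36.8% of the data for large M) form the test set.

```python
import numpy as np

def bootstrap_split(M, seed=0):
    rng = np.random.default_rng(seed)
    train_idx = rng.integers(0, M, size=M)            # M draws with replacement
    test_idx = np.setdiff1d(np.arange(M), train_idx)  # the never-sampled items
    return train_idx, test_idx

train, test = bootstrap_split(10000)
print(len(test) / 10000)   # approaches (1 - 1/M)^M -> 1/e ~ 0.368
```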
Evaluation criteria:
the evaluation criteria selected herein during the text detection phase comprises three parts: recall (Recall), accuracy (Precision), and harmonic mean (FMeasure).
The evaluation criteria selected herein during the text detection phase comprises two parts: standard edit distance metric and word recognition rate.
The standard edit distance is the minimum number of times required for converting a sequence into another character through editing operation, and is based on the sum of normalized edit distances between the standard edit distance measurement recognition result and the real character, and the character string SiAnd SjNormalized edit distance of (1).
The character recognition rate is another evaluation standard for analyzing the performance of the text recognition model, and is the ratio of the total number of correctly recognized characters to the total number of all characters to be recognized. The character recognition rate is further divided into evaluation criteria of dictionary presence and dictionary absence according to the presence or absence of constraints. Dictionary-constrained transcription finds the character with the minimum edit distance from the original output from the dictionary, and dictionary-free transcription directly takes the maximum probability label value predicted at the time t as a result.
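For concreteness, a sketch of these metrics as described above: the standard (Levenshtein) edit distance, its normalized form between two strings $S_i$ and $S_j$, and a simple word-level recognition rate. Normalizing by the longer string's length is one common convention and is an assumption here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a single-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev = dp[0]
        dp[0] = i
        for j, cb in enumerate(b, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,              # deletion
                        dp[j - 1] + 1,          # insertion
                        prev + (ca != cb))      # substitution (free if equal)
            prev = cur
    return dp[-1]

def normalized_edit_distance(a: str, b: str) -> float:
    return edit_distance(a, b) / max(len(a), len(b), 1)

def word_recognition_rate(preds, truths) -> float:
    return sum(p == t for p, t in zip(preds, truths)) / len(truths)

print(edit_distance("kitten", "sitting"))                  # 3
print(round(normalized_edit_distance("text", "tent"), 2))  # 0.25
```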
Experiment one:
First, the detection performance of the natural-scene text recognition model based on multilayer-fusion TPCANet proposed herein is verified on a horizontal text data set (ICDAR 2003), a slanted and distorted text data set (ICDAR 2015), and a data set with large font variation and low image resolution (SVT).
TABLE 1
Data set     Recall    Precision   Harmonic mean
ICDAR 2003   84.21%    93.0%       88.2%
ICDAR 2015   52.12%    71.9%       64.43%
SVT          69.0%     81.9%       78%
As shown in Table 1, in this experiment the detection phase of the natural-scene text recognition model achieved a recall of 84.21%, a precision of 93.0% and a harmonic mean of 88.2% on the ICDAR 2003 data set. Analysis of the experimental data shows that the detection accuracy of the natural-scene text detection model on an image data set containing horizontal text is clearly higher than on data sets containing distorted and slanted text images.
Experiment two:
The recognition performance of the natural-scene text recognition model based on multilayer-fusion TPCANet proposed herein is verified on a horizontal text data set (ICDAR 2003) and a data set with large font variation and mostly low-resolution images (SVT).
TABLE 2
Data set     Accuracy
ICDAR 2003   89%
SVT          74.23%
As shown in Table 2, the experiments show that the natural-scene text recognition model based on multilayer-fusion TPCANet proposed herein performs well on the horizontal text recognition data set. Analysis of the experimental data shows that the recognition accuracy of the model on the horizontal-text image data set ICDAR 2003 is 14.77 percentage points higher than on the SVT data set, which contains distorted and slanted text images.
Experiment three:
Based on the above experimental results, the recognition rates of different models are next compared on a data set containing horizontal text images (ICDAR 2003).
TABLE 3
Method model         Recognition accuracy
MTPCANet-CTPN-CRNN   81.9%
CTPN+CRNN            71.9%
As shown in Table 3, the experiments show that the accuracy of the natural-scene text recognition model based on multilayer-fusion TPCANet (MTPCANet-CTPN-CRNN) proposed herein on the ICDAR 2003 data set is 81.9%, somewhat better than the original CTPN+CRNN model.
Finally, a test sample of the natural-scene text recognition application based on multilayer-fusion TPCANet is given; as shown in FIG. 4, the original image, the text detection result and the recognized text are output in sequence. In text box detection, each box is a text-box region predicted by the text detector, and the number in the box is the sum of the confidences of the anchors of that text line.
The experimental results show that the natural-scene text recognition model based on multilayer-fusion TPCANet presented herein requires slightly less training time than the classic combination of CTPN and CRNN, while its recognition accuracy improves slightly over the classic model; the natural-scene text application built on multilayer-fusion TPCANet therefore has certain practical significance.
The present invention and its embodiments have been described above schematically and without limitation; what is shown in the drawings is only one embodiment of the invention, and the actual structure is not limited thereto. Therefore, similar structures and embodiments designed by a person skilled in the art in light of this teaching, without inventive effort and without departing from the spirit of the invention, shall fall within the protection scope of the invention.

Claims (4)

1. A text recognition method fusing multilayer ternary pivot elements with bidirectional long short-term memory, characterized in that the method comprises the following steps:
firstly, inputting a scene image containing text information into a scene text model;
secondly, obtaining image feature output through a multilayer-fusion ternary principal component network (TPCANet) model;
thirdly, inputting the image features into a bidirectional long short-term memory (BLSTM) network to predict the confidence of the k anchor boxes corresponding to each pixel;
fourthly, inputting the result into a fully connected layer to predict the coordinates of the most probable text box;
fifthly, cutting the target text box out of the original image according to the text box coordinates;
sixthly, inputting the cut text box into the multilayer-fusion ternary principal component network (TPCANet) model to extract and output features containing more text information and spatial information;
seventhly, inputting the features into the BLSTM network again to predict the character probabilities of the feature sequence;
and eighthly, inputting the probability sequence into a CTC network, which predicts the maximum-probability sequence to realize transcription, thereby outputting the required text sequence.
2. The text recognition method as claimed in claim 1, characterized in that the algorithm of the multilayer-fusion TPCANet model proceeds as follows: let the data set contain N training samples of size m×n, let the filter size always be $k_1 \times k_2$, and let the ternary neighborhood radius be r;
Step 1, inputting an image data set I containing text characters;
Step 2, for each input image sample $I_i$, sampling blocks with neighborhood radius r and applying the ternary operation;
Step 3, de-meaning the ternarized image samples obtained in Step 2, and concatenating all de-meaned image samples into one matrix;
Step 4, performing principal component analysis on the matrix generated in Step 3 to obtain the $L_1$ convolution kernels of the first stage, $W_{l_1}^{1}$, $l_1 = 1, 2, \dots, L_1$;
Step 5, convolving the original image $I_i$ with the $L_1$ first-stage convolution kernels to obtain the corresponding $L_1$ feature images $I_i^{l_1} = I_i * W_{l_1}^{1}$;
Step 6, de-meaning the feature images generated in the first stage over the whole image data set, and concatenating all de-meaned feature images into one matrix;
Step 7, performing principal component analysis on the matrix generated in Step 6 to obtain the $L_2$ convolution kernels of the second stage, $W_{l_2}^{2}$, $l_2 = 1, 2, \dots, L_2$;
Step 8, convolving the $l_1$-th first-stage feature image $I_i^{l_1}$ of the i-th original image $I_i$ with the $L_2$ second-stage convolution kernels to obtain the corresponding $L_2$ feature images $O_i^{l_1 l_2} = I_i^{l_1} * W_{l_2}^{2}$; the i-th original image thus generates $L_1 \times L_2$ feature images in the second stage, $\{ O_i^{l_1 l_2} \}$, where $l_1 = 1, 2, \dots, L_1$ and $l_2 = 1, 2, \dots, L_2$;
Step 9, performing weighted fusion of the first-stage convolution results of the i-th image sample obtained in Step 5 and the second-stage convolution results obtained in Step 8 according to the weighted-fusion formulas (given as equation images in the source).
3. The text recognition method as claimed in claim 2, characterized in that in the second step, feature extraction is performed first, followed by sliding sampling over multi-size windows, a 3×3 spatial sample block being used for sliding sampling over the convolution result carrying spatial information.
4. The text recognition method as claimed in claim 3, characterized in that in step three, the BLSTM encodes the recurrent context from two directions: the convolution features of each sliding window are cyclically and sequentially input into two LSTM networks from the two directions, updating the recurrent state $H_t$ in the internal feature hidden layer;
$H_t = \varphi(H_{t-1}, X_t)$, $t = 1, 2, \dots, W$;
where $H_t$ is the recurrent internal state, computed jointly from the current input $X_t$ and the preceding state $H_{t-1}$; $X_t \in \mathbb{R}^{3 \times 3 \times C}$ is the convolution feature of the t-th sliding sample window.
CN202110672336.2A 2021-06-17 2021-06-17 Multilayer ternary pivot and bidirectional long-short term memory fused text recognition method Active CN113408525B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110672336.2A CN113408525B (en) 2021-06-17 2021-06-17 Multilayer ternary pivot and bidirectional long-short term memory fused text recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110672336.2A CN113408525B (en) 2021-06-17 2021-06-17 Multilayer ternary pivot and bidirectional long-short term memory fused text recognition method

Publications (2)

Publication Number Publication Date
CN113408525A true CN113408525A (en) 2021-09-17
CN113408525B CN113408525B (en) 2022-08-02

Family

ID=77684814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110672336.2A Active CN113408525B (en) 2021-06-17 2021-06-17 Multilayer ternary pivot and bidirectional long-short term memory fused text recognition method

Country Status (1)

Country Link
CN (1) CN113408525B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10002301B1 (en) * 2017-09-19 2018-06-19 King Fahd University Of Petroleum And Minerals System, apparatus, and method for arabic handwriting recognition
CN108257151A (en) * 2017-12-22 2018-07-06 西安电子科技大学 PCANet image change detection methods based on significance analysis
CN108573693A (en) * 2017-03-14 2018-09-25 Google LLC Text-to-speech synthesis using an autoencoder
US20180330183A1 (en) * 2017-05-11 2018-11-15 Canon Kabushiki Kaisha Image recognition apparatus, learning apparatus, image recognition method, learning method, and storage medium
CN110110095A (en) * 2019-04-29 2019-08-09 国网上海市电力公司 A kind of power command text matching technique based on shot and long term memory Recognition with Recurrent Neural Network
CN110472539A (en) * 2019-08-01 2019-11-19 上海海事大学 A kind of Method for text detection, device and computer storage medium
CN110533041A (en) * 2019-09-05 2019-12-03 重庆邮电大学 Multiple dimensioned scene text detection method based on recurrence
CN110956171A (en) * 2019-11-06 2020-04-03 广州供电局有限公司 Automatic nameplate identification method and device, computer equipment and storage medium
US20200151503A1 (en) * 2018-11-08 2020-05-14 Adobe Inc. Training Text Recognition Systems
CN111967471A (en) * 2020-08-20 2020-11-20 华南理工大学 Scene text recognition method based on multi-scale features
CN112561035A (en) * 2020-12-08 2021-03-26 上海海事大学 Fault diagnosis method based on CNN and LSTM depth feature fusion
CN112686252A (en) * 2020-12-28 2021-04-20 中国联合网络通信集团有限公司 License plate detection method and device

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108573693A (en) * 2017-03-14 2018-09-25 Google LLC Text-to-speech synthesis using an autoencoder
US20180330183A1 (en) * 2017-05-11 2018-11-15 Canon Kabushiki Kaisha Image recognition apparatus, learning apparatus, image recognition method, learning method, and storage medium
US10002301B1 (en) * 2017-09-19 2018-06-19 King Fahd University Of Petroleum And Minerals System, apparatus, and method for arabic handwriting recognition
CN108257151A (en) * 2017-12-22 2018-07-06 西安电子科技大学 PCANet image change detection methods based on significance analysis
US20200151503A1 (en) * 2018-11-08 2020-05-14 Adobe Inc. Training Text Recognition Systems
CN110110095A (en) * 2019-04-29 2019-08-09 国网上海市电力公司 A kind of power command text matching technique based on shot and long term memory Recognition with Recurrent Neural Network
CN110472539A (en) * 2019-08-01 2019-11-19 上海海事大学 A kind of Method for text detection, device and computer storage medium
CN110533041A (en) * 2019-09-05 2019-12-03 重庆邮电大学 Multiple dimensioned scene text detection method based on recurrence
CN110956171A (en) * 2019-11-06 2020-04-03 广州供电局有限公司 Automatic nameplate identification method and device, computer equipment and storage medium
CN111967471A (en) * 2020-08-20 2020-11-20 华南理工大学 Scene text recognition method based on multi-scale features
CN112561035A (en) * 2020-12-08 2021-03-26 上海海事大学 Fault diagnosis method based on CNN and LSTM depth feature fusion
CN112686252A (en) * 2020-12-28 2021-04-20 中国联合网络通信集团有限公司 License plate detection method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
WEIMING LANG: "Wind Power Prediction Based on Principal Component Analysis and Long Short-Term Memory Networks", 2019 IEEE Innovative Smart Grid Technologies *
万萌: "Research on Deep-Learning-Based Natural Scene Text Detection and Recognition Methods" (基于深度学习的自然场景文字检测与识别方法研究), CNKI Master's Theses Database *
张博宇: "Research on Text Detection and Recognition Methods in Natural Scenes" (自然场景下的文本检测与识别方法研究), CNKI Master's Theses Database *

Also Published As

Publication number Publication date
CN113408525B (en) 2022-08-02

Similar Documents

Publication Publication Date Title
CN110334705B (en) Language identification method of scene text image combining global and local information
CN110059217B (en) Image text cross-media retrieval method for two-stage network
JP6351689B2 (en) Attention based configurable convolutional neural network (ABC-CNN) system and method for visual question answering
CN111488826B (en) Text recognition method and device, electronic equipment and storage medium
CN113657124B (en) Multi-mode Mongolian translation method based on cyclic common attention transducer
Zhang et al. Deep hierarchical guidance and regularization learning for end-to-end depth estimation
Lin et al. STAN: A sequential transformation attention-based network for scene text recognition
CN105138998B (en) Pedestrian based on the adaptive sub-space learning algorithm in visual angle recognition methods and system again
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN111553350B (en) Deep learning-based attention mechanism text recognition method
CN111881262A (en) Text emotion analysis method based on multi-channel neural network
CN109299303B (en) Hand-drawn sketch retrieval method based on deformable convolution and depth network
CN112651940B (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
CN115858847B (en) Combined query image retrieval method based on cross-modal attention reservation
CN111738169A (en) Handwriting formula recognition method based on end-to-end network model
Wang et al. Urban building extraction from high-resolution remote sensing imagery based on multi-scale recurrent conditional generative adversarial network
CN111985525A (en) Text recognition method based on multi-mode information fusion processing
CN113657115A (en) Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion
CN115497122A (en) Method, device and equipment for re-identifying blocked pedestrian and computer-storable medium
Wang et al. Recognizing handwritten mathematical expressions as LaTex sequences using a multiscale robust neural network
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
Zatout et al. Semantic scene synthesis: application to assistive systems
CN114492646A (en) Image-text matching method based on cross-modal mutual attention mechanism
CN113159053A (en) Image recognition method and device and computing equipment
CN113408525B (en) Multilayer ternary pivot and bidirectional long-short term memory fused text recognition method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant