CN113408525A - Multilayer ternary pivot and bidirectional long-short term memory fused text recognition method


Info

Publication number
CN113408525A
Authority
CN
China
Prior art keywords
text
image
inputting
convolution
multilayer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110672336.2A
Other languages
Chinese (zh)
Other versions
CN113408525B (en)
Inventor
纪禄平
李�真
陈香
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Chonghu Information Technology Co ltd
Original Assignee
Chengdu Chonghu Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Chonghu Information Technology Co ltd filed Critical Chengdu Chonghu Information Technology Co ltd
Priority to CN202110672336.2A priority Critical patent/CN113408525B/en
Publication of CN113408525A publication Critical patent/CN113408525A/en
Application granted granted Critical
Publication of CN113408525B publication Critical patent/CN113408525B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Character Discrimination (AREA)

Abstract

The invention relates to the technical field of text recognition, in particular to a text recognition method fusing multilayer ternary pivot elements with bidirectional long short-term memory, which comprises the following steps: firstly, inputting a scene image; secondly, obtaining image feature output through a multilayer-fusion TPCANet model; thirdly, inputting the image features into a BLSTM network to predict confidence; fourthly, inputting the result into a fully connected layer to predict the coordinates of the most probable text box; fifthly, cutting the target text box out of the original image according to the text box coordinates; sixthly, inputting the cut text box into the multilayer-fusion TPCANet model to extract and output features containing more text information and spatial information; seventhly, inputting the features into the BLSTM network again to predict the character probabilities of the feature sequence; and eighthly, inputting the probability sequence into a CTC network, which predicts the maximum-probability sequence to realize transcription, thereby outputting the required text sequence. The invention has better recognition capability.

Description

Multilayer ternary pivot and bidirectional long-short term memory fused text recognition method
Technical Field
The invention relates to the technical field of text recognition, and in particular to a text recognition method fusing multilayer ternary pivot elements with bidirectional long short-term memory.
Background
The appearance of writing has been of great practical and historical significance for the development and transmission of human civilization: the exchange of ideas, the development of culture, and the recording of history, ancient and modern, all depend on it. Characters are not only carriers of information but also an important means by which humans understand the world; they can convey information on their own and can complement other visual elements to convey higher-level meaning. With vigorous economic and social development, text elements in natural scenes, such as bus stop boards, road signboards and shopping-mall billboards, are visible everywhere. These textual cues reveal a large amount of contextual information awaiting exploration and use.
Before deep learning was applied to text detection and recognition, document detection already had a corresponding solution: OCR (optical character recognition). In the past, hardware devices lagged behind and the practical demand for natural-scene text detection and recognition was low; OCR, the most advanced document recognition technology of its time, provided convenient services. Although OCR was limited to document recognition, had a low recognition rate, and required extensive manual assistance, the technology matured over a long period and now permeates many aspects of study and daily life.
In recent years, computer software and hardware have developed rapidly, and intelligent applications such as smart homes, intelligent driving, robot guidance and photo-translation systems have been deployed one after another; by understanding information in the natural environment, these applications can provide very convenient services. The need to obtain text information from natural scenes has therefore become more pressing.
The difficulty of text recognition in natural-scene images is much greater than in scanned document images. Unlike scanned text, which is regular and contrasts consistently with its background, text in natural scenes is extremely rich in presentation. Scene text may mix multiple languages, and the characters themselves can vary in size, font style, color, brightness, contrast and so on. Text lines may also appear in irregular horizontal, vertical, curved, rotated or twisted patterns. In particular, the backgrounds of natural-scene images are complex and diverse: text may appear on a flat, curved or corrugated surface; complex interfering textures may appear near a text region, or a non-text region may have character-like texture; and the text region itself may be deformed by perspective or affine transformation, occlusion, blur, and the like.
Unlike in the past, with the rise of artificial intelligence, the development of deep neural network learning theory and the rapid iteration of computing hardware, natural-scene text detection and recognition has caught a wave of its own. The achievements of deep learning in image recognition lay a solid foundation for solving the natural-scene text detection and recognition problem, and this topic, at the intersection of computer vision, natural language processing and other disciplines, has become an important research hotspot. Because deep learning has strong fitting ability, deep neural networks hold more promise than traditional OCR for solving text detection and recognition in complex natural scenes.
Disclosure of Invention
It is an object of the present invention to provide a text recognition method fusing multilayer ternary pivot elements with bidirectional long short-term memory that overcomes some or all of the deficiencies of the prior art.
The invention provides a text recognition method fusing multilayer ternary pivot elements with bidirectional long short-term memory, which comprises the following steps:
firstly, inputting a scene image containing text information into a scene text model;
secondly, obtaining image feature output through a multilayer-fusion ternary principal component network (TPCANet) model;
thirdly, inputting the image features into a bidirectional long short-term memory (BLSTM) network to predict the confidence of the k anchor boxes corresponding to each pixel;
fourthly, inputting the result into a fully connected layer to predict the coordinates of the most probable text box;
fifthly, cutting the target text box out of the original image according to the text box coordinates;
sixthly, inputting the cut text box into the multilayer-fusion ternary principal component network (TPCANet) model to extract and output features containing more text information and spatial information;
seventhly, inputting the features into the BLSTM network again to predict the character probabilities of the feature sequence;
and eighthly, inputting the probability sequence into a CTC network, which predicts the maximum-probability sequence to realize transcription, thereby outputting the required text sequence.
Preferably, the algorithm of the multilayer-fusion TPCANet model proceeds as follows: let the data set contain N training samples of size m×n, let the filter size always be $k_1 \times k_2$, and let the ternary neighborhood radius be r;
Step 1, inputting an image data set I containing text characters;
Step 2, for each input image sample $I_i$, sampling blocks with neighborhood radius r and applying the ternary operation;
Step 3, de-meaning the ternarized image samples obtained in Step 2, and concatenating all de-meaned image samples into one matrix;
Step 4, performing principal component analysis on the matrix generated in Step 3 to obtain the $L_1$ convolution kernels of the first stage, $W_{l_1}^{1}$, $l_1 = 1, 2, \dots, L_1$;
Step 5, convolving the original image $I_i$ with the $L_1$ first-stage convolution kernels to obtain the corresponding $L_1$ feature images $I_i^{l_1} = I_i * W_{l_1}^{1}$;
Step 6, de-meaning the feature images generated in the first stage over the whole image data set, and concatenating all de-meaned feature images into one matrix;
Step 7, performing principal component analysis on the matrix generated in Step 6 to obtain the $L_2$ convolution kernels of the second stage, $W_{l_2}^{2}$, $l_2 = 1, 2, \dots, L_2$;
Step 8, convolving the $l_1$-th first-stage feature image $I_i^{l_1}$ of the i-th original image $I_i$ with the $L_2$ second-stage convolution kernels to obtain the corresponding $L_2$ feature images $O_i^{l_1 l_2} = I_i^{l_1} * W_{l_2}^{2}$; the i-th original image thus generates $L_1 \times L_2$ feature images in the second stage, $\{ O_i^{l_1 l_2} \}$, where $l_1 = 1, 2, \dots, L_1$ and $l_2 = 1, 2, \dots, L_2$;
Step 9, performing weighted fusion of the first-stage convolution results of the i-th image sample obtained in Step 5 and the second-stage convolution results obtained in Step 8 according to the weighted-fusion formulas (given as equation images in the source).
Preferably, in the second step, feature extraction is performed first, followed by sliding sampling over multi-size windows; a 3×3 spatial sample block is used for sliding sampling over the convolution result carrying spatial information.
Preferably, in step three, the BLSTM encodes the recurrent context from two directions: the convolution features of each sliding window are cyclically and sequentially input into two LSTM networks from the two directions, updating the recurrent state $H_t$ in the internal feature hidden layer;
$H_t = \varphi(H_{t-1}, X_t)$, $t = 1, 2, \dots, W$;
where $H_t$ is the recurrent internal state, computed jointly from the current input $X_t$ and the preceding state $H_{t-1}$; $X_t \in \mathbb{R}^{3 \times 3 \times C}$ is the convolution feature of the t-th sliding sample window.
The text recognition method fusing multilayer ternary pivot elements with bidirectional long short-term memory has a good feature extraction effect and performs well on natural-scene text recognition.
Drawings
FIG. 1 is a flowchart of the text recognition method fusing multilayer ternary pivot elements with bidirectional long short-term memory in embodiment 1;
FIG. 2 is a schematic diagram of the bidirectional LSTM in example 1;
FIG. 3 is a schematic representation of the CTC transcription process in example 1;
fig. 4 is a schematic diagram illustrating recognition of a natural scene text recognition application in embodiment 1.
Detailed Description
For a further understanding of the invention, reference should be made to the following detailed description taken in conjunction with the accompanying drawings and examples. It is to be understood that the examples are illustrative of the invention and not limiting.
Example 1
As shown in FIG. 1, the present embodiment provides a text recognition method fusing multilayer ternary pivot elements with bidirectional long short-term memory, which comprises the following steps:
firstly, inputting a scene image containing text information into a scene text model;
secondly, obtaining image feature output through a multilayer-fusion ternary principal component network (TPCANet) model;
thirdly, inputting the image features into a bidirectional long short-term memory (BLSTM) network to predict the confidence of the k anchor boxes corresponding to each pixel;
fourthly, inputting the result into a fully connected layer to predict the coordinates of the most probable text box;
fifthly, cutting the target text box out of the original image according to the text box coordinates;
sixthly, inputting the cut text box into the multilayer-fusion ternary principal component network (TPCANet) model to extract and output features containing more text information and spatial information;
seventhly, inputting the features into the BLSTM network again to predict the character probabilities of the feature sequence;
and eighthly, inputting the probability sequence into a CTC network, which predicts the maximum-probability sequence to realize transcription, thereby outputting the required text sequence. An illustrative sketch of the complete pipeline is given below.
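As an illustration only, the following is a minimal, runnable sketch of how the eight steps chain together. Every helper here is a hypothetical stand-in producing dummy outputs (random features, a fixed box); it shows the data flow of the pipeline, not the patented models themselves.

```python
# Hypothetical stand-ins for the pipeline components; names and shapes are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def tpcanet_features(img):
    # Stand-in for the multilayer-fusion TPCANet feature extractor (steps 2 and 6).
    return rng.standard_normal((img.shape[0] // 4, img.shape[1] // 4, 8))

def blstm_anchor_confidence(feat, k=10):
    # Stand-in for step 3: confidence of k vertical anchors at each position.
    return rng.random(feat.shape[:2] + (k,))

def predict_text_box(conf):
    # Stand-in for step 4: fully connected regression of the best box (x0, y0, x1, y1).
    return (8, 8, 120, 40)

def crop_text_box(img, box):
    x0, y0, x1, y1 = box               # step 5: cut the box out of the original image
    return img[y0:y1, x0:x1]

def blstm_char_probabilities(feat, n_classes=37):
    # Stand-in for step 7: per-timestep character probabilities (incl. CTC blank).
    logits = rng.standard_normal((feat.shape[1], n_classes))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def ctc_greedy_decode(probs, blank=0):
    # Step 8: best-path CTC decoding (collapse repeats, drop blanks).
    best = probs.argmax(axis=1)
    out, prev = [], blank
    for c in best:
        if c != blank and c != prev:
            out.append(int(c))
        prev = c
    return out

image = rng.random((256, 256, 3))
box = predict_text_box(blstm_anchor_confidence(tpcanet_features(image)))
crop = crop_text_box(image, box)
labels = ctc_greedy_decode(blstm_char_probabilities(tpcanet_features(crop)))
print(labels)  # indices of the recognized characters
```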
The algorithm of the multilayer-fusion TPCANet model proceeds as follows: let the data set contain N training samples of size m×n, let the filter size always be $k_1 \times k_2$, and let the ternary neighborhood radius be r;
Step 1, inputting an image data set I containing text characters;
Step 2, for each input image sample $I_i$, sampling blocks with neighborhood radius r and applying the ternary operation;
Step 3, de-meaning the ternarized image samples obtained in Step 2, and concatenating all de-meaned image samples into one matrix;
Step 4, performing principal component analysis on the matrix generated in Step 3 to obtain the $L_1$ convolution kernels of the first stage, $W_{l_1}^{1}$, $l_1 = 1, 2, \dots, L_1$;
Step 5, convolving the original image $I_i$ with the $L_1$ first-stage convolution kernels to obtain the corresponding $L_1$ feature images $I_i^{l_1} = I_i * W_{l_1}^{1}$;
Step 6, de-meaning the feature images generated in the first stage over the whole image data set, and concatenating all de-meaned feature images into one matrix;
Step 7, performing principal component analysis on the matrix generated in Step 6 to obtain the $L_2$ convolution kernels of the second stage, $W_{l_2}^{2}$, $l_2 = 1, 2, \dots, L_2$;
Step 8, convolving the $l_1$-th first-stage feature image $I_i^{l_1}$ of the i-th original image $I_i$ with the $L_2$ second-stage convolution kernels to obtain the corresponding $L_2$ feature images $O_i^{l_1 l_2} = I_i^{l_1} * W_{l_2}^{2}$; the i-th original image thus generates $L_1 \times L_2$ feature images in the second stage, $\{ O_i^{l_1 l_2} \}$, where $l_1 = 1, 2, \dots, L_1$ and $l_2 = 1, 2, \dots, L_2$;
Step 9, performing weighted fusion of the first-stage convolution results of the i-th image sample obtained in Step 5 and the second-stage convolution results obtained in Step 8 according to the weighted-fusion formulas (given as equation images in the source). A sketch of this two-stage filter learning is given below.
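A hedged sketch of the two-stage PCA filter learning follows. The ternary neighborhood operation of Step 2 is specific to the patent and is replaced here by plain patch extraction, and the fusion weights of Step 9 are illustrative placeholders; only the de-mean, covariance-eigenvector and two-stage convolution structure of Steps 3 to 8 is shown as described.

```python
import numpy as np
from scipy.signal import convolve2d

def extract_patches(img, k1, k2):
    """All k1 x k2 blocks of a 2-D image, vectorized (one row per block)."""
    H, W = img.shape
    rows = [img[i:i + k1, j:j + k2].ravel()
            for i in range(H - k1 + 1) for j in range(W - k2 + 1)]
    return np.asarray(rows)

def pca_filters(patches, L, k1, k2):
    """PCA convolution kernels (Steps 3-4 and 6-7): de-mean each patch, then
    take the L leading eigenvectors of the patch covariance matrix."""
    patches = patches - patches.mean(axis=1, keepdims=True)
    _, vecs = np.linalg.eigh(patches.T @ patches)   # eigenvalues ascending
    return [vecs[:, -(j + 1)].reshape(k1, k2) for j in range(L)]

def tpcanet_two_stage(images, L1=4, L2=4, k1=3, k2=3):
    # Stage 1 (Steps 3-5): learn L1 kernels and convolve each image with them.
    P1 = np.vstack([extract_patches(im, k1, k2) for im in images])
    W1 = pca_filters(P1, L1, k1, k2)
    stage1 = [[convolve2d(im, w, mode="same") for w in W1] for im in images]

    # Stage 2 (Steps 6-8): learn L2 kernels over the first-stage feature maps.
    P2 = np.vstack([extract_patches(f, k1, k2) for maps in stage1 for f in maps])
    W2 = pca_filters(P2, L2, k1, k2)
    stage2 = [[convolve2d(f, w, mode="same") for f in maps for w in W2]
              for maps in stage1]                   # L1*L2 maps per image

    # Step 9: weighted fusion of the two stages (0.5/0.5 weights are placeholders).
    return [0.5 * np.mean(m1, axis=0) + 0.5 * np.mean(m2, axis=0)
            for m1, m2 in zip(stage1, stage2)]

imgs = [np.random.default_rng(i).random((32, 32)) for i in range(3)]
print(tpcanet_two_stage(imgs)[0].shape)   # (32, 32)
```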
In the second step, feature extraction is performed first, followed by sliding sampling over multi-size windows; a 3×3 spatial sample block is used for sliding sampling over the convolution result carrying spatial information. The sliding-window approach employs multi-scale windows to detect objects of different sizes and employs a vertical anchor mechanism to predict the location and text/non-text score of each fixed-width text proposal.
These sequential sliding-sample results are cyclically input into the BLSTM, which predicts the confidence of each text-line slice (text proposal box). Thanks to the unique bidirectional recurrent connection mechanism of the BLSTM, the detector can exploit the context information of a whole text line. The BLSTM encodes the recurrent context from two directions; colloquially speaking, the convolution features of each sliding window are cyclically and sequentially input into two LSTM networks from the two directions, updating the recurrent state $H_t$ in its internal feature hidden layer:
$H_t = \varphi(H_{t-1}, X_t)$, $t = 1, 2, \dots, W$;
where $H_t$ is the recurrent internal state, computed jointly from the current input $X_t$ and the preceding state $H_{t-1}$; $X_t \in \mathbb{R}^{3 \times 3 \times C}$ is the convolution feature of the t-th sliding sample window. The process is shown in figure 2.
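A minimal sketch of the recurrence $H_t = \varphi(H_{t-1}, X_t)$ over the sliding-window features, run in both directions as in a BLSTM. For brevity a plain tanh recurrent cell stands in for the LSTM cell and the weights are random; the point is the bidirectional state update and concatenation, not a trained detector.

```python
import numpy as np

def rnn_pass(X, Wx, Wh, b):
    """One directional pass over the T sliding-window features X, shape (T, D)."""
    H = np.zeros(Wh.shape[0])
    states = []
    for x in X:                              # t = 1, 2, ..., W
        H = np.tanh(Wx @ x + Wh @ H + b)     # H_t = phi(H_{t-1}, X_t)
        states.append(H)
    return np.asarray(states)

def blstm_encode(X, d_hidden=64, seed=0):
    """Encode the window sequence from both directions and concatenate the states."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    params = [(rng.standard_normal((d_hidden, D)) * 0.1,
               rng.standard_normal((d_hidden, d_hidden)) * 0.1,
               np.zeros(d_hidden)) for _ in range(2)]
    fwd = rnn_pass(X, *params[0])
    bwd = rnn_pass(X[::-1], *params[1])[::-1]    # backward direction, re-aligned
    return np.concatenate([fwd, bwd], axis=1)    # (T, 2 * d_hidden)

# X_t in R^{3x3xC}, flattened: e.g. C = 8 channels gives D = 72
windows = np.random.default_rng(1).standard_normal((20, 72))
print(blstm_encode(windows).shape)   # (20, 128)
```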
CTC (connectionist temporal classification) is a technique for summarizing continuous features between characters; it solves the problem of input and output data labels that do not correspond one-to-one, and is widely used in text-line recognition and speech recognition, where input and output cannot be aligned. The core of CTC is a loss function that measures how much the input sequence, after passing through the neural network, differs from the true output. Mathematically, the loss computation marginalizes over the overall probability, solves for the label sequence with the highest probability, and then outputs the corresponding text sequence, i.e., it maximizes the posterior probability P(Y|X) for a given input X. FIG. 3 shows the CTC transcription of the input feature sequence: $X_0$ to $X_{14}$ are the enumerated features, which contain the positional relationships between features and are input into the CTC model in order. CTC solves for the maximum-probability sequence of the feature sequence. The blocks in the figure represent the probability that the current feature is a certain character, with darker blocks representing higher probability.
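Since the CTC loss is the negative log of the total alignment probability, the following sketch computes P(Y|X) with the standard CTC forward (alpha) recursion over the blank-extended label sequence. This is the textbook CTC computation, not code from the patent; probabilities are kept in linear space for clarity, so very long sequences would need log-space arithmetic.

```python
import numpy as np

def ctc_log_likelihood(probs, labels, blank=0):
    """log P(Y|X) for CTC: sum over all alignments via the forward recursion.

    probs : (T, C) per-timestep character distributions from the BLSTM.
    labels: target label indices, e.g. [3, 1, 1, 2].
    """
    ext = [blank]
    for y in labels:                 # interleave blanks: -, y1, -, y2, -, ...
        ext += [y, blank]
    T, S = probs.shape[0], len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, blank]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s >= 1:
                a += alpha[t - 1, s - 1]
            # the skip transition is allowed unless the symbol is blank or repeats
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    p = alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)
    return np.log(p)

rng = np.random.default_rng(0)
logits = rng.standard_normal((12, 5))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(ctc_log_likelihood(probs, [3, 1, 1, 2]))   # CTC loss = negative of this value
```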
The convolution kernels of the multilayer-fusion TPCANet feature extraction model proposed in this embodiment are obtained, following the principle of principal component analysis, by solving for the eigenvectors of the covariance matrix. Unlike other deep convolutional neural networks, no backpropagation is involved, so the model can be integrated into existing two-stage natural-scene text recognition models.
Experimental testing and analysis
Introduction of data set:
In the experiments, the natural-scene text recognition model based on multilayer-fusion TPCANet proposed herein is tested on the ICDAR 2003, ICDAR 2015 and SVT data sets. These data sets provide not only rich natural-scene pictures containing text information, but also the locations of the text regions in each image and the corresponding text. Taking the SVT data set as an example: it is drawn from Google Street View images and contains both high-quality photos and a large number of low-quality photos, and the images come with a train file and a test file in XML format that hold the coordinates of each image's text regions and the corresponding text sequences.
Data augmentation and data set partitioning:
First the data set is augmented: each image is augmented into two more, so the total number of samples grows to 3 times the original size. Because the experiments use several data sets of inconsistent size, the training and test sets are selected by an automatic split with repeated sampling (bootstrap). That is, each time a sample is drawn from the data set as an element of the training set and then put back; repeating this action M times yields a training set of size M, in which some samples appear repeatedly while others are never drawn, and the never-drawn samples serve as the test set. Split this way, the test set is roughly 1/e of the total data set:
$\lim_{M \to \infty} \left(1 - \frac{1}{M}\right)^{M} = \frac{1}{e} \approx 0.368$
where M is the data set size and $(1 - 1/M)$ is the probability that a given sample is not drawn in a single draw.
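A small sketch of this bootstrap split: M draws with replacement form the training set, and the never-drawn samples (about 1/e, i.e. roughly 36.8% of the data for large M) form the test set.

```python
import numpy as np

def bootstrap_split(M, seed=0):
    rng = np.random.default_rng(seed)
    train_idx = rng.integers(0, M, size=M)            # M draws with replacement
    test_idx = np.setdiff1d(np.arange(M), train_idx)  # the never-sampled items
    return train_idx, test_idx

train, test = bootstrap_split(10000)
print(len(test) / 10000)   # approaches (1 - 1/M)^M -> 1/e ~ 0.368
```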
Evaluation criteria:
the evaluation criteria selected herein during the text detection phase comprises three parts: recall (Recall), accuracy (Precision), and harmonic mean (FMeasure).
The evaluation criteria selected herein during the text detection phase comprises two parts: standard edit distance metric and word recognition rate.
The standard edit distance is the minimum number of times required for converting a sequence into another character through editing operation, and is based on the sum of normalized edit distances between the standard edit distance measurement recognition result and the real character, and the character string SiAnd SjNormalized edit distance of (1).
The character recognition rate is another evaluation standard for analyzing the performance of the text recognition model, and is the ratio of the total number of correctly recognized characters to the total number of all characters to be recognized. The character recognition rate is further divided into evaluation criteria of dictionary presence and dictionary absence according to the presence or absence of constraints. Dictionary-constrained transcription finds the character with the minimum edit distance from the original output from the dictionary, and dictionary-free transcription directly takes the maximum probability label value predicted at the time t as a result.
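For concreteness, a sketch of these metrics as described above: the standard (Levenshtein) edit distance, its normalized form between two strings $S_i$ and $S_j$, and a simple word-level recognition rate. Normalizing by the longer string's length is one common convention and is an assumption here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a single-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev = dp[0]
        dp[0] = i
        for j, cb in enumerate(b, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,              # deletion
                        dp[j - 1] + 1,          # insertion
                        prev + (ca != cb))      # substitution (free if equal)
            prev = cur
    return dp[-1]

def normalized_edit_distance(a: str, b: str) -> float:
    return edit_distance(a, b) / max(len(a), len(b), 1)

def word_recognition_rate(preds, truths) -> float:
    return sum(p == t for p, t in zip(preds, truths)) / len(truths)

print(edit_distance("kitten", "sitting"))                  # 3
print(round(normalized_edit_distance("text", "tent"), 2))  # 0.25
```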
Experiment one:
First, the detection performance of the natural-scene text recognition model based on multilayer-fusion TPCANet proposed herein is verified on a horizontal text data set (ICDAR 2003), a slanted and distorted text data set (ICDAR 2015), and a data set with large font variation and low image resolution (SVT).
TABLE 1
Data set     Recall    Precision   Harmonic mean
ICDAR 2003   84.21%    93.0%       88.2%
ICDAR 2015   52.12%    71.9%       64.43%
SVT          69.0%     81.9%       78%
As shown in Table 1, in this experiment the detection phase of the natural-scene text recognition model achieved a recall of 84.21%, a precision of 93.0% and a harmonic mean of 88.2% on the ICDAR 2003 data set. Analysis of the experimental data shows that the detection accuracy of the natural-scene text detection model on an image data set containing horizontal text is clearly higher than on data sets containing distorted and slanted text images.
Experiment two:
The recognition performance of the natural-scene text recognition model based on multilayer-fusion TPCANet proposed herein is verified on a horizontal text data set (ICDAR 2003) and a data set with large font variation and mostly low-resolution images (SVT).
TABLE 2
Data set     Accuracy
ICDAR 2003   89%
SVT          74.23%
As shown in Table 2, the experiments show that the natural-scene text recognition model based on multilayer-fusion TPCANet proposed herein performs well on the horizontal text recognition data set. Analysis of the experimental data shows that the recognition accuracy of the model on the horizontal-text image data set ICDAR 2003 is 14.77 percentage points higher than on the SVT data set, which contains distorted and slanted text images.
Experiment three:
Based on the above experimental results, the recognition rates of different models are next compared on a data set containing horizontal text images (ICDAR 2003).
TABLE 3
Method model         Recognition accuracy
MTPCANet-CTPN-CRNN   81.9%
CTPN+CRNN            71.9%
As shown in Table 3, the experiments show that the accuracy of the natural-scene text recognition model based on multilayer-fusion TPCANet (MTPCANet-CTPN-CRNN) proposed herein on the ICDAR 2003 data set is 81.9%, somewhat better than the original CTPN+CRNN model.
Finally, a test sample of the natural-scene text recognition application based on multilayer-fusion TPCANet is given; as shown in FIG. 4, the original image, the text detection result and the recognized text are output in sequence. In text box detection, each box is a text-box region predicted by the text detector, and the number in the box is the sum of the confidences of the anchors of that text line.
The experimental results show that the natural-scene text recognition model based on multilayer-fusion TPCANet presented herein requires slightly less training time than the classic combination of CTPN and CRNN, while its recognition accuracy improves slightly over the classic model; the natural-scene text application built on multilayer-fusion TPCANet therefore has certain practical significance.
The present invention and its embodiments have been described above schematically and without limitation; what is shown in the drawings is only one embodiment of the invention, and the actual structure is not limited thereto. Therefore, similar structures and embodiments designed by a person skilled in the art in light of this teaching, without inventive effort and without departing from the spirit of the invention, shall fall within the protection scope of the invention.

Claims (4)

1. A text recognition method fusing multilayer ternary pivot elements with bidirectional long short-term memory, characterized in that the method comprises the following steps:
firstly, inputting a scene image containing text information into a scene text model;
secondly, obtaining image feature output through a multilayer-fusion ternary principal component network (TPCANet) model;
thirdly, inputting the image features into a bidirectional long short-term memory (BLSTM) network to predict the confidence of the k anchor boxes corresponding to each pixel;
fourthly, inputting the result into a fully connected layer to predict the coordinates of the most probable text box;
fifthly, cutting the target text box out of the original image according to the text box coordinates;
sixthly, inputting the cut text box into the multilayer-fusion ternary principal component network (TPCANet) model to extract and output features containing more text information and spatial information;
seventhly, inputting the features into the BLSTM network again to predict the character probabilities of the feature sequence;
and eighthly, inputting the probability sequence into a CTC network, which predicts the maximum-probability sequence to realize transcription, thereby outputting the required text sequence.
2. The text recognition method as claimed in claim 1, characterized in that the algorithm of the multilayer-fusion TPCANet model proceeds as follows: let the data set contain N training samples of size m×n, let the filter size always be $k_1 \times k_2$, and let the ternary neighborhood radius be r;
Step 1, inputting an image data set I containing text characters;
Step 2, for each input image sample $I_i$, sampling blocks with neighborhood radius r and applying the ternary operation;
Step 3, de-meaning the ternarized image samples obtained in Step 2, and concatenating all de-meaned image samples into one matrix;
Step 4, performing principal component analysis on the matrix generated in Step 3 to obtain the $L_1$ convolution kernels of the first stage, $W_{l_1}^{1}$, $l_1 = 1, 2, \dots, L_1$;
Step 5, convolving the original image $I_i$ with the $L_1$ first-stage convolution kernels to obtain the corresponding $L_1$ feature images $I_i^{l_1} = I_i * W_{l_1}^{1}$;
Step 6, de-meaning the feature images generated in the first stage over the whole image data set, and concatenating all de-meaned feature images into one matrix;
Step 7, performing principal component analysis on the matrix generated in Step 6 to obtain the $L_2$ convolution kernels of the second stage, $W_{l_2}^{2}$, $l_2 = 1, 2, \dots, L_2$;
Step 8, convolving the $l_1$-th first-stage feature image $I_i^{l_1}$ of the i-th original image $I_i$ with the $L_2$ second-stage convolution kernels to obtain the corresponding $L_2$ feature images $O_i^{l_1 l_2} = I_i^{l_1} * W_{l_2}^{2}$; the i-th original image thus generates $L_1 \times L_2$ feature images in the second stage, $\{ O_i^{l_1 l_2} \}$, where $l_1 = 1, 2, \dots, L_1$ and $l_2 = 1, 2, \dots, L_2$;
Step 9, performing weighted fusion of the first-stage convolution results of the i-th image sample obtained in Step 5 and the second-stage convolution results obtained in Step 8 according to the weighted-fusion formulas (given as equation images in the source).
3. The text recognition method as claimed in claim 2, characterized in that in the second step, feature extraction is performed first, followed by sliding sampling over multi-size windows, a 3×3 spatial sample block being used for sliding sampling over the convolution result carrying spatial information.
4. The text recognition method as claimed in claim 3, characterized in that in step three, the BLSTM encodes the recurrent context from two directions: the convolution features of each sliding window are cyclically and sequentially input into two LSTM networks from the two directions, updating the recurrent state $H_t$ in the internal feature hidden layer;
$H_t = \varphi(H_{t-1}, X_t)$, $t = 1, 2, \dots, W$;
where $H_t$ is the recurrent internal state, computed jointly from the current input $X_t$ and the preceding state $H_{t-1}$; $X_t \in \mathbb{R}^{3 \times 3 \times C}$ is the convolution feature of the t-th sliding sample window.
CN202110672336.2A 2021-06-17 2021-06-17 Multilayer ternary pivot and bidirectional long-short term memory fused text recognition method Active CN113408525B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110672336.2A CN113408525B (en) 2021-06-17 2021-06-17 Multilayer ternary pivot and bidirectional long-short term memory fused text recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110672336.2A CN113408525B (en) 2021-06-17 2021-06-17 Multilayer ternary pivot and bidirectional long-short term memory fused text recognition method

Publications (2)

Publication Number Publication Date
CN113408525A true CN113408525A (en) 2021-09-17
CN113408525B CN113408525B (en) 2022-08-02

Family

ID=77684814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110672336.2A Active CN113408525B (en) 2021-06-17 2021-06-17 Multilayer ternary pivot and bidirectional long-short term memory fused text recognition method

Country Status (1)

Country Link
CN (1) CN113408525B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10002301B1 (en) * 2017-09-19 2018-06-19 King Fahd University Of Petroleum And Minerals System, apparatus, and method for arabic handwriting recognition
CN108257151A (en) * 2017-12-22 2018-07-06 西安电子科技大学 PCANet image change detection methods based on significance analysis
CN108573693A (en) * 2017-03-14 2018-09-25 Google LLC Text-to-speech synthesis using an autoencoder
US20180330183A1 (en) * 2017-05-11 2018-11-15 Canon Kabushiki Kaisha Image recognition apparatus, learning apparatus, image recognition method, learning method, and storage medium
CN110110095A (en) * 2019-04-29 2019-08-09 国网上海市电力公司 A kind of power command text matching technique based on shot and long term memory Recognition with Recurrent Neural Network
CN110472539A (en) * 2019-08-01 2019-11-19 上海海事大学 A kind of Method for text detection, device and computer storage medium
CN110533041A (en) * 2019-09-05 2019-12-03 重庆邮电大学 Multiple dimensioned scene text detection method based on recurrence
CN110956171A (en) * 2019-11-06 2020-04-03 广州供电局有限公司 Automatic nameplate identification method and device, computer equipment and storage medium
US20200151503A1 (en) * 2018-11-08 2020-05-14 Adobe Inc. Training Text Recognition Systems
CN111967471A (en) * 2020-08-20 2020-11-20 华南理工大学 Scene text recognition method based on multi-scale features
CN112561035A (en) * 2020-12-08 2021-03-26 上海海事大学 Fault diagnosis method based on CNN and LSTM depth feature fusion
CN112686252A (en) * 2020-12-28 2021-04-20 中国联合网络通信集团有限公司 License plate detection method and device

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108573693A (en) * 2017-03-14 2018-09-25 Google LLC Text-to-speech synthesis using an autoencoder
US20180330183A1 (en) * 2017-05-11 2018-11-15 Canon Kabushiki Kaisha Image recognition apparatus, learning apparatus, image recognition method, learning method, and storage medium
US10002301B1 (en) * 2017-09-19 2018-06-19 King Fahd University Of Petroleum And Minerals System, apparatus, and method for arabic handwriting recognition
CN108257151A (en) * 2017-12-22 2018-07-06 西安电子科技大学 PCANet image change detection methods based on significance analysis
US20200151503A1 (en) * 2018-11-08 2020-05-14 Adobe Inc. Training Text Recognition Systems
CN110110095A (en) * 2019-04-29 2019-08-09 国网上海市电力公司 A kind of power command text matching technique based on shot and long term memory Recognition with Recurrent Neural Network
CN110472539A (en) * 2019-08-01 2019-11-19 上海海事大学 A kind of Method for text detection, device and computer storage medium
CN110533041A (en) * 2019-09-05 2019-12-03 重庆邮电大学 Multiple dimensioned scene text detection method based on recurrence
CN110956171A (en) * 2019-11-06 2020-04-03 广州供电局有限公司 Automatic nameplate identification method and device, computer equipment and storage medium
CN111967471A (en) * 2020-08-20 2020-11-20 华南理工大学 Scene text recognition method based on multi-scale features
CN112561035A (en) * 2020-12-08 2021-03-26 上海海事大学 Fault diagnosis method based on CNN and LSTM depth feature fusion
CN112686252A (en) * 2020-12-28 2021-04-20 中国联合网络通信集团有限公司 License plate detection method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
WEIMING LANG: "Wind Power Prediction Based on Principal Component Analysis and Long Short-Term Memory Networks", 2019 IEEE Innovative Smart Grid Technologies *
万萌: "Research on Deep-Learning-Based Natural Scene Text Detection and Recognition Methods" (基于深度学习的自然场景文字检测与识别方法研究), CNKI Master's Theses Database *
张博宇: "Research on Text Detection and Recognition Methods in Natural Scenes" (自然场景下的文本检测与识别方法研究), CNKI Master's Theses Database *

Also Published As

Publication number Publication date
CN113408525B (en) 2022-08-02

Similar Documents

Publication Publication Date Title
CN110334705B (en) Language identification method of scene text image combining global and local information
CN110059217B (en) Image text cross-media retrieval method for two-stage network
JP6351689B2 (en) Attention based configurable convolutional neural network (ABC-CNN) system and method for visual question answering
CN111488826B (en) Text recognition method and device, electronic equipment and storage medium
CN113657124B (en) Multi-mode Mongolian translation method based on cyclic common attention transducer
Zhang et al. Deep hierarchical guidance and regularization learning for end-to-end depth estimation
Lin et al. STAN: A sequential transformation attention-based network for scene text recognition
CN105138998B (en) Pedestrian based on the adaptive sub-space learning algorithm in visual angle recognition methods and system again
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN111553350B (en) Deep learning-based attention mechanism text recognition method
CN111881262A (en) Text emotion analysis method based on multi-channel neural network
CN109299303B (en) Hand-drawn sketch retrieval method based on deformable convolution and depth network
CN112651940B (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
CN115858847B (en) Combined query image retrieval method based on cross-modal attention reservation
CN111738169A (en) Handwriting formula recognition method based on end-to-end network model
Wang et al. Urban building extraction from high-resolution remote sensing imagery based on multi-scale recurrent conditional generative adversarial network
CN111985525A (en) Text recognition method based on multi-mode information fusion processing
CN113657115A (en) Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion
CN115497122A (en) Method, device and equipment for re-identifying blocked pedestrian and computer-storable medium
Wang et al. Recognizing handwritten mathematical expressions as LaTex sequences using a multiscale robust neural network
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
Zatout et al. Semantic scene synthesis: application to assistive systems
CN114492646A (en) Image-text matching method based on cross-modal mutual attention mechanism
CN113159053A (en) Image recognition method and device and computing equipment
CN113408525B (en) Multilayer ternary pivot and bidirectional long-short term memory fused text recognition method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant