CN111738167A - Method for recognizing unconstrained handwritten text image - Google Patents

Method for recognizing unconstrained handwritten text image

Info

Publication number
CN111738167A
CN111738167A (application CN202010589597.3A)
Authority
CN
China
Prior art keywords
text
sequence
unconstrained
character
writing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010589597.3A
Other languages
Chinese (zh)
Inventor
周度
毛慧芸
刘曼飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Zhuhai Institute of Modern Industrial Innovation of South China University of Technology
Original Assignee
South China University of Technology SCUT
Zhuhai Institute of Modern Industrial Innovation of South China University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT and Zhuhai Institute of Modern Industrial Innovation of South China University of Technology
Priority to CN202010589597.3A
Publication of CN111738167A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/32 Digital ink
    • G06V30/36 Matching; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/32 Digital ink
    • G06V30/333 Preprocessing; Feature extraction
    • G06V30/347 Sampling; Contour coding; Stroke extraction

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a method for recognizing an unconstrained handwritten text image, which comprises the following steps: S1, preprocessing the input unconstrained handwritten text to obtain preprocessed text data; S2, generating a text feature sequence on the basis of the preprocessed text data obtained in step S1; S3, on the basis of the text feature sequence obtained in step S2, extracting text features along the temporal dimension through a multilayer distilled GRU network; and S4, outputting a recognition result through a CTC transcription layer. The invention can not only effectively handle the run-on (cursive) writing problem in handwritten text characters, but also effectively handle the unconstrained spatial relations between characters, including horizontal writing, vertical writing, overlapped writing, multi-line writing, oblique writing, turning writing, and the like. Combined with a large amount of labeled unconstrained handwritten text, the invention can train a system capable of accurately recognizing unconstrained handwritten text.

Description

Method for recognizing unconstrained handwritten text image
Technical Field
The invention relates to the technical field of computer vision and deep learning, in particular to a recognition method of an unconstrained handwritten text image.
Background
Online handwritten Chinese text recognition generally refers to the technology by which a user writes Chinese characters on a handwriting input device such as a tablet, touch screen, or mouse, and a computer converts the writing trajectories acquired by the device into the corresponding machine codes of the Chinese characters. The technology is widely applied in fields such as on-screen handwriting input methods, presentation interaction, and document transmission.
In recognition technology, Chinese text differs greatly from Western languages because of its large character set and many visually similar characters. In recent years, researchers have proposed various solutions for recognizing handwritten text. Segmentation-based methods, for example, have performed well and had a large impact, but they must confront the character segmentation problem, which introduces risk into subsequent recognition. Methods that integrate a CNN and an LSTM require the network to project the sequential pen-tip trajectory into a feature map through a path-signature or eight-directional feature extraction method; this is not natural and is ill-suited to overlapped handwritten text. LSTM-based methods can naturally capture the temporal dynamics of the pen-tip trajectory and have been used for character and text recognition without any preprocessing such as feature extraction or over-segmentation. However, the above methods only address ordinary non-cursive writing and simple horizontal handwritten text, and cannot account for the unconstrained spatial relations between characters. A new method for recognizing unconstrained handwritten text images is therefore urgently needed, one that can accurately recognize writing types such as horizontal, vertical, overlapped, multi-line, oblique, and turning writing.
Disclosure of Invention
The invention aims to provide a method for recognizing an unconstrained handwritten text image, which solves the problems in the prior art and can accurately recognize text in writing types such as horizontal, vertical, and overlapped writing.
In order to achieve this purpose, the invention provides the following scheme: a method for recognizing an unconstrained handwritten text image, comprising the following steps:
S1, preprocessing the input unconstrained handwritten text to obtain preprocessed text data;
S2, generating a text feature sequence on the basis of the preprocessed text data obtained in step S1;
S3, on the basis of the text feature sequence obtained in step S2, extracting text features along the temporal dimension through a multilayer distilled GRU network;
and S4, outputting a recognition result through a CTC transcription layer.
Preferably, step S1 is as follows: the pen-tip trajectory of the handwritten text is $I_0 = \{(x_t, y_t, s_t) \mid t = 1, 2, \dots, T\}$, where T denotes the number of points in the pen-tip trajectory, t denotes the index of a point, $(x_t, y_t)$ are the horizontal and vertical coordinates of the point with index t, and $s_t$ denotes the stroke to which point t belongs. The coordinate variations between all adjacent points among the T points are then computed as $\Delta x_t = x_{t+1} - x_t$ and $\Delta y_t = y_{t+1} - y_t$, $t = 1, 2, \dots, T-1$.
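For illustration, a minimal NumPy sketch of this step, assuming the trajectory arrives as a list of (x, y, stroke_id) tuples (the function and argument names are hypothetical):

```python
import numpy as np

def preprocess(trace):
    """Compute per-step pen movements (dx_t, dy_t) from a pen-tip trajectory.

    trace: list of (x, y, stroke_id) tuples, one per sampled point.
    Returns dx, dy arrays of length T-1 plus the stroke ids.
    """
    pts = np.asarray([(x, y) for x, y, _ in trace], dtype=np.float32)
    strokes = np.asarray([s for _, _, s in trace])
    dx = pts[1:, 0] - pts[:-1, 0]   # dx_t = x_{t+1} - x_t
    dy = pts[1:, 1] - pts[:-1, 1]   # dy_t = y_{t+1} - y_t
    return dx, dy, strokes
```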
Preferably, step S2 is specifically as follows. First, the pen-tip trajectory of the handwritten text preprocessed in step S1 is screened by removing every point whose parameters satisfy either of the following conditions:

$$\Delta x_t^2 + \Delta y_t^2 < T_{dis}$$

$$\frac{\Delta x_t \,\Delta x_{t+1} + \Delta y_t \,\Delta y_{t+1}}{\sqrt{(\Delta x_t^2 + \Delta y_t^2)(\Delta x_{t+1}^2 + \Delta y_{t+1}^2)}} > T_{cos}$$

After the redundant input points are removed, feature extraction is performed to generate a four-dimensional feature sequence $I = \{(\Delta x_t, \Delta y_t, \mathbb{I}(s_t \neq s_{t+1}), \mathbb{I}(s_t = s_{t+1})) \mid t = 1, 2, \dots, T\}$, where the indicator $\mathbb{I}(\cdot) = 1$ when its argument is true and $\mathbb{I}(\cdot) = 0$ when it is false. $T_{dis}$ is a manually set threshold used to evaluate the Euclidean distance between the current point and the next point; $T_{cos}$ is a manually set threshold used to determine whether a point lies on, or nearly on, a straight-line segment of the trace.
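A sketch of this screening and feature extraction in the same vein (the threshold values, and measuring the distance against the last kept point rather than the previous raw point, are assumptions for illustration):

```python
import numpy as np

def extract_features(pts, strokes, t_dis=4.0, t_cos=0.99):
    """Drop redundant points, then build the 4-D feature sequence.

    pts: (T, 2) array of pen-tip coordinates; strokes: (T,) stroke ids;
    t_dis, t_cos: the hand-set thresholds (values here are placeholders).
    """
    keep = [0]
    for i in range(1, len(pts) - 1):
        v1 = pts[i] - pts[keep[-1]]          # incoming displacement
        v2 = pts[i + 1] - pts[i]             # outgoing displacement
        if v1 @ v1 < t_dis:                  # too close: dx^2 + dy^2 < T_dis
            continue
        denom = np.sqrt((v1 @ v1) * (v2 @ v2))
        if denom > 0 and (v1 @ v2) / denom > t_cos:   # nearly collinear point
            continue
        keep.append(i)
    keep.append(len(pts) - 1)
    p, s = pts[keep], strokes[keep]
    d = p[1:] - p[:-1]
    pen_up = (s[:-1] != s[1:]).astype(np.float32)     # indicator(s_t != s_{t+1})
    return np.stack([d[:, 0], d[:, 1], pen_up, 1.0 - pen_up], axis=1)
```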
Preferably, the multilayer distilled GRU is an ordinary GRU plus the following operations. Suppose the output of an ordinary GRU is $h = (h_1, h_2, \dots, h_T)$ with $h_t \in \mathbb{R}^D$. First, every N time nodes are grouped together, giving $h' = (h'_1, h'_2, \dots, h'_{T/N})$, where $h'_t = [h_{tN+0}; h_{tN+1}; \dots; h_{tN+(N-1)}]$ and $h'_t \in \mathbb{R}^{ND}$. Then the feature vector $h'_t \in \mathbb{R}^{ND}$ is mapped to another feature space $\tilde{h}_t \in \mathbb{R}^{D'}$. [The required mapping formula is given in the original only as an equation image.]
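A NumPy sketch of the grouping-and-projection step; since the patent's mapping formula survives only as an image, the affine map with tanh below is an assumed form:

```python
import numpy as np

def distill(h, n, w, b):
    """Group every n GRU output vectors and project the concatenation.

    h: (T, D) GRU output sequence, T divisible by n;
    w: (D_out, n*D) projection matrix, b: (D_out,) bias.
    The tanh(w @ h' + b) form is an assumption, not the patent's formula.
    """
    T, D = h.shape
    grouped = h.reshape(T // n, n * D)     # h'_t = [h_{tn}; ...; h_{tn+n-1}]
    return np.tanh(grouped @ w.T + b)      # (T//n, D_out), shorter sequence
```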
Preferably, in step S4 the transcription layer is guided by connectionist temporal classification (CTC), and no pre-alignment is required between the input and its corresponding label sequence during training. The character set is $C' = C \cup \{blank\}$, where C denotes all characters used in the task and "blank" denotes the null emission. Given an input sequence $u = (u_1, u_2, \dots, u_T)$ of length T with $u_t \in \mathbb{R}^{|C'|}$, assigning a label to each time point and concatenating the labels yields label sequences of length T, denoted π, of which there are exponentially many. The probability of each sequence is calculated using the following formula:

$$p(\pi \mid u) = \prod_{t=1}^{T} u_{\pi_t}^{t}$$

In CTC, the sequence-to-sequence mapping operation $\mathcal{B}$ maps an alignment π to its transcription L: duplicate labels are first merged, and then the "blank" parts are deleted. The total probability of a transcription is calculated by summing the probabilities of all alignments that correspond to it:

$$p(L \mid u) = \sum_{\pi : \mathcal{B}(\pi) = L} p(\pi \mid u)$$
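For small inputs these two formulas can be checked by brute force; a sketch follows (real CTC training uses the forward-backward recursion rather than this enumeration):

```python
import itertools
import numpy as np

BLANK = 0  # assumed index of the "blank" class

def collapse(path):
    """The mapping B: merge repeated labels, then drop blanks."""
    out, prev = [], None
    for p in path:
        if p != prev and p != BLANK:
            out.append(p)
        prev = p
    return tuple(out)

def total_probability(u, label):
    """p(L|u): sum p(pi|u) over all length-T alignments pi with B(pi) = L."""
    T, n_classes = u.shape    # u[t, k]: softmax probability of class k at time t
    total = 0.0
    for pi in itertools.product(range(n_classes), repeat=T):
        if collapse(pi) == tuple(label):
            total += np.prod([u[t, k] for t, k in enumerate(pi)])
    return total
```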
Meanwhile, the invention also discloses a data enhancement technique for generating a text-level data set from a character-level data set, as follows. In the character-level data, let $x^i_{\min}$ and $x^i_{\max}$ denote the minimum and maximum coordinate values of the i-th character in the X-axis direction, let $x^i_{first}$ and $x^i_{last}$ denote the X-axis coordinates of the first and last points of the i-th character, let $\Delta x_r$ denote a random bias term uniformly distributed within (-2, 13), and let $\Delta x_{line}$ denote the length of the text line in the X-axis direction; the same definitions are also made on the Y-axis. The following types of handwritten text are synthesized, each defined by a formula for the displacement $(\Delta x, \Delta y)$ between adjacent characters: horizontal, vertical, overlapped, multi-line, turning, and oblique. [The per-type displacement formulas are given in the original only as equation images.] In the operation of generating unconstrained handwritten text, the type of handwritten text to be generated and the number of characters N are first decided; then N character samples are randomly selected from the character data set; finally the selected character samples are combined, with the computed displacement (Δx, Δy) applied between adjacent characters, to generate the unconstrained handwritten text.
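A sketch of this composition step; the per-type displacement rules below (place the next character just past the previous one, offset by the random bias) are plausible stand-ins for the formulas that survive only as images:

```python
import random
import numpy as np

def synthesize_line(char_samples, kind="horizontal"):
    """Concatenate character trajectories into one synthetic text line.

    char_samples: list of (P_i, 2) coordinate arrays, one per character
    (e.g. N samples drawn at random from a character-level data set);
    kind: "horizontal", "vertical", or "overlap" (other types analogous).
    """
    out, offset = [], np.zeros(2)
    for pts in char_samples:
        shifted = pts - pts.min(axis=0) + offset     # move character to offset
        out.append(shifted)
        w, h = pts.max(axis=0) - pts.min(axis=0)     # character width and height
        dr = random.uniform(-2, 13)                  # random bias term from the patent
        if kind == "horizontal":
            offset = offset + np.array([w + dr, 0.0])
        elif kind == "vertical":
            offset = offset + np.array([0.0, h + dr])
        # "overlap": leave offset unchanged, characters stack on one spot
    return np.concatenate(out)
```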
The invention discloses the following technical effects. (1) The technical scheme of the invention attends only to the change between adjacent points, i.e., the movement of the pen tip; because the relative trajectory of the pen-tip movement is much more stable than absolute handwriting coordinates and handwriting layout, the network burden can be greatly reduced. (2) The invention can not only rapidly and effectively handle the run-on writing problem in handwritten text characters, but also effectively handle the unconstrained spatial relations between characters, including horizontal writing, vertical writing, overlapped writing, multi-line writing, oblique writing, turning writing, and the like. Combined with a large amount of labeled unconstrained handwritten text, the invention can train a system that accurately recognizes unconstrained handwritten text, and has high practical value.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description are only some embodiments of the present invention; other drawings can be obtained from them by those of ordinary skill in the art without inventive effort.
FIG. 1 is a schematic diagram of a type of unconstrained handwritten text image;
FIG. 2 is a flow chart of a method for recognition of unconstrained handwritten text images in accordance with the present invention;
FIG. 3 is a detailed flow chart of the present invention for recognizing unconstrained handwritten text;
fig. 4 is a structural diagram of a general GRU network.
FIG. 1 shows text images written in the horizontal, vertical, overlapped, oblique, turning, and multi-line styles.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures 1 to 4 are described in further detail below.
The invention provides a method for identifying an unconstrained handwritten text image, which comprises the following steps:
an original image acquisition section:
in deep learning, supervised learning requires a large amount of labeled data, specifically, in the present embodiment, various types of pictures as shown in fig. 1 and labels corresponding to the text of "southern university of china" are required. The data sets in this embodiment are from the real training data set CASIA-oldhdb 2.0-2.2 and various types of unconstrained handwritten text sets generated from the CASIA-oldhdb 1.0-1.2 character set using data enhancement techniques described below. CASIA-OLHWDB2.0-2.2 contains 4072 pages for 41710 lines of handwritten text, 1082220 characters, and all characters can be divided into 2650 classes. CASIA-OLHWDB1.0-1.2 contains 3129496 character data.
The data enhancement technique that generates a text-level data set from a character-level data set is as follows. In the character-level data, $x^i_{\min}$ and $x^i_{\max}$ denote the minimum and maximum coordinate values of the i-th character in the X-axis direction, $x^i_{first}$ and $x^i_{last}$ denote the X-axis coordinates of the first and last points of the i-th character, $\Delta x_r$ denotes a random bias term uniformly distributed within (-2, 13), and $\Delta x_{line}$ denotes the length of the text line in the X-axis direction; the same definitions are also made on the Y-axis. Using these quantities, the following types of handwritten text are synthesized, each defined by a formula for the displacement $(\Delta x, \Delta y)$ between adjacent characters: horizontal, vertical, overlapped, multi-line, turning, and oblique. [The per-type displacement formulas are given in the original only as equation images.] In the operation of generating unconstrained handwritten text, the type of handwritten text to be generated and the number of characters N are first decided; then N character samples are randomly selected from the character data set; finally the selected character samples are combined, with the computed displacement (Δx, Δy) applied between adjacent characters, to generate the unconstrained handwritten text.
The popular unconstrained handwritten text set used in the ICDAR 2013 Chinese handwriting recognition (OHCTR) competition was used to test system performance; in addition, the synthesized data set was used to evaluate the robustness of the framework. The experimental results on the ICDAR competition set, measured by accuracy, are as follows. If an ordinary multilayer gated recurrent network (GRU) is used in the feature refinement stage, the system recognition rate is 88.31% and the training time is 208 hours. After the first two GRU layers in the recognition system are replaced by the distilled gated recurrent network proposed in this patent, the training time drops from 208 hours to 95 hours, while the recognition rate, at 88.33%, remains essentially unchanged. After the synthesized text is introduced, the recognition rate increases from 88.33% to 91.36%, a large improvement. The comparison is shown in Table 1 below:
TABLE 1

Method                              Accuracy AR (%)    Training time (h)
Multilayer GRU                      88.31              208
Distilled GRU                       88.33              95
Distilled GRU + synthesized text    91.36              102
The recognition technique described above is now explained using the recognition flow of the example text "expected by the viewer" shown in FIG. 3.
pretreatment:
various kinds of deformation of handwritten characters are caused by different writing modes of people in the process of writing the characters. Especially for unconstrained handwritten texts, the types of deformation of the same character are very many due to the problems of continuous strokes, dragging strokes and large stroke thickness difference caused by different writing tools. The original image will have some distortion and noise which will also affect the recognition. The preprocessing process can overcome the influence caused by the deformation to a certain extent, fully exerts the performance of the feature extraction and the classifier, and plays an important role in improving the recognition performance.
Specifically, in this embodiment the pen-tip trajectory input $I_0$ of any unconstrained handwritten text image can be abstracted as $I_0 = \{(x_t, y_t, s_t) \mid t = 1, 2, \dots, T\}$, where t is the index of a point in the pen-tip trajectory, $(x_t, y_t)$ are the coordinates of the point with index t, and $s_t$ indicates the stroke to which point t belongs. It should be noted that a stroke here is the trace drawn between one pen-down and the following pen-up, not a stroke in the sense of Chinese character structure. The coordinate variations between all adjacent points among the T points are computed as $\Delta x_t = x_{t+1} - x_t$, $\Delta y_t = y_{t+1} - y_t$, $t = 1, 2, \dots, T-1$. In the prior art, the pen-tip trajectory of unconstrained handwritten text is likewise abstracted as $I_0 = \{(x_t, y_t, s_t) \mid t = 1, 2, \dots, T\}$, but the input features x and y cannot be normalized into a fixed interval; this places an extra burden on the network and cannot adapt to the complex sequential nature of unconstrained handwritten text. Therefore, after abstracting the pen-tip trajectory of the input handwritten text, this embodiment attends only to the change between adjacent points, i.e., the relative displacement $(\Delta x_t, \Delta y_t)$ of the pen tip. Relative pen-tip displacements are much more stable than absolute handwriting coordinates and handwriting layout; seen this way, the differently styled unconstrained texts have very similar feature patterns, the only difference between text styles being the pen-tip movement between characters, so the network burden can be greatly reduced compared with traditional methods.
Feature extraction section
A difficulty of pattern recognition is the large semantic gap between the pattern to be recognized and the patterns a computer can process. Text recognition research focuses on understanding a particular class of images on a two-dimensional plane, while a computer is essentially a one-dimensional processing machine. Feature extraction is a powerful tool for bridging this semantic gap: it resolves the two-dimensional problem into a one-dimensional one and converts the dot-matrix image representation into a pattern representation, thereby turning the understanding problem into the kind of computational problem at which von Neumann machines excel.
As shown in FIG. 3, given the online text input "expected by the viewer", the system first removes redundant points from the text. The guiding idea is that if the distance between a point and the previous point is very small, or the angle it forms with the previous two points is flat (i.e., the three points are almost on a straight line), the point is considered redundant and is removed. Specifically, in this embodiment a point is removed when its parameters satisfy either of the following conditions:

$$\Delta x_t^2 + \Delta y_t^2 < T_{dis}$$

$$\frac{\Delta x_t \,\Delta x_{t+1} + \Delta y_t \,\Delta y_{t+1}}{\sqrt{(\Delta x_t^2 + \Delta y_t^2)(\Delta x_{t+1}^2 + \Delta y_{t+1}^2)}} > T_{cos}$$

After the redundant input points are removed, feature extraction is performed to generate a four-dimensional feature sequence $I = \{(\Delta x_t, \Delta y_t, \mathbb{I}(s_t \neq s_{t+1}), \mathbb{I}(s_t = s_{t+1})) \mid t = 1, 2, \dots, T\}$, where $\mathbb{I}(\cdot) = 1$ when its argument is true and 0 when it is false. $T_{dis}$ is a manually set value for evaluating the Euclidean distance between the current point and the next point: if this distance is very small, or even zero (overlapping points), the current point is removed. $T_{cos}$ is also a manually set value, used to determine whether a point lies on or nearly on a straight line; if so, it is removed directly. Both values must be set by hand. If they are too small, the sampled points remain highly redundant and computation is slow; if too large, the sampled points become too sparse and precision is lost. The setting depends on the user's trade-off between running speed and accuracy. Finally, the coordinates of the remaining points in the text are used to generate a feature sequence whose every element is a four-dimensional vector; the "i-th stroke" labels in the figure indicate that elements are grouped by stroke (a stroke being the trace between pen-down and pen-up, not a Chinese character stroke).
After the feature sequence is generated, it is fed into the distilled GRU. Conventional recurrent neural networks (RNNs) process sequence data through a recursive mechanism, but they perform poorly on long sequences: the weight of information from far-away steps keeps shrinking until it has almost no influence on later results, while the computational burden grows and system performance suffers greatly. Gated recurrent networks solve this problem well. The GRU model is a variant of the LSTM model that merges the LSTM's input gate and forget gate into a single update gate; the detailed structure of a GRU unit is shown in FIG. 4. This makes the network cheaper to compute and easier to converge. In this system, the outputs of two GRU units are concatenated and fed into the input of one GRU unit, forming the distilled GRU network; experiments show that this effectively reduces training time without reducing the recognition rate.
Constructing a model:
the method comprises the steps of building an integral model of the system, and adopting a keras deep learning framework to construct the network of the system in order to improve the efficiency and the calculation speed of the constructed system and meet the requirement of quick response. keras is a high-level neural network API written in pure Python language, to which both tensiflow, thano and CNTK can be applied as a back-end. The system structure diagram of this embodiment is shown in fig. 3, and the method includes acquiring a time sequence input of a handwritten text, extracting a feature map through processing, then passing through a double-layer distillation GRU network including a distillation GRU layer and a normal GRU layer, and finally passing through a CTC transcription layer to obtain a result.
Model training
Once the model is designed and sufficient data is available, model training can be performed. The purpose of training is to compare the model's output on a large amount of data with the labels corresponding to that data, and thereby adjust the network parameters until the model can recognize data of the same type. In this embodiment, the pen-tip trajectory of a handwritten text is denoted $l = \{(x_t, y_t, s_t) \mid t = 1, 2, \dots, T\}$, where T denotes the number of points, $(x_t, y_t)$ the coordinates of the t-th point, and $s_t$ the stroke to which the point belongs. Given a handwritten text training data set Q with samples (l, z), where z is the label corresponding to the data, the loss function L(Q) of the model network is

$$L(Q) = -\sum_{(l,z) \in Q} \ln p(z \mid l)$$
according to the invention, a random steepest descent method SGD is taken as an optimization algorithm, and a GeForceTitan-XGPUs display card is used, so that convergence can be achieved in about 3-4 days.
We take the character set to be $C' = C \cup \{blank\}$, where C denotes all characters that may appear in the text line recognition task and "blank" denotes the empty output class (no output). Given an input sequence $u = (u_1, u_2, \dots, u_T)$ with $u_t \in \mathbb{R}^{|C'|}$, one label is assigned to each time point and all time points are combined into a label sequence; in this way a large number of label sequences of length T (denoted π) can be obtained. The probability of each label sequence is calculated as

$$p(\pi \mid u) = \prod_{t=1}^{T} u_{\pi_t}^{t}$$

In CTC, the mapping operation $\mathcal{B}$ maps a label sequence π to the sequence l, i.e., the final output recognized character sequence: consecutive duplicate characters in the label sequence are first merged (duplicates separated by "blank" are kept), and then the "blank" symbols are removed.
As shown in the figure, three label sequences π1, π2, π3 are listed for this example (with the blank symbol written as "_" for readability). The three alignments differ in where characters repeat and where blanks fall, yet after merging consecutive duplicate characters and then removing the blanks, each of them yields the same correct text sequence, "expected by the viewer".
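Using the collapse sketch given earlier, this behavior can be checked on a toy example (integer indices stand in for characters, 0 for blank):

```python
# Three alignments that all collapse to the same transcription (1, 2, 3):
paths = [
    (0, 1, 1, 0, 2, 3, 0),   # pi_1
    (1, 0, 2, 2, 2, 0, 3),   # pi_2
    (1, 1, 2, 0, 3, 3, 3),   # pi_3
]
assert all(collapse(p) == (1, 2, 3) for p in paths)
```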
Combined with a large amount of labeled unconstrained handwritten text, the invention can train a system capable of accurately recognizing unconstrained handwritten text.
In the description of the present invention, it is to be understood that the terms "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, are merely for convenience of description of the present invention, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and thus, are not to be construed as limiting the present invention.
The above-described embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements of the technical solutions of the present invention can be made by those skilled in the art without departing from the spirit of the present invention, and the technical solutions of the present invention are within the scope of the present invention defined by the claims.

Claims (6)

1. A method for recognizing an unconstrained handwritten text image, comprising the following steps:
S1, preprocessing the input unconstrained handwritten text to obtain preprocessed text data;
S2, generating a text feature sequence on the basis of the preprocessed text data obtained in step S1;
S3, on the basis of the text feature sequence obtained in step S2, extracting text features along the temporal dimension through a multilayer distilled GRU network;
and S4, outputting a recognition result through a CTC transcription layer.
2. The method for recognizing an unconstrained handwritten text image as claimed in claim 1, wherein step S1 is as follows: the pen-tip trajectory of the handwritten text is $I_0 = \{(x_t, y_t, s_t) \mid t = 1, 2, \dots, T\}$, where T denotes the number of points in the pen-tip trajectory, t denotes the index of a point, $(x_t, y_t)$ are the horizontal and vertical coordinates of the point with index t, and $s_t$ denotes the stroke to which point t belongs; the coordinate variations between all adjacent points among the T points are computed as $\Delta x_t = x_{t+1} - x_t$ and $\Delta y_t = y_{t+1} - y_t$, $t = 1, 2, \dots, T-1$.
3. The method for recognizing an unconstrained handwritten text image as claimed in claim 1, wherein step S2 is as follows: first, the pen-tip trajectory of the handwritten text preprocessed in step S1 is screened by removing every point whose parameters satisfy either of the following conditions:

$$\Delta x_t^2 + \Delta y_t^2 < T_{dis}$$

$$\frac{\Delta x_t \,\Delta x_{t+1} + \Delta y_t \,\Delta y_{t+1}}{\sqrt{(\Delta x_t^2 + \Delta y_t^2)(\Delta x_{t+1}^2 + \Delta y_{t+1}^2)}} > T_{cos}$$

after the redundant input points are removed, feature extraction is performed to generate a four-dimensional feature sequence $I = \{(\Delta x_t, \Delta y_t, \mathbb{I}(s_t \neq s_{t+1}), \mathbb{I}(s_t = s_{t+1})) \mid t = 1, 2, \dots, T\}$, where the indicator $\mathbb{I}(\cdot) = 1$ when its argument is true and $\mathbb{I}(\cdot) = 0$ when it is false; $T_{dis}$ is a manually set value used to evaluate the Euclidean distance between the current point and the next point, and $T_{cos}$ is a manually set value used to determine whether a point lies on, or nearly on, a straight-line segment of the writing.
4. The method for recognizing an unconstrained handwritten text image as claimed in claim 1, wherein the multilayer distilled GRU is an ordinary GRU plus the following operations: suppose the output of an ordinary GRU is $h = (h_1, h_2, \dots, h_T)$ with $h_t \in \mathbb{R}^D$; first, every N time nodes are grouped together, giving $h' = (h'_1, h'_2, \dots, h'_{T/N})$, where $h'_t = [h_{tN+0}; h_{tN+1}; \dots; h_{tN+(N-1)}]$ and $h'_t \in \mathbb{R}^{ND}$; then the feature vector $h'_t \in \mathbb{R}^{ND}$ is mapped to another feature space $\tilde{h}_t \in \mathbb{R}^{D'}$. [The required mapping formula is given in the original only as an equation image.]
5. The method as claimed in claim 1, wherein in step S4 the transcription layer is guided by connectionist temporal classification (CTC), and no pre-alignment is required between the input and its corresponding label sequence during training; the character set is $C' = C \cup \{blank\}$, where C denotes all characters used in the character set and "blank" denotes the null emission; given an input sequence $u = (u_1, u_2, \dots, u_T)$ of length T with $u_t \in \mathbb{R}^{|C'|}$, assigning a label to each time point and concatenating the labels yields label sequences of length T, denoted π, of which there are exponentially many; the probability of each sequence is calculated using the following formula:

$$p(\pi \mid u) = \prod_{t=1}^{T} u_{\pi_t}^{t}$$

in CTC, the sequence-to-sequence mapping operation $\mathcal{B}$ maps an alignment π to its transcription L by first merging duplicate labels and then deleting the "blank" parts; the total probability of a transcription is calculated by summing the probabilities of all alignments that correspond to it:

$$p(L \mid u) = \sum_{\pi : \mathcal{B}(\pi) = L} p(\pi \mid u)$$
6. A data enhancement technique for generating a text-level data set from a character-level data set, wherein in the character-level data $x^i_{\min}$ and $x^i_{\max}$ denote the minimum and maximum coordinate values of the i-th character in the X-axis direction, $x^i_{first}$ and $x^i_{last}$ denote the X-axis coordinates of the first and last points of the i-th character, $\Delta x_r$ denotes a random bias term uniformly distributed within (-2, 13), and $\Delta x_{line}$ denotes the length of the text line in the X-axis direction, the same definitions also being made on the Y-axis; the following types of handwritten text are synthesized, each defined by a formula for the displacement $(\Delta x, \Delta y)$ between adjacent characters: horizontal, vertical, overlapped, multi-line, turning, and oblique [the per-type displacement formulas are given in the original only as equation images]; in the operation of generating unconstrained handwritten text, the type of handwritten text to be generated and the number of characters N are first decided, then N character samples are randomly selected from the character data set, and finally the selected character samples are combined, with the computed displacement (Δx, Δy) applied between adjacent characters, to generate the unconstrained handwritten text.
CN202010589597.3A (priority date 2020-06-24, filed 2020-06-24): Method for recognizing unconstrained handwritten text image. Pending. Published as CN111738167A (en).

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010589597.3A CN111738167A (en) 2020-06-24 2020-06-24 Method for recognizing unconstrained handwritten text image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010589597.3A CN111738167A (en) 2020-06-24 2020-06-24 Method for recognizing unconstrained handwritten text image

Publications (1)

Publication Number Publication Date
CN111738167A 2020-10-02

Family

ID=72651026

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010589597.3A Pending CN111738167A (en) 2020-06-24 2020-06-24 Method for recognizing unconstrained handwritten text image

Country Status (1)

Country Link
CN (1) CN111738167A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105893968A (en) * 2016-03-31 2016-08-24 华南理工大学 Text-independent end-to-end handwriting recognition method based on deep learning
CN106570456A (en) * 2016-10-13 2017-04-19 华南理工大学 Handwritten Chinese character recognition method based on full-convolution recursive network
CN108154136A (en) * 2018-01-15 2018-06-12 众安信息技术服务有限公司 For identifying the method, apparatus of writing and computer-readable medium
US20200143191A1 (en) * 2018-11-02 2020-05-07 Iflytek Co., Ltd. Method, apparatus and storage medium for recognizing character

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘曼飞 (LIU Manfei): "Online handwritten Chinese character analysis and recognition based on deep learning" (基于深度学习的联机手写汉字分析与识别), China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408387A (en) * 2021-06-10 2021-09-17 中金金融认证中心有限公司 Method for generating handwritten text data for complex writing scene and computer product
CN114529910A (en) * 2022-01-27 2022-05-24 北京鼎事兴教育咨询有限公司 Handwritten character recognition method and device, storage medium and electronic equipment
CN114241495A (en) * 2022-02-28 2022-03-25 天津大学 Data enhancement method for offline handwritten text recognition
CN114241495B (en) * 2022-02-28 2022-05-03 天津大学 Data enhancement method for off-line handwritten text recognition


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20201002)