CN113723387A

CN113723387A - Chinese ancient book non-standard font recognition system based on deep learning

Info

Publication number: CN113723387A
Application number: CN202110771999.XA
Authority: CN
Inventors: 叶鸿; 袁晓宇; 杲靖; 潘群; 陆志高; 赵若凡; 王嘉成
Original assignee: Changzhou Institute of Technology
Current assignee: Changzhou Institute of Technology
Priority date: 2021-07-08
Filing date: 2021-07-08
Publication date: 2021-11-30

Abstract

The invention discloses a deep-learning-based system for recognizing non-standard fonts in Chinese ancient books, which comprises the following steps: Step1, preprocessing: binarization, edge cropping, and center-of-gravity and character line-density equalization are first performed on the document image; Step2, character segmentation: characters are segmented using horizontal or vertical projection; Step3, feature extraction and dimensionality reduction: feature vectors are extracted using WDCH, their dimensionality is reduced using the LDA method, and a convolutional layer is used as a feature extractor; Step4, deep learning; Step5, prototype construction. The system adopts a newly collected data set of non-standard fonts from ancient books and uses a deep neural network for transfer learning; based on the resulting recognition model, a prototype system is implemented using Web development technology. The system improves the accuracy and efficiency of recognizing non-standard fonts in ancient books and improves the interactivity of the recognition system.

Description

Chinese ancient book non-standard font recognition system based on deep learning

Technical Field

The invention relates to the field of Chinese character recognition systems, in particular to a Chinese ancient book non-standard font recognition system based on deep learning.

Background

China has more than 50 million volumes of ancient books. Because of their age they have documentary value as historical records, their contents are academic sources of important reference value, and block-printed editions have artistic value owing to their exquisite printing and illustrations. However, over time and under the influence of human activities, this cultural heritage is constantly being eroded and destroyed. Ancient book digitization refers to processing ancient books with modern information technology to convert them into electronic data. When preserving ancient documents, libraries typically store them as digital images. Beyond preservation, however, making full use of ancient document data requires converting the characters in the digital images into text data. Ancient book recognition is the general term for this process, and refers in particular to computer recognition of ancient document images, that is, converting a digital photograph of an ancient document into text that a computer can edit and retrieve. At present, only a small part of ancient documents have been recognized, through manual annotation by experts, and full-text digitization of ancient documents would require a large amount of time and cost. Directly applying Optical Character Recognition (OCR) technology to ancient documents does not give satisfactory results for several reasons: the large number of Chinese character classes, the many variant forms, the many layout styles, and defects in the pages. In essence, recognizing digital photographs of ancient books is an image recognition problem, and machine learning methods are expected to improve recognition efficiency and accelerate the full-text digitization of ancient books, which helps to spread the long-standing and splendid Chinese civilization to the world.

The technology for digitizing ancient books as images is mature. This is mainly reflected in a group of the most representative large collections, such as the Siku Quanshu and the Sibu Congkan, for which image scanning has been successfully completed. However, for Chinese ancient books, whose main content is text, image scanning is in many cases only a preprocessing step for text digitization. The currently popular ancient book digitization process and technology are summarized in FIG. 1 of the specification; automatic conversion from image to text is usually achieved through OCR technology. For example, the Siku Quanshu, seven hundred million characters compiled in a uniform format, was recognized, converted and collated by OCR, with good practical results. The OCR system accumulates image-text correspondences during recognition, which is itself a model training process. The recognition results are checked and annotated, then submitted to the ancient book full-text database and stored in Unicode encoding.

Although OCR technology has greatly improved ancient book recognition efficiency compared with manual text entry, practice has shown that OCR systems do not work well on ancient books other than officially compiled fine editions. The reason is that most ancient books, especially those containing handwritten Chinese characters, suffer from many character classes, many variant forms, many layout styles, defective pages and the like. Ancient book recognition methods based on handwritten Chinese character recognition are expected to solve these problems.

Traditional Chinese character recognition mainly relies on feature vector extraction, dimensionality reduction and a classifier. Feature vector extraction aims at extracting information: Kimura et al. improved the recognition of Chinese characters in Japanese text by using the Weighted Direction Code Histogram (WDCH); Ojala et al. proposed a theoretically very simple but effective multi-resolution method for grayscale and rotation invariant texture classification based on Local Binary Patterns (LBP) and nonparametric discrimination of sample and prototype distributions; the Histogram of Oriented Gradients (HOG) method proposed by Dalal and Triggs is also a feature-vector-based recognition method. Dimensionality reduction transforms a feature vector in a high-dimensional space into one in a low-dimensional space while retaining, as far as possible, the information useful for classification. The feature vectors are then input into a classifier, which must be trained in a supervised manner before it can classify feature vectors of unknown class. Common classifiers include Linear Discriminant Analysis (LDA), the Modified Quadratic Discriminant Function (MQDF), the Support Vector Machine (SVM), the Hidden Markov Model (HMM), and connectionist artificial neural networks, especially the Deep Neural Network (DNN). The DNN is a form of deep learning, which is founded on distributed representations in machine learning; other deep learning architectures include Deep Belief Networks (DBNs) and Convolutional Neural Networks (CNNs). Research on recognizing Chinese characters using deep learning has already been conducted. However, several problems remain to be solved before deep-learning-based handwritten Chinese character recognition can be applied to ancient book recognition. First, a high-quality data set for supervised learning is lacking, and building one requires a large amount of manual labor. Second, owing to differences in publication date, region and printing quality, the character forms in ancient books differ greatly and vary in style compared with standard typefaces. In addition, missing or extraneous strokes caused by poor printing quality or poor preservation make recognition more difficult.

In summary, no integrated solution has yet been developed for the informatization of Chinese ancient books, and the recognition of non-standard character forms in Chinese ancient books has not been improved.

Disclosure of Invention

In view of the problems described in the Background, the object of the present invention is to provide a deep-learning-based system for recognizing non-standard fonts in Chinese ancient books that solves those problems.

The technical purpose of the invention is realized by the following technical scheme:

The deep-learning-based system for recognizing non-standard fonts in Chinese ancient books comprises the following steps:

Step1, preprocessing: binarization, edge cropping, and center-of-gravity and character line-density equalization are first performed on the document image;

Step2, character segmentation: characters are segmented using horizontal or vertical projection;

Step3, feature extraction and dimensionality reduction: feature vectors are extracted using WDCH, their dimensionality is reduced using the LDA method, and a convolutional layer is used as a feature extractor;

Step4, deep learning;

Step5, prototype construction.

Preferably, in Step1 each image is first corrected so that any lines and borders present are as close as possible to perfectly horizontal or vertical; a projection-based algorithm is then applied to the transformed image to identify continuous horizontal and vertical line segments long enough that they are unlikely to be character components, and the identified lines are erased from the image.

Preferably, Step2 adopts column segmentation as the first step of character segmentation; columns are again identified using a projection-based algorithm together with simple assumptions about reasonable column widths, and are separated at the blank gaps left after line removal; a similar projection step is then applied to each column to identify individual characters, taking into account constraints on character size.

Preferably, in Step3 experiments are performed with different sets of convolutional network configurations, and different combinations of convolutional and pooling layers are set up to test for the optimal architecture of the recognition model; a network is first fully trained on a given task; after training stops, the network with the minimum error on the validation data set is retained, the number of neurons in the output layer is changed to match the number of classes in the new classification task, the output-layer weights are randomly reinitialized, and the weights of the remaining layers are left unmodified; the network pre-trained on the source domain is then retrained on the target domain; the error rate of the network with pre-trained weights is compared with that of a randomly initialized network.

Preferably, Step4 performs recognition using a convolutional neural network.

Preferably, the system further comprises a prototype construction step; the prototype is constructed following the general flow of a Web application, and the system comprises a front end and a back end, implemented with popular frameworks for the JavaScript language and for Python and Node.js respectively, following the general MVC pattern of each language.

Preferably, Step4 is trained by transfer learning using a handwritten Chinese character data set and the CASIA data set; the DNN architecture used for transfer learning comprises convolutional layers, max-pooling layers and fully connected layers; each convolutional layer performs a 2D convolution of its M^(n-1) input maps, and the activations of the resulting M^n output maps are given by the sum of the M^(n-1) convolution responses passed through a nonlinear activation function, which has the form:

wherein n is the layer index; Y is a map of size Mx × My; Wij is a filter of size Kx × Ky connecting input map i with output map j; and * denotes the convolution operation;

the output of a pooling layer is given by the maximum activation over non-overlapping rectangular regions of size Kx × Ky, and the input image is downsampled by factors of Kx and Ky along the respective directions.

The kernel sizes of the convolution filters and the max-pooling rectangles are chosen so that either the output maps of the last convolutional layer are downsampled to 1 pixel per map, or a fully connected layer combines the outputs of the last convolutional layer into a 1D feature vector; the last layer is always a fully connected layer with one output unit per class in the recognition task; a Softmax activation function is used in the last layer, so that the output activation of each neuron can be interpreted as the probability that a particular input image belongs to that class;

the source-domain DNN and the target-domain DNN used in transfer learning are trained with common algorithms such as gradient descent; training is stopped when the validation error becomes zero, when the learning rate reaches its predefined minimum, or when the recognition rate on the validation set has not improved for a set number of epochs.

Preferably, the target domain is connected with a recognition and proofreading module, the recognition and proofreading module comprises result evaluation and manual verification, the recognition and proofreading module is connected with a database engine, the database engine comprises an image database, a dictionary database and a full-text database, and the database engine is connected with a front end.

In summary, the invention mainly has the following beneficial effects:

first, the system adopts a newly collected data set of non-standard fonts from ancient books and uses a deep neural network for transfer learning;

secondly, the system is based on an ancient book non-standard font recognition model, and a prototype system is realized by using a Web development technology;

thirdly, the system improves the recognition accuracy and recognition efficiency of the non-standard character types of the ancient books and improves the operation interactivity of the recognition system.

Drawings

FIG. 1 is a system block diagram of the present system.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1

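Referring to FIG. 1, the deep-learning-based system for recognizing non-standard fonts in Chinese ancient books comprises the following steps:

Step1, preprocessing: binarization, edge cropping, and center-of-gravity and character line-density equalization are first performed on the document image;

Step2, character segmentation: characters are segmented using horizontal or vertical projection;

Step3, feature extraction and dimensionality reduction: feature vectors are extracted using WDCH, their dimensionality is reduced using the LDA method, and a convolutional layer is used as a feature extractor;

Step4, deep learning;

Step5, prototype construction.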

In Step1, each image is first corrected so that any lines and borders present are as close as possible to perfectly horizontal or vertical; a projection-based algorithm is then applied to the transformed image to identify continuous horizontal and vertical line segments long enough that they are unlikely to be character components, and the identified lines are erased from the image.

Step2 adopts column segmentation as the first step of character segmentation; columns are again identified using a projection-based algorithm together with simple assumptions about reasonable column widths, and are separated at the blank gaps left after line removal; a similar projection step is then applied to each column to identify individual characters, taking into account constraints on character size.

In Step3, experiments are performed with different sets of convolutional network configurations, and different combinations of convolutional and pooling layers are set up to test for the optimal architecture of the recognition model; a network is first fully trained on a given task; after training stops, the network with the minimum error on the validation data set is retained, the number of neurons in the output layer is changed to match the number of classes in the new classification task, the output-layer weights are randomly reinitialized, and the weights of the remaining layers are left unmodified; the network pre-trained on the source domain is then retrained on the target domain; the error rate of the network with pre-trained weights is compared with that of a randomly initialized network.

Step4 performs recognition using a convolutional neural network.

The system comprises a front end and a back end, implemented with popular frameworks for the JavaScript language and for Python and Node.js respectively, following the general MVC pattern of each language.

In Step4, training is performed by transfer learning using a handwritten Chinese character data set and the CASIA data set; the DNN architecture used for transfer learning comprises convolutional layers, max-pooling layers and fully connected layers; each convolutional layer performs a 2D convolution of its M^(n-1) input maps, and the activations of the resulting M^n output maps are given by the sum of the M^(n-1) convolution responses passed through a nonlinear activation function, which has the following form:

The target domain is connected with a recognition and proofreading module, which comprises result evaluation and manual verification; the recognition and proofreading module is connected with a database engine, which comprises an image database, a dictionary database and a full-text database; and the database engine is connected with the front end.

The system adopts a newly acquired ancient book non-standard font data set and uses a deep neural network for transfer learning; the system is based on an ancient book non-standard font recognition model, and a prototype system is realized by using a Web development technology; the system improves the recognition accuracy and recognition efficiency of the non-standard character types of the ancient books and improves the operation interactivity of the recognition system.

Example 2

Referring to fig. 1, the technical scheme of the Chinese ancient book non-standard font recognition system based on deep learning includes four aspects of character segmentation, feature extraction, deep learning and prototype construction.

The document image first needs preprocessing such as binarization, after which character segmentation is carried out to extract each Chinese character. Ancient Chinese texts follow a relatively fixed layout: the text is organized into columns, each column is read from top to bottom, and the columns are read from right to left. The division between columns is typically marked by vertical lines drawn on the page, and one or more solid borders are drawn around the area of the page used for writing. Furthermore, since many of the "pages" considered here were originally halves of wider sheets printed together, a column of information (textual or not, possibly giving the volume, book, page and so on) is often found at the far left or right of each page; it is not part of the main text flow and should be ignored when reading the text itself. This knowledge of the typical page layout can be used in image preprocessing to simplify the process significantly. To perform this step, each image is first corrected so that any lines and borders present are as close as possible to perfectly horizontal or vertical. A projection-based algorithm is then applied to the transformed image to identify continuous horizontal and vertical line segments long enough that they are unlikely to be character components, and the identified lines are erased from the image.
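The line-removal step described above can be sketched as follows. This is a minimal illustration, not the implementation used by the system: it assumes the page has already been binarized and deskewed, represents it as a list of rows of 0/1 values (1 = ink), and uses the row and column projection profiles only to skip lines that cannot contain a long run. The function names and the `min_len` threshold are hypothetical.

```python
def long_runs(values, min_len):
    """Return (start, end) spans of consecutive 1s at least min_len long."""
    spans, start = [], None
    for i, v in enumerate(list(values) + [0]):  # trailing 0 closes an open run
        if v and start is None:
            start = i
        elif not v and start is not None:
            if i - start >= min_len:
                spans.append((start, i))
            start = None
    return spans


def remove_ruling_lines(img, min_len):
    """Erase horizontal and vertical ink runs of length >= min_len.

    img is a binarized, deskewed page: a list of rows of 0/1 (1 = ink).
    Runs are always detected on the input, so a pixel where two ruling
    lines cross is erased once and does not break either run.
    """
    out = [row[:] for row in img]
    height, width = len(img), len(img[0])
    # Projection profiles: total ink per row and per column.
    row_profile = [sum(row) for row in img]
    col_profile = [sum(img[y][x] for y in range(height)) for x in range(width)]

    for y in range(height):
        if row_profile[y] < min_len:      # too little ink for a long run
            continue
        for x0, x1 in long_runs(img[y], min_len):
            for x in range(x0, x1):
                out[y][x] = 0

    for x in range(width):
        if col_profile[x] < min_len:
            continue
        column = [img[y][x] for y in range(height)]
        for y0, y1 in long_runs(column, min_len):
            for y in range(y0, y1):
                out[y][x] = 0
    return out
```

In practice `min_len` would be chosen relative to the expected character size, so that ordinary strokes fall below the threshold while borders and column rules exceed it.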

Properties that would greatly simplify character segmentation, such as characters being aligned in both the horizontal and vertical directions or having a roughly constant aspect ratio, cannot be assumed: character size ratios can vary widely, and grid-based layouts, while common, are less common than multi-column layouts. Column segmentation is therefore the first step of character segmentation. Starting from the preprocessed image, columns are again identified using a projection-based algorithm together with simple assumptions about reasonable column widths, and are separated at the blank gaps left after line removal. A similar projection step is then applied to each column to identify individual characters, taking into account constraints on character size.
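A minimal sketch of this two-stage projection segmentation is given below. It assumes a binarized page with ruling lines already removed, stored as a list of rows of 0/1 values. The names `projection`, `split_on_blanks` and `segment_page` and the two size parameters are illustrative assumptions; the size constraints are reduced here to simple minimum widths and heights.

```python
def projection(img, axis):
    """Ink count per column (axis=0) or per row (axis=1) of a 0/1 image."""
    if axis == 0:
        return [sum(row[x] for row in img) for x in range(len(img[0]))]
    return [sum(row) for row in img]


def split_on_blanks(profile, min_size=1):
    """Return (start, end) spans where the profile is non-zero.

    Spans shorter than min_size are discarded as noise.
    """
    spans, start = [], None
    for i, v in enumerate(profile + [0]):  # trailing 0 closes an open span
        if v > 0 and start is None:
            start = i
        elif v == 0 and start is not None:
            if i - start >= min_size:
                spans.append((start, i))
            start = None
    return spans


def segment_page(img, min_col_width=1, min_char_height=1):
    """Return character boxes (x0, y0, x1, y1) in reading order."""
    boxes = []
    columns = split_on_blanks(projection(img, 0), min_col_width)
    for x0, x1 in reversed(columns):       # columns are read right to left
        strip = [row[x0:x1] for row in img]
        rows = split_on_blanks(projection(strip, 1), min_char_height)
        for y0, y1 in rows:                # characters are read top to bottom
            boxes.append((x0, y0, x1, y1))
    return boxes
```

A character with an internal horizontal gap would be split into two boxes by this sketch; a fuller implementation would use the character-size constraints mentioned above to merge such fragments.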

To extract the essential features of a character image effectively, feature extraction is generally performed on Chinese characters before classification and recognition, rather than recognizing them directly from pixels. Common features fall into structural features and statistical features. Structural features target individual parts of a character, resemble the way a person writes the character, and reflect its structure well, but they are computationally expensive and sensitive to interference. Statistical features overcome this drawback by being extracted from the character as a whole. Feature extraction is usually combined with dimensionality reduction, for example by extracting feature vectors with WDCH and reducing their dimensionality with LDA. A convolutional layer can also serve as a feature extractor.

Because only recognition of preprocessed character images is involved, rather than recognition of characters as they are being written, a convolutional neural network is mainly adopted. To explore the performance of deep convolutional neural networks in classifying calligraphic Chinese characters, experiments are carried out with different sets of convolutional network configurations. Different combinations of convolutional and pooling layers are set up to test for the optimal architecture of the recognition model, and fully connected classifier layers with different numbers of classes are investigated, such as the 3755 classes of GB2312-80, the 27533 classes of GB18030-2000, and even the 70244 classes of GB18030-2005. A network is first fully trained on a given task. After training stops, the network with the minimum error on the validation data set is retained, and the number of neurons in its output layer is changed to match the number of classes in the new classification task. The output-layer weights are randomly reinitialized and the weights of the remaining layers are left unmodified. The network pre-trained on the source domain is then retrained on the target domain. To understand how training additional layers affects performance, only the last layer is trained first, then the last two layers are trained. Since max-pooling layers have no weights, they are neither trained nor retrained. To evaluate transfer learning performance, the error rate of the network with pre-trained weights is compared with that of a randomly initialized network.
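The bookkeeping behind this transfer procedure can be sketched without any deep learning library. The example below is an illustration under stated assumptions, not the system's code: a network is modeled as a list of layer dictionaries, weighted layers carry a "weights" matrix with one row per output unit, max-pooling layers carry none, and the initialization range of ±0.05 is an arbitrary choice. The function name `prepare_for_transfer` is hypothetical.

```python
import copy
import random


def prepare_for_transfer(source_net, n_new_classes, n_retrained_layers, seed=0):
    """Copy a trained net, replace its output layer, and mark what to retrain.

    source_net: list of layer dicts; weighted layers have a "weights"
    matrix (list of rows, one per output unit), pooling layers have none.
    n_retrained_layers: how many of the last weighted layers are retrained.
    """
    if n_retrained_layers < 1:
        raise ValueError("at least the output layer must be retrained")
    rng = random.Random(seed)
    net = copy.deepcopy(source_net)        # leave the source network intact

    # Resize the output layer to the new class count and reinitialize it.
    out = net[-1]
    n_inputs = len(out["weights"][0])
    out["weights"] = [[rng.uniform(-0.05, 0.05) for _ in range(n_inputs)]
                      for _ in range(n_new_classes)]

    # Freeze everything, then unfreeze the last n weighted layers.
    # Pooling layers have no weights and are not counted.
    for layer in net:
        layer["trainable"] = False
    weighted = [layer for layer in net if "weights" in layer]
    for layer in weighted[-n_retrained_layers:]:
        layer["trainable"] = True
    return net
```

Calling the function with `n_retrained_layers=1` and then `2` corresponds to retraining only the last layer and then the last two layers, as described above.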

Finally, from the perspective of software design, a prototype that meets the basic functional requirements of the final product is constructed, which better demonstrates the intelligent interactivity of the ancient book recognition system. The prototype is constructed following the general flow of a Web application: the system comprises a front end and a back end, implemented with popular frameworks for the JavaScript language and for Python and Node.js respectively, following the general MVC pattern of each language.

Transfer learning training uses a handwritten Chinese character data set and the CASIA data set of the Chinese Academy of Sciences. The DNN architecture mainly comprises convolutional layers, max-pooling layers and fully connected layers. Each convolutional layer performs a 2D convolution of its M^(n-1) input maps, and the activations of the resulting M^n output maps are given by the sum of the M^(n-1) convolution responses passed through a nonlinear activation function, which has the following form.

Wherein n is the layer index; Y is a map of size Mx × My; Wij is a filter of size Kx × Ky connecting input map i with output map j; and * denotes the convolution operation.
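The expression described in the two preceding paragraphs can be written out as follows. This is a reconstruction from the stated definitions rather than a copy of the original formula: f stands for the nonlinear activation function, j is assumed to range over the M^n output maps, and a per-map bias term, which such layers often include, is left out because the definitions do not mention one.

```latex
Y_j^{n} = f\!\left( \sum_{i=1}^{M^{n-1}} Y_i^{n-1} * W_{ij}^{n} \right),
\qquad j = 1, \dots, M^{n}
```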

The output of a pooling layer is given by the maximum activation over non-overlapping rectangular regions of size Kx × Ky. Max pooling gives slight position invariance over larger local regions and downsamples the input image by factors of Kx and Ky along the respective directions.
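As a small illustration of non-overlapping max pooling, the sketch below operates on a single 2D map stored as a list of rows. The function name and the choice to drop trailing rows or columns that do not fill a whole Kx × Ky block are assumptions made for the example.

```python
def max_pool(feature_map, kx, ky):
    """Max over non-overlapping blocks kx wide and ky high.

    feature_map is a 2D map given as a list of rows. Trailing rows or
    columns that do not fill a whole block are dropped.
    """
    rows = len(feature_map) // ky
    cols = len(feature_map[0]) // kx
    return [[max(feature_map[r * ky + dy][c * kx + dx]
                 for dy in range(ky) for dx in range(kx))
             for c in range(cols)]
            for r in range(rows)]
```

For a 4 × 4 input with Kx = Ky = 2 the result is a 2 × 2 map, which is the downsampling by Kx and Ky described above.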

The kernel sizes of the convolution filters and the max-pooling rectangles are chosen so that either the output maps of the last convolutional layer are downsampled to 1 pixel per map, or a fully connected layer combines the outputs of the last convolutional layer into a 1D feature vector. The last layer is always a fully connected layer with one output unit per class in the recognition task. A Softmax activation function is used in the last layer, so that the output activation of each neuron can be interpreted as the probability that a particular input image belongs to that class.
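The Softmax step can be illustrated in a few lines. This sketch takes the raw scores of the output units as a plain list; subtracting the maximum before exponentiating is a standard numerical-stability measure and is not something the text above specifies.

```python
import math


def softmax(scores):
    """Map raw output-unit scores to probabilities that sum to 1."""
    peak = max(scores)                      # shift scores to avoid overflow
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

The index of the largest probability is then taken as the predicted character class.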

The source-domain DNN and the target-domain DNN used in transfer learning are trained with common algorithms such as gradient descent. Training can be stopped when the validation error becomes zero, when the learning rate reaches its predefined minimum, or when the recognition rate on the validation set has not improved for a set number of epochs.
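The three stopping conditions can be combined into a single check, sketched below. The function name, the argument names and the use of validation error (rather than recognition rate) as the tracked quantity are assumptions made for this example; `patience` stands for the set number of epochs without improvement.

```python
def should_stop(val_errors, learning_rate, min_learning_rate, patience):
    """Return True if any of the three stopping conditions holds.

    val_errors: validation error after each completed epoch, oldest first.
    patience: number of epochs allowed without a new best validation error.
    """
    if not val_errors:
        return False                        # nothing has been evaluated yet
    if val_errors[-1] == 0:
        return True                         # validation error reached zero
    if learning_rate <= min_learning_rate:
        return True                         # learning rate hit its minimum
    # Epochs elapsed since the best validation error was first reached.
    best_epoch = val_errors.index(min(val_errors))
    return len(val_errors) - 1 - best_epoch >= patience
```

Such a check would be called once per epoch inside the training loop, after the validation set has been evaluated.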

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

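1. A deep-learning-based system for recognizing non-standard fonts in Chinese ancient books, characterized in that it comprises the following steps:

Step1, preprocessing: binarization, edge cropping, and center-of-gravity and character line-density equalization are first performed on the document image;

Step2, character segmentation: characters are segmented using horizontal or vertical projection;

Step3, feature extraction and dimensionality reduction: feature vectors are extracted using WDCH, their dimensionality is reduced using the LDA method, and a convolutional layer is used as a feature extractor;

Step4, deep learning;

Step5, prototype construction.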

2. The deep-learning-based system for recognizing non-standard fonts in Chinese ancient books according to claim 1, wherein: in Step1 each image is first corrected so that any lines and borders present are as close as possible to perfectly horizontal or vertical; a projection-based algorithm is then applied to the transformed image to identify continuous horizontal and vertical line segments long enough that they are unlikely to be character components, and the identified lines are erased from the image.

3. The deep-learning-based system for recognizing non-standard fonts in Chinese ancient books according to claim 1, wherein: in Step2, column segmentation is adopted as the first step of character segmentation; columns are again identified using a projection-based algorithm together with simple assumptions about reasonable column widths, and are separated at the blank gaps left after line removal; a similar projection step is then applied to each column to identify individual characters, taking into account constraints on character size.

4. The deep-learning-based system for recognizing non-standard fonts in Chinese ancient books according to claim 1, wherein: in Step3, experiments are performed with different sets of convolutional network configurations, and different combinations of convolutional and pooling layers are set up to test for the optimal architecture of the recognition model; a network is first fully trained on a given task; after training stops, the network with the minimum error on the validation data set is retained, the number of neurons in the output layer is changed to match the number of classes in the new classification task, the output-layer weights are randomly reinitialized, and the weights of the remaining layers are left unmodified; the network pre-trained on the source domain is then retrained on the target domain; the error rate of the network with pre-trained weights is compared with that of a randomly initialized network.

5. The deep-learning-based system for recognizing non-standard fonts in Chinese ancient books according to claim 1, wherein: Step4 performs recognition using a convolutional neural network.

6. The deep-learning-based system for recognizing non-standard fonts in Chinese ancient books according to claim 1, wherein: the system comprises a front end and a back end, implemented with popular frameworks for the JavaScript language and for Python and Node.js respectively, following the general MVC pattern of each language.

7. The deep-learning-based system for recognizing non-standard fonts in Chinese ancient books according to claim 1, wherein: in Step4, training is performed by transfer learning using a handwritten Chinese character data set and the CASIA data set; the DNN architecture used for transfer learning comprises convolutional layers, max-pooling layers and fully connected layers; each convolutional layer performs a 2D convolution of its M^(n-1) input maps, and the activations of the resulting M^n output maps are given by the sum of the M^(n-1) convolution responses passed through a nonlinear activation function, the activation function having the form:

8. The deep-learning-based system for recognizing non-standard fonts in Chinese ancient books according to claim 7, wherein: the target domain is connected with a recognition and proofreading module, which comprises result evaluation and manual verification; the recognition and proofreading module is connected with a database engine, which comprises an image database, a dictionary database and a full-text database; and the database engine is connected with the front end.