AU2020103315A4 - A method for digitizing writings in antiquity - Google Patents


Info

Publication number
AU2020103315A4
AU2020103315A4
Authority
AU
Australia
Prior art keywords
single character
document
writings
antiquity
digitizing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
AU2020103315A
Inventor
Weiguo HUANG
Lianwen JIN
Huiyun MAO
Hailin Yang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Institute Of Modern Industry Innovation South China University Of Technology
South China University of Technology SCUT
Original Assignee
Zhuhai Institute Of Modern Industry Innovation South China University Of Technology
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Institute Of Modern Industry Innovation South China University Of Technology, South China University of Technology SCUT filed Critical Zhuhai Institute Of Modern Industry Innovation South China University Of Technology
Priority to AU2020103315A priority Critical patent/AU2020103315A4/en
Application granted granted Critical
Publication of AU2020103315A4 publication Critical patent/AU2020103315A4/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/153Segmentation of character regions using recognition of characters or words
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/412Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/413Classification of content, e.g. text, photographs or tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/414Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/28Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet
    • G06V30/293Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet of characters other than Kanji, Hiragana or Katakana

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for digitizing writings in antiquity. Its specific steps include: collecting data and using the collected data to train a single character detection model so as to obtain single character detection results; meanwhile, training a single character classification model to obtain the classification result of each detected single character, the final document recognition results being obtained by combining the results of single character detection and recognition; extracting the lines of the document layout with the help of graphic morphology and designing an algorithm to solve the problem of biserial (double-column) interlinear notes, which provides the conditions for the structured output of the document; and finally, outputting the digital results of the document corresponding to the images to complete the document digitization work. The method of the invention solves the problems of single character detection in ancient books and documents with complex layouts and dense text, and of stain interference in large document backgrounds. It is simple and efficient, with high recognition accuracy. By combining modern computer information technology with traditional humanistic culture, the invention plays an important and positive role in digital heritage protection, information discovery, paper document transcription, etc.

Description

[Figure 1 flow: Data acquisition (collecting and labeling writings in antiquity) → training the single character detection model and performing single character detection; training the single character classification model and performing single character classification; extraction of layout lines → document structuring → outputting the final digitized document content]
Figure 1 The flow chart of the digitization method of the ancient book documents in the invention
Figure 2 An introduction to dataset sampling used in the invention
AUSTRALIA
PATENTS ACT 1990
PATENT SPECIFICATION FOR THE INVENTION ENTITLED:
A method for digitizing writings in antiquity
The invention is described in the following statement:-
A method for digitizing writings in antiquity
TECHNICAL FIELD
The invention relates to the technical field of image precise positioning and classification,
in particular to a method for digitizing writings in antiquity.
BACKGROUND
With a long history, Chinese culture is broad and profound. Ancient books and
documents contain all the wisdom essence of China's 5000 years history. They are not
only the traditional proof of China's long-standing culture, but also the foundation of the
Chinese nation. More importantly, they are our indispensable spiritual strength. The
historical relics, academic materials and artistic representativeness of ancient books and
documents play an extremely important role in the study of the social style and
development of production, science and culture in ancient China. China has tens of
thousands of ancient books and documents, which record China's long history and
culture, and are very valuable intangible cultural heritage. In order to avoid the aging or
disappearing of ancient books and documents due to the passage of time, and to excavate
and utilize the rich knowledge contained in ancient books and documents, the digitization
of ancient books and documents is particularly important. Optical character recognition
(OCR) technology is closely related to the digitization of ancient books, that is, the
characters on paper can be read out by optical technology and computer technology, so as
to obtain the corresponding text output results.
In recent years, with the development of deep neural network, OCR technology based on
deep learning has achieved remarkable results in fixed format, such as ID card
verification and license plate recognition, which not only reduces labour costs but also greatly improves people's work efficiency.

However, the research on the transcription of ancient books and documents is developing slowly. The main technical difficulties include the complexity of ancient book and document layouts, the difficulty of extracting structured output information, blurred images, low resolution and serious background interference, all of which seriously affect the detection and recognition of characters.
Therefore, there is an urgent need for a simple and efficient method of digitizing writings
in antiquity so as to protect ancient books through timely paper document transcription.
SUMMARY
The purpose of the invention is to provide a method for digitizing writings in antiquity, so
as to solve the problems existing in the prior art and make the ancient books and
documents be accurately transcribed.
To achieve the above purpose, the invention provides the following scheme:
The invention provides a method for digitizing writings in antiquity, including the
following contents.
S1. Data acquisition. Collecting the image data of ancient books and labelling the image
data with single characters and text lines at the space level to obtain the training dataset.
S2. Training and detection of single character detection model. Pre-processing the
training dataset. Based on the general target detection framework YOLO-v3, setting
anchors with different scales and then training the pre-processed training dataset under
the YOLO-v3 detection framework to obtain the single character detection model. The whole image is input directly into the trained single character detection model to obtain the single character detection results.
S3. Training and classification of single character classification model. The single character images are obtained from the single characters labelled in step S1, and the single character classification model is constructed using a convolutional neural network. The single character images are then used to train the model, thereby obtaining the trained single character classification model, which takes single character images as input and outputs the classification results.
S4. Extraction of layout lines. Detecting the straight-line position in the ancient book
document and extracting the different area blocks based on the content of ancient book to
obtain the position relationship between each area block.
S5. Structured output of document. The digitized ancient book document content is output using the results of single character detection and classification and the position relationship between the blocks obtained in step S4.
Preferably, the ancient books collected in step S1 include the simple-layout image set TKH and the complex-layout image sets MTH1000 and MTH1200.
Preferably, the content of the single character annotation in step 1 includes the position of
the single character and the classification category corresponding to the single character;
the text line annotation is to annotate the coordinates and the corresponding sequence
content of the text line from right to left and from top to bottom according to the reading
order of ancient books.
Preferably, the data pre-processing in step S3 includes adaptive threshold binarization of the image data in step S1, as well as Gaussian noise addition and random white filling or cutting off some pixel regions.
Preferably, in step S4, according to the morphological dilation-erosion method combined with the projection method, the invention extracts the straight lines of the ancient book document layout so as to obtain the position relationship between the blocks.
Preferably, in step S5, the characters under double columns are first sorted according to the coordinates from character detection and the positions extracted from the layout, and are then output.
The invention discloses the following technical effects:
The invention solves the problems of single character detection in ancient writings with
complex layout and dense documents and the existence of stain interference in large
document background.
It can identify the content of ancient books simply and efficiently, and subtly combines modern computer information technology with traditional humanistic culture, which plays an important role in digital heritage protection, information discovery, paper document transcription and so on.
BRIEF DESCRIPTION OF THE FIGURES
In order to explain the embodiments of the present invention or the technical scheme in
the prior art more clearly, the figures needed in the embodiments will be briefly
introduced below. Obviously, the figures in the following description are only some
embodiments of the present invention, and for ordinary technicians in the field, other
figures can be obtained according to these figures without paying creative labour.
Figure 1 is the flow chart of the digitization method of the ancient book documents in the
invention.
Figure 2 is an introduction to dataset sampling used in the invention.
Figure 3 is a schematic diagram of the single character classification model of the
invention.
Figure 4 is an example schematic diagram of the detection result of the invention.
Figure 5 is an example schematic diagram of a layout extraction result of the present
invention;
Figure 6 is an example schematic diagram of a structured output result of the present
invention;
Figure 7 is an example schematic diagram of the final result obtained by digitizing
writings in antiquity in the present invention.
Figure 8 is a partial enlarged view of the picture labelled C in Figure 2.
DESCRIPTION OF THE INVENTION
The technical scheme in the embodiments of the present invention will be described
clearly and completely with reference to the figures in the embodiments of the present
invention. Obviously, the described embodiments are only parts of the embodiments of
the present invention, not all of them. Based on the embodiments of the present
invention, all other embodiments obtained by ordinary technicians in the field without
creative labour should belong to the protection scope of the present invention.
In order to make the above objects, features and advantages of the present invention more
obvious and easier to understand, the present invention will be further explained in detail
with reference to the figures and specific embodiments.
As shown in Figs. 1-8, the invention provides a method for digitizing writings in
antiquity and documents, and the specific contents are as follows.
Fig. 1 is the flow chart of the digitization method for ancient books. First of all, the ancient book dataset to be digitized is obtained. The dataset of the embodiment is composed of images with simple and complex layouts, named TKH, MTH1000 and MTH1200 respectively, containing 1000, 1000 and 1200 images in turn, for a total of 3200 images. Then, the 3200 images are annotated at the space level, including the text line level and the single character level based on the reading order. The
images sampled from the ancient book dataset are shown in Fig. 2. Fig. 8 is an enlarged
view of the picture labelled C in Fig. 2. Characters are divided into common characters and rare characters: rare characters occur with low frequency, and only some common characters reach the highest frequencies. The largest single character category in a dataset contains about 1000 images. MTH1200 has the most categories, while TKH has the fewest. The specific statistics are shown in Table 1.
Table 1. Statistics of distribution of ancient datasets
                                    TKH      MTH1000   MTH1200
Total pages                         1000     1000      1200
Total text lines                    23468    27559     21416
Total characters                    323501   420548    337613
Categories of character             1487     5341      5292
Proportion of double column text    0        9.0%      27%

Training single character detection model:
Randomly dividing all 3200 images in the ancient book datasets into training dataset and
test dataset according to the ratio of 4:1, that is, there are 2560 images in training dataset
and 640 images in test dataset. Based on the YOLO-v3 detection model, the detection
results are analysed by comparing the full input method with the slice input method. In
the training process, all 2560 images of the training dataset are scaled to a fixed size of
2048*2048, and then the anchor size is set by K-means clustering method. After training the single character detection model by using the image data in the training dataset, using
640 images in the test dataset to test the trained single character detection model; the test results are shown in Table 2. As can be seen from Table 2, the slice input reduces the number of text boxes per input image and significantly improves the indices at high IoU thresholds. As a data pre-processing operation, slice input has a significant and general effect when detecting dense objects in high-resolution images. The
single character detection result of the embodiment is shown in Fig. 4.
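The slice-input pre-processing compared in Table 2 can be sketched as follows. The tile size, overlap, and helper names here are illustrative assumptions, not values stated in the specification; the idea is simply to cut the page into overlapping tiles, detect within each tile, and shift the resulting boxes back to page coordinates.

```python
import numpy as np

def slice_image(img, tile=1024, overlap=128):
    """Split a large page image into overlapping tiles.

    Returns (tile_array, (x_off, y_off)) pairs; the offsets let
    detections found in a tile be mapped back to page coordinates.
    """
    h, w = img.shape[:2]
    step = tile - overlap
    tiles = []
    for y in range(0, max(h - overlap, 1), step):
        for x in range(0, max(w - overlap, 1), step):
            patch = img[y:y + tile, x:x + tile]
            tiles.append((patch, (x, y)))
    return tiles

def map_box_to_page(box, offset):
    """Shift a (x1, y1, x2, y2) box from tile to page coordinates."""
    x1, y1, x2, y2 = box
    ox, oy = offset
    return (x1 + ox, y1 + oy, x2 + ox, y2 + oy)
```

In practice, boxes detected in the overlap region of two tiles would also need de-duplication (e.g. by non-maximum suppression) before the page-level result is assembled.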
Table 2. Comparison test results of single character detection
             IoU=0.5   IoU=0.6   IoU=0.7   IoU=0.8
Full input   98.32%    97.36%    93.55%    73.28%
Slice input  99.22%    98.61%    96.40%    86.66%

Training single character classification network model:
After data pre-processing and data enhancement by rotation transformation, the single
character classification network model is shown in Fig. 3. It specifically includes: a convolution layer (convolution kernel size 3*3, 1 input channel, 32 output channels); a regularization layer + ReLU activation layer + pooling layer (pooling kernel size 2*2); a convolution layer (kernel size 3*3, 32 input channels, 64 output channels); a regularization layer + ReLU activation layer + pooling layer (pooling kernel size 2*2); a convolution layer (kernel size 3*3, 64 input channels, 128 output channels); a regularization layer + ReLU activation layer + pooling layer (pooling kernel size 2*2); a convolution layer (kernel size 3*3, 128 input channels, 256 output channels); a regularization layer + ReLU activation layer + pooling layer (pooling kernel size 2*2); a fully connected layer (512 output nodes); a regularization layer + ReLU activation layer + dropout layer (dropout ratio 0.3, to prevent over-fitting); and a fully connected layer (512 input nodes, with the number of single character categories as the output size). The Top-1 and Top-5 accuracies in training the single character classification network are 97.111% and 98.87%, respectively.
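The layer stack above can be summarized with a little shape bookkeeping. A minimal sketch, assuming a 64*64 input (the specification does not state the input resolution) and padded 3*3 convolutions, so that only the 2*2 pooling changes the spatial size:

```python
# Channel plan from the description: 1 -> 32 -> 64 -> 128 -> 256,
# each conv stage followed by regularization + ReLU + 2*2 pooling.
CONV_CHANNELS = [1, 32, 64, 128, 256]

def classifier_shapes(input_size=64):
    """Return the (channels, height, width) after each conv+pool stage
    and the flattened size fed to the 512-node fully connected layer."""
    s = input_size
    shapes = []
    for c_out in CONV_CHANNELS[1:]:
        s //= 2                      # only the 2*2 pooling halves the size
        shapes.append((c_out, s, s))
    return shapes, CONV_CHANNELS[-1] * s * s
```

Under these assumptions the final feature map is 256*4*4, so the first fully connected layer maps 4096 inputs to its 512 output nodes.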
The data pre-processing operation includes adaptive threshold binarization of the image
data in step 1, as well as adding Gaussian noise, random whitening, or cutting off part of
the pixel area. Adaptive threshold binarization of the image data can avoid the
interference caused by different image backgrounds. Because binarization often
introduces noise, adding Gaussian noise can increase the generalization ability of the
model. Because the single character detection model cannot guarantee that each single character is regressed particularly accurately, random whitening can improve the robustness of the single character classification network model.
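The noise-and-whitening augmentation can be sketched as below. The noise standard deviation and the patch size are assumed values not given in the specification, and images are taken as floats in [0, 1]:

```python
import numpy as np

def augment(img, rng):
    """Add Gaussian noise, then white-fill a random rectangular region,
    mirroring the augmentation described for the classification model."""
    out = img + rng.normal(0.0, 0.05, img.shape)   # assumed noise sigma
    h, w = img.shape
    ph, pw = h // 4, w // 4                        # assumed patch size
    y = int(rng.integers(0, h - ph))
    x = int(rng.integers(0, w - pw))
    out[y:y + ph, x:x + pw] = 1.0                  # random white filling
    return np.clip(out, 0.0, 1.0)
```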
Extraction of layout straight line:
Using image processing methods combined with the projection method, the positions of the straight lines in the document are detected and the different area blocks of the ancient document are extracted, finally yielding the position relationship between the blocks; the effect is shown in Fig. 5.
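Once the page is binarized, the projection step can be sketched as a column-ink profile; the coverage threshold here is an assumed parameter, and in practice morphological dilation and erosion would first be applied to consolidate the ruling lines:

```python
import numpy as np

def vertical_line_positions(binary, min_frac=0.8):
    """Columns whose ink coverage exceeds min_frac of the page height are
    candidate vertical layout lines (ink pixels are 1, background 0)."""
    coverage = binary.sum(axis=0) / binary.shape[0]
    return np.flatnonzero(coverage >= min_frac).tolist()
```

The same profile taken along rows finds horizontal rules; the intersections of the two then delimit the area blocks.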
Structured output of documents
The structured output of the ancient documents needs to restore the position of the text
and the content of the document. In particular, a technical problem that needs to be solved for the structured output of ancient documents is how to handle double-column annotations in documents. Solving this problem requires outputting a single column in top-to-bottom order, and then outputting the contents of the double columns in right-to-left order. The algorithm shown in the pseudocode below is designed to solve this problem.
Algorithm: structured output post-processing algorithm
Input: the recognition result R of each detection box
Output: the sorted recognition output O
1. Sort the recognition output R of each layout by character width.
2. For each box i in R:
3.     Add box i to the set A.
4.     For each box j in R except i:
5.         If the left edge of the current box j is close to the left edge of set A, or its right edge is close to the right edge of set A, add box j to the same set A.
6.         Update the left and right edges of set A.
7. Repeat the operation recursively on the sets obtained above until no set contains two columns.
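A minimal Python rendering of the post-processing idea follows: the grouping-by-edges step, then right-column-first output. The tolerance value and the simplified non-recursive grouping are assumptions for illustration, not the patent's exact algorithm.

```python
def group_by_edges(boxes, tol=5):
    """Greedily group (x1, y1, x2, y2) boxes whose left or right edges
    align within tol, taking the widest boxes first."""
    groups = []
    for b in sorted(boxes, key=lambda b: b[0] - b[2]):  # widest first
        for g in groups:
            if abs(b[0] - g["l"]) <= tol or abs(b[2] - g["r"]) <= tol:
                g["boxes"].append(b)
                g["l"] = min(g["l"], b[0])
                g["r"] = max(g["r"], b[2])
                break
        else:
            groups.append({"l": b[0], "r": b[2], "boxes": [b]})
    return groups

def double_column_order(boxes, tol=5):
    """Read the right-hand column first (right-to-left reading order),
    each column top-to-bottom."""
    groups = sorted(group_by_edges(boxes, tol), key=lambda g: -g["r"])
    out = []
    for g in groups:
        out.extend(sorted(g["boxes"], key=lambda b: b[1]))
    return out
```

For example, four narrow annotation boxes forming two side-by-side columns are emitted right column first, each top-to-bottom.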
The final result obtained by entering an image of an ancient document and going through
the document digitization method is shown in Fig. 6.
By analysing the shortcomings of traditional methods and deep learning methods, this invention puts forward some new ideas for the digitization of ancient documents, mainly including the use of a sliding window (slice input) method to improve the accuracy of text detection, and the use of morphological methods to obtain the layout extraction results faster. As a result, with the designed recognition network and data enhancement techniques, the structured output of double-column text content has the advantages of simple implementation, high recognition accuracy and fast recognition speed.
In the description of the invention, it needs to be understood that the orientation or position
relationship indicated by the terms "longitudinal", "transverse", "upper", "lower",
"front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inside" and
"outside" are based on the orientation or position relationship shown in the attached
figures, which is only for the convenience of describing the invention, instead of
indicating or implying that the device or element in question must have a specific
orientation, be constructed and operated in a specific orientation, and therefore it cannot
be understood as a limitation of the invention.
The above embodiments only describe the preferred mode of the invention, but do not
limit the scope of the invention. On the premise of not departing from the design spirit of
the invention, various modifications and improvements made by ordinary technicians in
the field to the technical scheme of the invention shall fall within the protection scope
determined by the claims of the invention.

Claims (6)

THE CLAIMS DEFINING THE INVENTION ARE AS FOLLOWS:
1. A method for digitizing writings in antiquity, characterized by comprising the following steps.
S1. Data acquisition. Collecting the image data of ancient books and labelling the image
data with single characters and text lines at the space level to obtain the training dataset.
S2. Training and detection of single character detection model. Pre-processing the
training dataset. Based on the general target detection framework YOLO-v3, setting
anchors with different scales and then training the pre-processed training dataset under
the YOLO-v3 detection framework to obtain the single character detection model. Using
the trained single character detection model to input the whole image directly to detect
and getting the single character detection result.
S3. Training and classification of single character classification model. In step 1, the
single character image will be obtained from the labelled single character, and the single
character classification model will be constructed by using convolution neural network.
Then using the single character image to train the single character classification model,
thereby obtaining the single character classification model. The trained single character
classification model is used to input single character images to get the classification
results.
S4. Extraction of layout lines. Detecting the straight-line position in the ancient book
document and extracting the different area blocks based on the content of ancient book to
obtain the position relationship between each area block.
S5. Structured output of document
2. The method for digitizing writings in antiquity according to claim 1, characterized in that the ancient books collected in step S1 include the simple-layout image set TKH and the complex-layout image sets MTH1000 and MTH1200.
3. The method for digitizing writings in antiquity according to claim 1 is characterized in
that the content of the single character annotation in step 1 includes the position of the
single character and the classification category corresponding to the single character; the
text line annotation is to annotate the coordinates and the corresponding sequence content
of the text line from right to left and from top to bottom according to the reading order of
ancient books.
4. The method for digitizing writings in antiquity according to claim 1 is characterized in
that the data pre-processing in step 3 includes adaptive threshold binarization of the
image data in step 1, as well as Gaussian noise addition and random white filling or
cutting off some pixel regions.
5. The method for digitizing writings in antiquity according to claim 1, characterized in that, in step S4, according to the morphological dilation-erosion method combined with the projection method, the straight lines of the ancient book document layout are extracted so as to obtain the position relationship between the blocks.
6. The method for digitizing writings in antiquity according to claim 1 is characterized in
that in step 5, the digitized ancient book document content is output through the results of
single word detection and classification and the position relationship between each block
obtained in step 4.
AU2020103315A 2020-11-09 2020-11-09 A method for digitizing writings in antiquity Active AU2020103315A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2020103315A AU2020103315A4 (en) 2020-11-09 2020-11-09 A method for digitizing writings in antiquity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2020103315A AU2020103315A4 (en) 2020-11-09 2020-11-09 A method for digitizing writings in antiquity

Publications (1)

Publication Number Publication Date
AU2020103315A4 true AU2020103315A4 (en) 2021-01-14

Family

ID=74103433

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2020103315A Active AU2020103315A4 (en) 2020-11-09 2020-11-09 A method for digitizing writings in antiquity

Country Status (1)

Country Link
AU (1) AU2020103315A4 (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114120333A (en) * 2021-11-29 2022-03-01 武汉大学 Natural scene ancient Chinese character recognition method and system based on deep learning
CN114120333B (en) * 2021-11-29 2024-08-23 武汉大学 Deep learning-based natural scene ancient Chinese character recognition method and system
CN114359894A (en) * 2022-01-13 2022-04-15 浙大城市学院 Buddhist image cultural relic three-dimensional model identification and classification method
CN114359894B (en) * 2022-01-13 2024-04-30 浙大城市学院 Buddhism image cultural relic three-dimensional model identification and classification method

Similar Documents

Publication Publication Date Title
CN101719142B (en) Method for detecting picture characters by sparse representation based on classifying dictionary
Arif et al. Table detection in document images using foreground and background features
CN111507351B (en) Ancient book document digitizing method
CN112016481B (en) OCR-based financial statement information detection and recognition method
Clausner et al. Icfhr 2018 competition on recognition of historical arabic scientific manuscripts–rasm2018
EP1894144A2 (en) Grammatical parsing of document visual structures
Chamchong et al. Character segmentation from ancient palm leaf manuscripts in Thailand
AU2020103315A4 (en) A method for digitizing writings in antiquity
CN110826393B (en) Automatic extraction method of drilling histogram information
Pantke et al. An historical handwritten arabic dataset for segmentation-free word spotting-hadara80p
US7729541B2 (en) Comparative and analytic apparatus method for converting two-dimensional bit map data into three-dimensional data
CN115311666A (en) Image-text recognition method and device, computer equipment and storage medium
CN112036330A (en) Text recognition method, text recognition device and readable storage medium
CN115019310B (en) Image-text identification method and equipment
Lin et al. Multilingual corpus construction based on printed and handwritten character separation
Azmi et al. Digital paleography: Using the digital representation of Jawi manuscripts to support paleographic analysis
Xin et al. Comic text detection and recognition based on deep learning
AU2021104475A4 (en) Methods of digitizing ancient documents
Guan et al. An open dataset for the evolution of oracle bone characters: EVOBC
Kleber et al. Matching table structures of historical register books using association graphs
CN115393865A (en) Character retrieval method, character retrieval equipment and computer-readable storage medium
Samonte et al. Senior citizen social pension management system using optical character recognition
Omar et al. Toward a Robust Segmentation Module Based on Deep Learning Approaches Resolving Historical Cursive Fonts Challenges
Graef et al. A novel hybrid optical character recognition approach for digitizing text in forms
Tamatjita et al. A Lightweight Chinese Character Recognition Model for Elementary Level Hanzi Learning Application

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)