CN113516041A - Tibetan ancient book document image layout segmentation and identification method and system - Google Patents


Info

Publication number
CN113516041A
CN113516041A (application number CN202110526750.2A)
Authority
CN
China
Prior art keywords
tibetan
text
ancient
segmentation
ancient book
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110526750.2A
Other languages
Chinese (zh)
Inventor
王维兰 (Wang Weilan)
陈园园 (Chen Yuanyuan)
王筱娟 (Wang Xiaojuan)
郝玉胜 (Hao Yusheng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwest Minzu University
Original Assignee
Northwest Minzu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwest Minzu University filed Critical Northwest Minzu University
Priority to CN202110526750.2A priority Critical patent/CN113516041A/en
Publication of CN113516041A publication Critical patent/CN113516041A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to a method and a system for layout segmentation and recognition of Tibetan ancient book document images. The method comprises the following steps: constructing a layout segmentation data set of Tibetan ancient book document images; training a deep convolutional neural network on the Tibetan ancient book document image layout segmentation data set; segmenting the layout of a Tibetan ancient book image with the trained deep convolutional neural network; and recognizing the text in the segmented Tibetan ancient book image layout. The method achieves accurate layout segmentation and character recognition of Tibetan ancient book document images.

Description

Tibetan ancient book document image layout segmentation and identification method and system
Technical Field
The invention relates to the field of character recognition, and in particular to a method and a system for layout segmentation and recognition of Tibetan ancient book document images.
Background
Over the past decades, document image layout analysis has been studied for many languages and fonts, and researchers at home and abroad have proposed many different layout analysis methods for printed or handwritten ancient documents. Traditional methods include the following: 1) texture-based analysis, 2) run-length smoothing algorithms, 3) projection-profile cutting, 4) white-space analysis, 5) connected-component analysis, and 6) Voronoi-diagram analysis. Deep learning methods start from the image pixels, use a convolutional neural network (CNN) to generate multi-level features of the image, build a suitable model structure on the extracted features, select a corresponding loss function, and learn the model parameters by optimizing that loss function on a large amount of supervised data. In terms of layout analysis systems, S. Pletschacher et al. released in 2010 a framework for page analysis and the formatting of basic page elements, followed by the layout analysis system Aletheia, whose supported languages have been continuously extended. Subachai Tangwongsan et al. built an efficient document page layout extraction system.
For Tibetan document image layout analysis, only a few researchers at home and abroad have studied the layout analysis of Tibetan ancient book images. Ma et al. developed a framework for the segmentation and recognition of historical Tibetan document images; a layout segmentation method based on block projection segments the Tibetan document image into text, lines and frames, and a text line segmentation method based on a graph model resolves the adhesion between text and frames. Liu et al. proposed a boundary-information-based layout analysis method for historical Tibetan documents, which applies a series of operations such as median filtering, Gaussian smoothing, Sobel edge detection, edge smoothing, small-region removal and boundary-position acquisition, determines the position of each region (text region, left annotation, right annotation, etc.) from the positional relationship between boundaries and regions, and finally saves the document image layout in an XML page-information format. Zhang et al. proposed a historical Tibetan document image text extraction method based on connected-component analysis and corner detection: the document regions of Tibetan historical books are divided into three categories using connected components, the image is divided evenly into grids, the grids are filtered with connected-component category and corner-density information, vertical and horizontal grid projections are computed, the approximate position of the text region is detected by projection analysis, and the text region is accurately extracted by correcting the bounding box of the approximate text region. Duan et al. proposed a block-projection-based text extraction method for historical Tibetan document images, which evenly divides the image into blocks, filters them by connected-component category and corner-density information, finds the approximate text region by block-projection analysis, and extracts the text region. These studies solve the layout segmentation problem for part of the Tibetan ancient book document images with traditional methods and obtain good results. However, traditional segmentation methods designed for a specific type of Tibetan ancient book layout are not robust and do not transfer easily to other types of layout.
Owing to the inherent characteristics of Tibetan ancient books, text frequently adheres to frames and figures, and the page layout is complicated: a page may contain text blocks, images, frames, left and right titles, and so on. The colours of the ancient book images are inconsistent and noisy, and the frame lines in Tibetan ancient books are often bent, inclined or broken, or adhere to characters. These characteristics pose great challenges to high-performance layout segmentation and description of Tibetan ancient book images. Existing document layout analysis methods have two main shortcomings: 1) most of them analyse the layout of relatively regular modern printed books and are not suitable for historical documents with complex layouts; 2) most existing layout analysis methods for historical documents are designed for the characteristics of documents in a particular language and are not fully applicable to Tibetan ancient books.
The invention aims to solve the layout segmentation and recognition problems of Tibetan ancient book document images with a hybrid strategy that combines traditional methods and deep learning.
Disclosure of Invention
The invention aims to provide a method and a system for page segmentation and recognition of ancient Tibetan book document images, which improve segmentation precision.
In order to achieve the purpose, the invention provides the following scheme:
a Tibetan ancient book document image layout segmentation and identification method comprises the following steps:
constructing a page segmentation data set of the ancient Tibetan book document image;
training a deep convolutional neural network based on the Tibetan ancient book document image layout segmentation data set;
segmenting the Tibetan ancient book image layout based on the trained deep convolutional neural network;
and identifying the text in the segmented ancient Tibetan book image layout.
Optionally, the constructing the layout segmentation data set of the ancient Tibetan book document image specifically includes:
acquiring an ancient book image of the Tibetan;
preprocessing the ancient book images of the Tibetan;
carrying out data marking on the preprocessed ancient Tibetan book image to obtain a layout element type; the layout element types include: background, text, left title, right title, and figure;
and expanding the layout element types and generating labels to obtain a layout segmentation data set of the ancient book document image of the Tibetan language.
Optionally, the segmenting the ancient book image layout of the Tibetan language based on the trained deep convolutional neural network specifically includes:
carrying out uneven illumination processing on the ancient Tibetan book document image to be segmented;
carrying out image size normalization processing on the ancient book document image of the Tibetan after uneven illumination processing;
carrying out image slicing on the Tibetan ancient book document image with the normalized size;
inputting the Tibetan ancient book document images after image slicing into the trained deep convolutional neural network respectively to obtain a plurality of prediction results;
merging the plurality of prediction results to obtain a segmentation result of the whole ancient book image of the Tibetan;
and restoring the segmentation result to the original size.
Optionally, the identifying the text in the segmented Tibetan ancient book image layout specifically includes:
and identifying the left title, the body and the right title of the text in the segmented ancient Tibetan book image layout.
Optionally, the identifying of the left title, the body and the right title of the text in the segmented Tibetan ancient book image layout specifically comprises:
constructing a Tibetan ancient book text line data set; the Tibetan ancient book text line data set comprises: a Tibetan ancient book text line synthesis data set and a Tibetan ancient book text line real data set;
training the CRNN neural network based on the Tibetan ancient book text line data set;
identifying a left title and a text in the ancient book image of the Tibetan on the basis of the trained CRNN neural network;
and recognizing the right title in the ancient Tibetan book image by adopting a Chinese OCR interface.
Optionally, the constructing the ancient book text line dataset of the Tibetan specifically includes:
constructing a Tibetan ancient book text line synthesis data set;
and constructing a real data set of the text line of the ancient Tibetan books.
Optionally, the constructing of the Tibetan ancient book text line synthesis data set specifically includes:
obtaining a corpus;
filtering the corpus;
synthesizing text lines based on the filtered corpora;
and generating labels and dictionaries based on the text lines to obtain a text line synthesis data set of the ancient books of the Tibetan language.
Optionally, the constructing of the real data set of the text line of the ancient Tibetan book specifically includes:
acquiring and marking a complete text line;
roughly dividing the complete text line to obtain a shorter text image segment;
marking the text image segment;
and generating a label and a dictionary based on the marked text image segment to obtain a text line real data set of the ancient book of the Tibetan language.
Optionally, the identifying the left title and the text in the ancient book image of the Tibetan language based on the trained CRNN neural network specifically includes:
segmenting a left title and a text in the ancient book document image of the Tibetan;
carrying out binarization processing on the left title and the text;
carrying out line segmentation on the left title and the text after the binarization processing;
carrying out text line segmentation on the left title and the text after line segmentation;
inputting the left title and the text after the text line segmentation into a trained CRNN network for recognition to obtain a corresponding inner code text in the text image segment;
and combining the corresponding inner code texts in the text image segments to obtain the recognition result of the whole left title and the text.
The invention also provides a system for page segmentation and identification of ancient book document images of Tibetan, which comprises:
the Tibetan ancient book document image layout segmentation data set construction module is used for constructing a Tibetan ancient book document image layout segmentation data set;
the deep convolutional neural network training module is used for training a deep convolutional neural network based on the Tibetan ancient book document image layout segmentation data set;
the segmentation module is used for segmenting the ancient book image layout of the Tibetan based on the trained deep convolutional neural network;
and the recognition module is used for recognizing the text in the segmented Tibetan ancient book image layout.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the method and the system of the invention use the currently popular semantic segmentation network deep Lab to segment the layout elements (text blocks, left titles, right titles and illustrations) in the ancient Tibetan book images, and the method can ignore the influence caused by poor border and background quality and obtain good segmentation effect. In addition, compared with the traditional segmentation method aiming at a specific layout, the deep learning segmentation method based on the big data has stronger robustness and better applicability under the condition of ensuring better accuracy;
the Tibetan ancient book image layout description method based on the data storage and data exchange language XML enables the results of Tibetan ancient book layout segmentation and layout element identification to be effectively stored, provides a data base for high-quality and retrievable layout restoration, defines a reasonable data structure for Tibetan ancient book image layout analysis data, constructs a general Tibetan ancient book layout analysis data template through the XML Schema, and can exchange and transmit on multiple platforms while ensuring that the Tibetan ancient book image layout description data can be efficiently stored.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a schematic diagram illustrating a method for page segmentation and recognition of ancient Tibetan book document images according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a layout segmentation data set construction process for ancient Tibetan book document images according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the network structure of DeepLabv1 and v2 according to the embodiment of the present invention;
FIG. 4 is a flowchart illustrating the process of training the ancient Tibetan book image layout segmentation model according to the embodiment of the present invention;
FIG. 5 is a flowchart illustrating a method for dividing a page of an ancient Tibetan book image according to an embodiment of the present invention;
FIG. 6 is a schematic diagram illustrating a segmentation result of the ancient Tibetan book image by page segmentation according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating a step of constructing a text line dataset of ancient Tibetan books according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a CRNN structure according to an embodiment of the present invention;
FIG. 9 is a flow chart of the Tibetan ancient book line of text recognizer training according to the embodiment of the present invention;
FIG. 10 is a flow chart of left heading and text recognition according to an embodiment of the invention;
FIG. 11 is a flowchart of the line segmentation method according to an embodiment of the present invention;
FIG. 12 is a diagram illustrating an embodiment of a data structure of ancient Tibetan book layout description;
FIG. 13(a) is an original image of an ancient Tibetan book document according to an embodiment of the present invention;
FIG. 13(b) is a schematic diagram of an image tag of an ancient Tibetan book document according to an embodiment of the present invention;
FIG. 13(c) is a diagram illustrating the prediction result of the image of the ancient Tibetan book document according to the embodiment of the present invention;
FIG. 13(d) is a diagram illustrating the segmentation result of the ancient book document image in Tibetan according to the embodiment of the present invention;
FIG. 14(a) is a schematic diagram of left-title, text and right-title segmentation according to an embodiment of the present invention;
FIG. 14(b) is a diagram illustrating the text recognition result according to the embodiment of the present invention;
FIG. 15 is a schematic diagram of a system for page segmentation and recognition of ancient Tibetan book document images according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a method and a system for page segmentation and recognition of ancient Tibetan book document images, which improve segmentation precision.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a schematic view of a method for page segmentation and recognition of ancient Tibetan book document images according to an embodiment of the present invention, where the method shown in fig. 1 includes:
step 101: and constructing a page segmentation data set of the ancient Tibetan book document image.
As shown in fig. 2, fig. 2 shows the process of constructing the layout segmentation data set of the ancient document image of the Tibetan:
step 1: and (6) selecting data. The layout data of the ancient Tibetan books come from the great Tibetan of Beijing edition and Lijiang edition Ganzhu, and pages of the books are randomly selected for preprocessing and marking to construct a layout segmentation data set of the ancient Tibetan book images. And finally obtaining 310 ancient book images which comprise 5 layouts, namely a Beijing edition home page, a Beijing edition text, a Beijing edition tail page, a Lijiang edition home page and a Lijiang edition text. Wherein the Beijing edition has 50 pages, 70 pages of text, 50 pages of tail, 66 pages of Lijiang edition and 74 pages of text.
Step 2: data preprocessing. This step mainly deals with the uneven illumination of the Ganzhur Tibetan ancient book images and performs adaptive brightness correction based on a two-dimensional gamma function. The colour space of the ancient book image is first converted from RGB to HSV, and the illumination (V) component is weighted by Gaussian functions of different scales to obtain an estimate of the illumination component, as calculated in formula (1).
I(x, y) = Σ_{i=1}^{N} w_i · [F(x, y) * G_i(x, y)]        (1)
where F(x, y) is the luminance (V) component of the HSV colour space of the input image, G_i(x, y) is the Gaussian function of the i-th scale, * denotes convolution, I(x, y) is the illumination component at point (x, y) extracted and weighted by several Gaussian functions of different scales, and w_i is the weighting coefficient of the illumination component extracted by the i-th scale Gaussian function, i = 1, 2, 3. Balancing the accuracy of the illumination estimate against the amount of computation, N is set to 3, i.e. the illumination component is extracted with Gaussian functions of 3 different scales. The parameters of the two-dimensional gamma function are adjusted adaptively according to the distribution of the illumination component of the image; the expression of the adaptive two-dimensional gamma function is given in formula (2).
O(x, y) = 255 · (F(x, y) / 255)^γ,   γ = (1/2)^((m - I(x, y)) / m)        (2)
where O(x, y) is the corrected luminance component, γ is the exponent used for luminance enhancement, which incorporates the illumination characteristics of the image, and m is the mean luminance of the illumination component.
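As an illustration of this preprocessing step, the following Python sketch implements the multi-scale Gaussian illumination estimate and the adaptive two-dimensional gamma correction, assuming the standard form of formulas (1) and (2) reconstructed above; the Gaussian scales and weights are illustrative defaults, not values taken from the patent.

```python
import cv2
import numpy as np

def correct_illumination(bgr, scales=(15, 80, 250), weights=(1/3, 1/3, 1/3)):
    """Adaptive brightness correction with a two-dimensional gamma function.

    The V channel of the HSV image is taken as the luminance F(x, y); the
    illumination component I(x, y) is estimated as a weighted sum of
    Gaussian-blurred versions of F at three scales (Eq. 1); each pixel is
    then corrected as O = 255 * (F / 255) ** gamma with
    gamma = 0.5 ** ((m - I) / m), where m is the mean of I (Eq. 2).
    """
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    F = hsv[:, :, 2]
    I = np.zeros_like(F)
    for w, s in zip(weights, scales):
        I += w * cv2.GaussianBlur(F, (0, 0), sigmaX=s)        # Eq. (1)
    m = I.mean()
    gamma = np.power(0.5, (m - I) / m)                         # exponent of Eq. (2)
    O = 255.0 * np.power(F / 255.0, gamma)                     # Eq. (2)
    hsv[:, :, 2] = np.clip(O, 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)
```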
Step 3: data marking. According to the layout structure of the selected data, the layout elements are divided into 5 classes: background, text, left title (lefttitle), right title (righttitle) and figure, and the target regions are marked with rectangular boxes using the labelling tool labelme.
Step 4: label generation and sample expansion. Labels are generated with reference to the format of the PASCAL VOC 2012 data set. The PASCAL VOC 2012 data set covers classification, detection and segmentation; its semantic segmentation subset contains 12380 pictures of 20 object classes with labels 1 to 20 (the background is the 21st class, with label 0), and the mask colours can be assigned arbitrarily. The labelme-format annotation files marked manually in Step 3 are converted into label files in the PASCAL VOC 2012 format. According to the layout element types of the Tibetan ancient book images, the background label is set to 0 with mask colour (0,0,0), the text label to 1 with mask colour (128,0,0), the left title label to 2 with mask colour (0,128,0), the right title label to 3 with mask colour (128,128,0), and the figure label to 4 with mask colour (0,0,128), as shown in Table 1.
TABLE 1 Labels and their corresponding mask colours
Layout element   Label   Mask colour (R,G,B)
background       0       (0,0,0)
text             1       (128,0,0)
left title       2       (0,128,0)
right title      3       (128,128,0)
figure           4       (0,0,128)
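A minimal sketch of this label generation step, assuming labelme rectangle annotations whose label names match the keys in Table 1; the function name and file handling are illustrative, not the patent's implementation.

```python
import json
import numpy as np
from PIL import Image, ImageDraw

# Class labels and mask colours from Table 1.
CLASSES = {"background": (0, (0, 0, 0)),
           "text":       (1, (128, 0, 0)),
           "lefttitle":  (2, (0, 128, 0)),
           "righttitle": (3, (128, 128, 0)),
           "figure":     (4, (0, 0, 128))}

def labelme_to_mask(json_path, width, height):
    """Convert one labelme annotation file into a PASCAL VOC style colour mask."""
    mask = Image.new("RGB", (width, height), CLASSES["background"][1])
    draw = ImageDraw.Draw(mask)
    with open(json_path, encoding="utf-8") as f:
        ann = json.load(f)
    for shape in ann["shapes"]:
        if shape.get("shape_type", "rectangle") != "rectangle":
            continue                                   # only rectangular marks are used here
        _, colour = CLASSES[shape["label"]]
        (x1, y1), (x2, y2) = shape["points"]
        draw.rectangle([min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2)], fill=colour)
    return np.array(mask)
```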
Because the Tibetan ancient book images are very large (mostly about 5000 x 3000 pixels for the Beijing edition and 1200 x 300 for the Lijiang edition), they cannot be fed into the network at full size, and only 310 samples were marked. Therefore a sliding-window slicing scheme is adopted to reduce the picture size and enlarge the data set, finally yielding 11440 sample pictures of size 321 x 321. The data set is named THDID-LS (Tibetan Historical Document Image Dataset - Layout Segmentation).
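The sliding-window slicing used to enlarge the data set can be sketched as follows; a plain non-overlapping 321 x 321 grid is assumed here, although a smaller stride (overlapping windows) would expand the data set further.

```python
def slice_image(img, tile=321, stride=321):
    """Cut a page image (NumPy array) into tile x tile patches on a regular grid."""
    h, w = img.shape[:2]
    patches = []
    for y in range(0, h - tile + 1, stride):
        for x in range(0, w - tile + 1, stride):
            patches.append(img[y:y + tile, x:x + tile])
    return patches
```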
Step 102: and training a deep convolutional neural network based on the Tibetan ancient book document image layout segmentation data set.
Fig. 3 shows the structure of the DeepLab network. The input is a three-channel or single-channel image to be segmented; a deep convolutional neural network with atrous convolution extracts deep features to obtain a coarse segmentation result (Coarse Score map); the coarse result is up-sampled by bilinear interpolation; and the final segmentation result (Final Output) is obtained with a fully connected conditional random field (Fully Connected CRF).
Based on the DeepLab network, a segmentation model is trained on the THDID-LS data set to realize layout segmentation of Tibetan ancient book document images. The training flow of the Tibetan ancient book image layout segmentation model is shown in Fig. 4.
Step 1: initialize the network and set the hyper-parameters. The network structure is initialized, including the choice of optimizer and the definition of the loss function, and the hyper-parameters required by the network, such as the learning rate and the number of training epochs, are set.
Step 2: load the data set. The Tibetan ancient book document image layout segmentation data set is read from storage according to its path.
Step 3: train the model. The pictures are fed into the network for forward propagation to obtain the actual output, the loss with respect to the label images is computed through the loss function, and the optimizer continuously adjusts the network parameters so that the error between the actual output and the label images becomes smaller in the next training round (a minimal training sketch follows these steps).
Step 4: save the model. After training, the trained model and its parameters are stored in a model file.
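A minimal training sketch corresponding to Steps 1 to 4, using torchvision's DeepLabV3 as a convenient stand-in for the DeepLab variant of Fig. 3; the optimizer, batch size, epoch count and file name are illustrative assumptions, and train_set is assumed to yield (image, mask) pairs with class indices 0 to 4 as in Table 1.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision.models.segmentation import deeplabv3_resnet50

def train_layout_model(train_set, num_classes=5, epochs=50, lr=1e-3, device="cuda"):
    """Step 1: build the model, loss and optimizer; Step 2: load data;
    Step 3: optimize; Step 4: save the weights."""
    model = deeplabv3_resnet50(num_classes=num_classes).to(device)
    loader = DataLoader(train_set, batch_size=8, shuffle=True)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for images, masks in loader:
            images, masks = images.to(device), masks.to(device).long()
            out = model(images)["out"]            # (B, 5, H, W) class logits
            loss = criterion(out, masks)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    torch.save(model.state_dict(), "thdid_ls_deeplab.pth")
    return model
```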
After training, the Tibetan ancient book document image layout segmentation model trained on the semantic segmentation network DeepLab reaches a segmentation accuracy of 90.3% on the THDID-LS data set.
Step 103: and segmenting the Tibetan ancient book image layout based on the trained deep convolution neural network.
After the segmentation model has been trained, it can be used to obtain the layout segmentation result of a Tibetan ancient book image. Fig. 5 shows the flow chart of the Tibetan ancient book image layout segmentation method.
Step 1: uneven illumination processing. The Tibetan ancient book document image to be segmented is preprocessed to correct the influence of uneven illumination (a sketch of the whole flow is given after Step 6).
Step 2: and (5) normalizing the image size. Since the input image size requirement of the segmentation model is 321 × 321, the ancient Tibetan image size is normalized to (321 × M) × (321 × N).
Step 3: and (4) slicing the image. And cutting the normalized ancient Tibetan book image into X sub-images with the size of 321 × 321, wherein X is M × N.
Step 4: segmentation prediction. The segmentation model trained with the DeepLab network is called to perform layout segmentation prediction on each of the X sliced sub-images, giving the prediction results.
Step 5: and merging the prediction results. And combining the segmentation results of the X sub-images into the segmentation result of the whole ancient book document image of the Tibetan.
Step 6: and restoring the original size of the image. And restoring the segmentation result to the original size.
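The six steps above can be sketched end to end as follows, reusing correct_illumination from the preprocessing sketch and a model with the same output interface as the training stand-in; the resizing and tiling details are illustrative.

```python
import cv2
import numpy as np
import torch

def segment_page(model, bgr, tile=321, device="cuda"):
    """Steps 1-6: illumination correction, size normalization, slicing,
    per-tile prediction, merging, and restoring the original size."""
    h0, w0 = bgr.shape[:2]
    img = correct_illumination(bgr)                                  # Step 1
    H = max(1, round(h0 / tile)) * tile
    W = max(1, round(w0 / tile)) * tile
    img = cv2.resize(img, (W, H))                                    # Step 2
    label_map = np.zeros((H, W), dtype=np.uint8)
    model.eval()
    with torch.no_grad():
        for y in range(0, H, tile):                                  # Step 3: slice
            for x in range(0, W, tile):
                patch = img[y:y + tile, x:x + tile]
                t = torch.from_numpy(patch).permute(2, 0, 1).float().unsqueeze(0) / 255.0
                pred = model(t.to(device))["out"].argmax(1)[0].cpu().numpy()  # Step 4
                label_map[y:y + tile, x:x + tile] = pred             # Step 5: merge
    return cv2.resize(label_map, (w0, h0), interpolation=cv2.INTER_NEAREST)  # Step 6
```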
As shown in Fig. 6, the example is the segmentation result of page 002.15 of the Beijing-edition Ganzhur Tripitaka.
Step 104: and identifying the text in the segmented ancient Tibetan book image layout.
After segmentation, the text parts of the layout need to be recognized. The recognition problems of the invention fall into 2 classes: a) Tibetan text recognition, covering the left title and the body, as in the left and middle parts of Fig. 6; b) recognition of traditional and simplified Chinese characters, since the right title mixes traditional and simplified Chinese, as in the right part of Fig. 6. For class a, Tibetan text line recognition is carried out with an end-to-end method: the segmented text block image is first binarized with a method based on an improved U-shaped fully convolutional neural network, text lines are then segmented by combining character core regions with expansion growing, the segmented text lines are further divided with a run-length smoothing algorithm, and finally end-to-end Tibetan text line recognition is performed with CRNN + CTC. For class b, a public Chinese OCR model is called to obtain the recognition result. Finally, the text recognition results are combined with the segmented position information in preparation for the subsequent layout description.
Specifically, the step of identifying the text in the segmented Tibetan ancient book image layout specifically includes:
and identifying the left title, the body and the right title of the text in the segmented ancient Tibetan book image layout.
The method specifically comprises the following steps of identifying the left title and the body of the text in the segmented ancient Tibetan book image layout:
step 201: constructing a text line dataset of ancient Tibetan books; the Tibetan ancient book text line dataset comprises: the Tibetan ancient book text line synthesis data set and the Tibetan ancient book text line real data set.
A Tibetan ancient book document image text line synthesis data set and a real data set are constructed to train and test the recognizer for the left title and the body. The synthesis data set provides a large amount of simulated data, obtained by synthesis, for training the recognizer when real data are hard to obtain. The real data set consists of a small amount of manually marked real data; on the one hand it verifies the recognizer trained on the synthesis data set, and on the other hand real Tibetan ancient book text line data are added when the recognizer is retrained iteratively, improving the generalization of the recognition model. The data set construction steps are shown in Fig. 7.
The construction steps of the Tibetan ancient book and text line synthesis data set are as follows:
step 1: and preparing the corpus. The "Ganzuer" great Tibetan has versions of Beijing edition, Lijiang edition, and Lasa edition, the content of the books of each version is basically the same, the Lijiang edition has been finished with the interest of researchers to disclose the text content of a complete 108-letter book, and a total of 3398 text files are selected as the corpus of the line data synthesis of the Tibetan text. However, since the organized text file contains Chinese titles, page numbers, etc., Chinese characters, English characters, and numeric symbols need to be filtered from the text file. The basic method of filtration is: the Unicode encoding range for Tibetan characters is 0F00 to 0FFF, only characters within this range are retained, and the rest are filtered. Finally, a corpus file with the size of 283.75M is obtained.
Step 2: text line data synthesis. The script of the Tibetan ancient books studied is Uchen (Ujin), so the synthesis fonts are selected from the most popular Uchen typefaces, including Qomolangma-UchenSuring, Qomolangma-UchenSutung, Qomolangma-UchenSarchung, Qomolangma-UchenSarchen and ctrc-uchen. Meanwhile, to make the synthesized document image text lines as close as possible to real text line data, random stains and noise points sampled from real samples are added to the synthesized images to simulate the stains and noise in real data, random skew is applied to simulate the inclination of text lines, and random Gaussian noise is added to simulate the blur caused in real data by acquisition or fading. The synthesized Tibetan document image text line data set is named TTLDS-G (Tibetan Text Line Data Set - Generation).
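A rough sketch of this synthesis step with Pillow; the font path must point to one of the Uchen fonts listed above (e.g. a Qomolangma-UchenSuring TTF), and all degradation parameters (skew range, noise level, stain count) are illustrative rather than the patent's values.

```python
import random
import numpy as np
from PIL import Image, ImageDraw, ImageFont, ImageFilter

def synthesize_line(text, font_path, height=48):
    """Render one Tibetan text line and degrade it to resemble real pages."""
    font = ImageFont.truetype(font_path, size=int(height * 0.8))
    w = int(font.getlength(text)) + 20
    img = Image.new("L", (w, height), color=255)
    ImageDraw.Draw(img).text((10, 5), text, font=font, fill=0)
    img = img.rotate(random.uniform(-2, 2), expand=True, fillcolor=255)   # random skew
    arr = np.array(img, dtype=np.float32)
    arr += np.random.normal(0, 8, arr.shape)                # Gaussian noise / fading
    for _ in range(random.randint(0, 5)):                   # random stains and noise points
        y = np.random.randint(0, arr.shape[0])
        x = np.random.randint(0, arr.shape[1])
        arr[max(0, y - 2):y + 3, max(0, x - 2):x + 3] -= 60
    img = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    return img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 1)))
```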
Step 3: label and dictionary generation. A label file in which each line has the format "image path, text corresponding to the image" is generated with reference to the document image text recognition data sets ICDAR 2013 and MJSynth. All Tibetan characters are extracted from the label file and de-duplicated, giving 4851 Tibetan character classes in total. A dictionary file with one Tibetan character class per line is generated, and the last line of the dictionary is assigned "<BLK>", for a total of 4852 lines.
Specifically, the construction steps of the image text line real data set of the ancient Tibetan book document are as follows:
step 1: complete lines of text are acquired and marked. The text lines of the marked original Tibetan document image are derived from 212 selected Beijing version Ganzhu Dazang Jing images, the total lines are 1696, and the corresponding inner code Tibetan text of each line is marked.
Step 2: and carrying out rough segmentation on text lines of the document image. And roughly segmenting the text line of the image of the ancient book document of the Tibetan language by using a character segmentation method based on the structural attribute to obtain a shorter text image segment.
Step 3: text image segment markers. Marking the rough segmentation result, and generating a label file with a single text line format of 'text corresponding to image path image'.
Step 4: and synthesizing the data set with the text line of the ancient book of the Tibetan language at Step 3 to obtain a dictionary file. The Real Tibetan document image Text Line Data Set is named as TTLDS-R (Tibet Text Line Data Set-Real).
Step 202: and training the CRNN neural network based on the Tibetan ancient book text line data set.
CRNN is a CNN + RNN + CTC structure, an end-to-end text recognition network proposed by Shi et al. in 2017, which treats document image text recognition as the prediction of a character sequence and therefore uses the LSTM (long short-term memory) variant of the sequence-prediction network RNN (recurrent neural network). The features of the picture are first extracted by a CNN (convolutional neural network), the sequence is then predicted by the RNN, and finally CTC (Connectionist Temporal Classification) aligns the input and output sequences to obtain the final recognition result. The CRNN architecture consists of three parts: 1) convolutional layers, which extract a feature sequence from the input image; 2) recurrent layers, which predict the label distribution of each frame; 3) transcription layers, which convert the per-frame predictions into the final label sequence. The CRNN structure diagram in Fig. 8 shows the process from an input document image segment to its internal-code text.
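A compact PyTorch sketch of the CRNN described above (convolutional layers, recurrent layers, transcription fed to a CTC loss); the layer sizes and the 32-pixel input height are illustrative assumptions, while the 4852-class output follows the data set construction above (4851 Tibetan character classes plus <BLK>).

```python
import torch
from torch import nn

class CRNN(nn.Module):
    """Convolutional feature extractor + bidirectional LSTM + linear output
    whose per-frame scores are fed to nn.CTCLoss during training."""

    def __init__(self, num_classes=4852, img_h=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1), (2, 1)),                  # halve height only
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1), (2, 1)),
        )
        feat_h = img_h // 16                               # 2 for img_h = 32
        self.rnn = nn.LSTM(256 * feat_h, 256, num_layers=2, bidirectional=True)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):                                  # x: (B, 1, 32, W)
        f = self.cnn(x)                                    # (B, 256, 2, W/4)
        b, c, h, w = f.size()
        f = f.permute(3, 0, 1, 2).reshape(w, b, c * h)     # (T, B, C*H) frame sequence
        out, _ = self.rnn(f)
        return self.fc(out)     # (T, B, num_classes); apply log_softmax before nn.CTCLoss
```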
A recognizer is trained on the TTLDS-G data set with the CRNN network to recognize the left title and the body in Tibetan ancient book document images. The training flow of the Tibetan ancient book document image text line recognizer is shown in FIG. 9.
The Tibetan ancient book CRNN recognizer training steps are as follows:
step 1: initialize the network and set the hyper-parameters. Initializing a network structure, including the selection of an optimizer, the definition of a loss function and the like, and setting hyper-parameters required by the network, such as learning rate, training round number and the like.
Step 2: the data set is loaded. And reading the data set from the memory according to the path, namely reading the ancient book data set of the Tibetan language.
Step 3: and (5) training a model. Sending the pictures into a network for forward propagation to obtain actual output, calculating the loss between the pictures and the label through a loss function, and continuously adjusting network parameters by using an optimizer to ensure that the error between the actual output of the next round of training and the label is smaller.
Step 4: and saving the model. And after the training is finished, storing the trained model and parameters into a model file.
After training, the recognition rate of the ancient Tibetan book text recognizer trained on the CRNN network on the data set TTLDS-G is 98.02%, and the recognition rate on the real data set TTLDS-R is 85.41%.
Step 203: and identifying the left title and the text in the ancient image of the Tibetan on the basis of the trained CRNN neural network.
After the recognizer has been trained, the Tibetan ancient book CRNN recognizer is obtained, and the left titles and bodies segmented from Tibetan ancient book images can be recognized with it. Fig. 10 shows the flow chart of left title and body recognition in Tibetan ancient book document images.
Text block image binarization, text block line segmentation, text line segmentation, recognition, and merging of the recognition results are explained in turn below.
a. Binarization method
Since line segmentation is performed on a binarized image, the left title and the body are binarized first. Traditional binarization methods are basically threshold-based (global or local thresholds); common methods include the global mean threshold method, Otsu's method (OTSU) and the Bernsen algorithm. However, Tibetan ancient books have been eroded over the years: the characters are faded, the pages are yellowed and badly stained, and a simple binarization method cannot handle these problems well. The invention adopts a Tibetan ancient book binarization method based on an improved U-shaped fully convolutional neural network, which specifically handles uneven illumination, blur, stains, pseudo-adhesion and similar conditions in Tibetan ancient book images.
b. Text block line segmentation
Lines are cut from the left title and the body with a method that combines character core regions with expansion growing. First, connected components that are not syllable points are removed from the binarized Tibetan ancient book document image according to their area and circularity, giving a syllable-point image. Second, the syllable-point image is projected horizontally and the binarized original image vertically to obtain the baseline range and the number of text lines, and character core regions are generated; a pixel-wise OR of the character core regions and the binarized original image gives pseudo text connected regions. Finally, the character core regions are expanded into the pseudo text connected regions with a breadth-first algorithm to obtain pseudo text line connected regions, non-character regions are removed to obtain pseudo text lines, and the final text lines are obtained by assigning broken strokes to their proper lines. Fig. 11 shows the processing procedure of this method.
c. Text line segmentation
The result of the line segmentation in b is further divided to shorten the text lines in the document image, so that they suit the trained end-to-end CRNN recognizer for Tibetan document image text lines. Each separated line image undergoes run-length smoothing and connected-component attribution, and each connected component is then taken out as a segmentation result, i.e. a text image segment.
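The run-length smoothing step can be sketched as follows; the smoothing threshold is an illustrative value, and the connected components of the smoothed image can then be taken out as text image segments (for example with cv2.connectedComponentsWithStats).

```python
import numpy as np

def rlsa_horizontal(binary, threshold=20):
    """Horizontal run-length smoothing on a binary line image (text = 1,
    background = 0): background runs shorter than `threshold` pixels are
    filled, so neighbouring strokes merge into blocks that become the
    text image segments."""
    smoothed = binary.copy()
    for row in smoothed:
        run_start = None
        for x, v in enumerate(row):
            if v == 0 and run_start is None:
                run_start = x                       # background run begins
            elif v == 1 and run_start is not None:
                if x - run_start < threshold:
                    row[run_start:x] = 1            # fill the short gap
                run_start = None
    return smoothed
```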
d. Identification
The Tibetan ancient book CRNN recognizer is called to recognize each text image segment, giving the recognition result of each segment, i.e. the internal-code text corresponding to the segment.
e. Merging recognition results
And combining the recognition results of the text image segments into the recognition results of the whole left title and the text.
For right-title recognition, the Baidu open OCR service is used to obtain the recognition result. Baidu's general character recognition platform is based on advanced deep learning technology and provides multi-scene, high-accuracy whole-image text detection and recognition services, with several indicators at the leading international level. The invention uses the platform's general character recognition interface with position information to recognize the Chinese right titles in Tibetan ancient book image layouts.
Step 105: layout description.
Through analysis of the above layout segmentation and recognition results, the basic information they contain comprises layout element category, layout element position and layout element content. After integration, the results fall into two categories: a) character regions: for the three layout element types text, left title and right title, the described information comprises category, position and recognition result; b) image regions: layout elements of the image type are first stored in a unified picture format, and their storage path, category and position are then described.
The layout description data are divided into 3 types: creation information, image information and layout information. The creation information comprises the creator, the creation time and the last modification time; the image information comprises the image name and the width and height of the image; the layout information comprises the detailed information of each region in the layout. The Tibetan ancient book layout description data structure is shown in Fig. 12.
According to the data structure of Fig. 12, an XML Schema is constructed with the XML editing tool Altova XMLSpy to describe the structure, and it serves as the description template. The Tibetan ancient book layout description schema constructed according to Fig. 12 is named Schema_THLD; in it the layout element types are identified by the numbers 1, 2, 3 and so on, where 1 denotes text, 2 the left title, 3 the right title and 4 a figure.
Concrete layout description files for all subsequent Tibetan ancient book layout descriptions are then generated in the extensible markup language XML according to the rules in Schema_THLD.
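A sketch of generating such a layout description file with Python's xml.etree.ElementTree; the element and attribute names below are illustrative stand-ins, since the actual tag names are fixed by the Schema_THLD template rather than reproduced in this text.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

REGION_TYPES = {1: "text", 2: "lefttitle", 3: "righttitle", 4: "figure"}

def build_layout_xml(image_name, width, height, regions, creator="THLD"):
    """Emit a layout description roughly following the three-part structure
    of Fig. 12 (creation info, image info, layout info).
    Each region dict is assumed to hold "type" (1-4), "bbox" (x1, y1, x2, y2),
    and either "text" or "path"."""
    root = ET.Element("TibetanLayoutDocument")
    creation = ET.SubElement(root, "CreationInfo")
    ET.SubElement(creation, "Creator").text = creator
    ET.SubElement(creation, "Created").text = datetime.now().isoformat()
    ET.SubElement(root, "ImageInfo",
                  {"name": image_name, "width": str(width), "height": str(height)})
    layout = ET.SubElement(root, "LayoutInfo")
    for r in regions:
        region = ET.SubElement(layout, "Region",
                               {"type": str(r["type"]),
                                "category": REGION_TYPES[r["type"]]})
        x1, y1, x2, y2 = r["bbox"]
        ET.SubElement(region, "Coords",
                      {"x1": str(x1), "y1": str(y1), "x2": str(x2), "y2": str(y2)})
        if r["type"] != 4:                         # character regions carry recognized text
            ET.SubElement(region, "TextContent").text = r.get("text", "")
        else:                                      # image regions carry a storage path
            ET.SubElement(region, "ImagePath").text = r.get("path", "")
    return ET.ElementTree(root)
```

The resulting tree can be written out with tree.write("page.xml", encoding="utf-8", xml_declaration=True) and exchanged across platforms like any other XML document.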
As shown in fig. 13(a) -13 (d), and fig. 14(a) and 14(b), fig. 13(a) is an original image of a Tibetan ancient book document according to an embodiment of the present invention, fig. 13(b) is a schematic diagram of a label of a Tibetan ancient book document according to an embodiment of the present invention, fig. 13(c) is a schematic diagram of a prediction result of a Tibetan ancient book document image according to an embodiment of the present invention, fig. 13(d) is a schematic diagram of a segmentation result of a Tibetan ancient book document image according to an embodiment of the present invention, fig. 14(a) is a schematic diagram of a segmentation of a left header, a body and a right header according to an embodiment of the present invention, and fig. 14(b) is a schematic diagram of a text recognition result according to an embodiment of the present invention.
Fig. 15 is a schematic structural diagram of a system for page segmentation and recognition of ancient Tibetan book document images according to an embodiment of the present invention, as shown in fig. 15, the system includes:
the Tibetan ancient book document image layout segmentation data set construction module 301 is used for constructing a Tibetan ancient book document image layout segmentation data set;
a deep convolutional neural network training module 302, which trains a deep convolutional neural network based on the Tibetan ancient book document image layout segmentation data set;
the segmentation module 303 is used for segmenting the page of the Tibetan ancient book image based on the trained deep convolutional neural network;
and the identification module 304 is used for identifying the text in the segmented Tibetan ancient book image layout.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (10)

1. A Tibetan ancient book document image layout segmentation and identification method is characterized by comprising the following steps:
constructing a page segmentation data set of the ancient Tibetan book document image;
training a deep convolutional neural network based on the Tibetan ancient book document image layout segmentation data set;
segmenting the Tibetan ancient book image layout based on the trained deep convolutional neural network;
and identifying the text in the segmented ancient Tibetan book image layout.
2. The Tibetan ancient book document image layout segmentation and identification method according to claim 1, wherein the constructing of the Tibetan ancient book document image layout segmentation dataset specifically comprises:
acquiring an ancient book image of the Tibetan;
preprocessing the ancient book images of the Tibetan;
carrying out data marking on the preprocessed ancient Tibetan book image to obtain a layout element type; the layout element types include: background, text, left title, right title, and figure;
and expanding the layout element types and generating labels to obtain a layout segmentation data set of the ancient book document image of the Tibetan language.
3. The Tibetan ancient book document image layout segmentation and identification method according to claim 1, wherein the segmentation of the Tibetan ancient book image layout based on the trained deep convolutional neural network specifically comprises:
carrying out uneven illumination processing on the ancient Tibetan book document image to be segmented;
carrying out image size normalization processing on the ancient book document image of the Tibetan after uneven illumination processing;
carrying out image slicing on the Tibetan ancient book document image with the normalized size;
inputting the Tibetan ancient book document images after image slicing into the trained deep convolutional neural network respectively to obtain a plurality of prediction results;
merging the plurality of prediction results to obtain a segmentation result of the whole ancient book image of the Tibetan;
and restoring the segmentation result to the original size.
4. The Tibetan ancient book document image layout segmentation and identification method according to claim 1, wherein the identification of the text in the segmented Tibetan ancient book image layout specifically comprises:
and identifying the left title, the body and the right title of the text in the segmented ancient Tibetan book image layout.
5. The Tibetan ancient book document image layout segmentation and identification method as claimed in claim 4, wherein the identifying of the left title, the body and the right title of the text in the segmented Tibetan ancient book image layout comprises the following steps:
constructing a Tibetan ancient book text line data set; the Tibetan ancient book text line data set comprises: a Tibetan ancient book text line synthesis data set and a Tibetan ancient book text line real data set;
training the CRNN neural network based on the Tibetan ancient book text line data set;
identifying a left title and a text in the ancient book image of the Tibetan on the basis of the trained CRNN neural network;
and recognizing the right title in the ancient Tibetan book image by adopting a Chinese OCR interface.
6. The Tibetan ancient book document image layout segmentation and identification method according to claim 5, wherein the step of constructing the Tibetan ancient book text line data set specifically comprises the steps of:
constructing a Tibetan ancient book text line synthesis data set;
and constructing a real data set of the text line of the ancient Tibetan books.
7. The Tibetan ancient book document image layout segmentation and identification method as claimed in claim 6, wherein the construction of the Tibetan ancient book text line synthesis data set specifically comprises:
obtaining a corpus;
filtering the corpus;
synthesizing text lines based on the filtered corpora;
and generating labels and dictionaries based on the text lines to obtain a text line synthesis data set of the ancient books of the Tibetan language.
8. The Tibetan ancient book document image layout segmentation and identification method according to claim 6, wherein the step of constructing the Tibetan ancient book text line real data set specifically comprises the steps of:
acquiring and marking a complete text line;
roughly dividing the complete text line to obtain a shorter text image segment;
marking the text image segment;
and generating a label and a dictionary based on the marked text image segment to obtain a text line real data set of the ancient book of the Tibetan language.
9. The Tibetan ancient book document image layout segmentation and identification method according to claim 5, wherein the identification of the left title and the text in the Tibetan ancient book image based on the trained CRNN neural network specifically comprises:
segmenting a left title and a text in the ancient book document image of the Tibetan;
carrying out binarization processing on the left title and the text;
carrying out line segmentation on the left title and the text after the binarization processing;
carrying out text line segmentation on the left title and the text after line segmentation;
inputting the left title and the text after the text line segmentation into a trained CRNN network for recognition to obtain a corresponding inner code text in the text image segment;
and combining the corresponding inner code texts in the text image segments to obtain the recognition result of the whole left title and the text.
10. A Tibetan ancient book document image layout segmentation and identification system is characterized by comprising:
the Tibetan ancient book document image layout segmentation data set construction module is used for constructing a Tibetan ancient book document image layout segmentation data set;
the deep convolutional neural network training module is used for training a deep convolutional neural network based on the Tibetan ancient book document image layout segmentation data set;
the segmentation module is used for segmenting the ancient book image layout of the Tibetan based on the trained deep convolutional neural network;
and the recognition module is used for recognizing the text in the segmented Tibetan ancient book image layout.
CN202110526750.2A 2021-05-14 2021-05-14 Tibetan ancient book document image layout segmentation and identification method and system Pending CN113516041A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110526750.2A CN113516041A (en) 2021-05-14 2021-05-14 Tibetan ancient book document image layout segmentation and identification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110526750.2A CN113516041A (en) 2021-05-14 2021-05-14 Tibetan ancient book document image layout segmentation and identification method and system

Publications (1)

Publication Number Publication Date
CN113516041A true CN113516041A (en) 2021-10-19

Family

ID=78064325

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110526750.2A Pending CN113516041A (en) 2021-05-14 2021-05-14 Tibetan ancient book document image layout segmentation and identification method and system

Country Status (1)

Country Link
CN (1) CN113516041A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115422125A (en) * 2022-09-29 2022-12-02 浙江星汉信息技术股份有限公司 Electronic document automatic filing method and system based on intelligent algorithm

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112561928A (en) * 2020-12-10 2021-03-26 西藏大学 Layout analysis method and system for ancient Tibetan books
CN112633431A (en) * 2020-12-31 2021-04-09 西北民族大学 Tibetan-Chinese bilingual scene character recognition method based on CRNN and CTC

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112561928A (en) * 2020-12-10 2021-03-26 西藏大学 Layout analysis method and system for ancient Tibetan books
CN112633431A (en) * 2020-12-31 2021-04-09 西北民族大学 Tibetan-Chinese bilingual scene character recognition method based on CRNN and CTC

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YIXIN LI et al.: "DeepLayout: A Semantic Segmentation Approach to Page Layout Analysis", International Conference on Intelligent Computing ICIC 2018: Intelligent Computing Methodologies *
WU YANRU (吴燕如): "Research on Layout Detection of Modern Tibetan Printed Matter", China Master's Theses Full-text Database, Philosophy and Humanities *
ZHANG XIQUN (张西群): "Research on Layout Segmentation Methods for Tibetan Historical Documents", China Master's Theses Full-text Database, Information Science and Technology *
CHEN YUANYUAN (陈园园) et al.: "Layout Segmentation and Description of Tibetan Document Images Based on an Adaptive Run-Length Smoothing Algorithm", Laser & Optoelectronics Progress *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115422125A (en) * 2022-09-29 2022-12-02 浙江星汉信息技术股份有限公司 Electronic document automatic filing method and system based on intelligent algorithm
CN115422125B (en) * 2022-09-29 2023-05-19 浙江星汉信息技术股份有限公司 Electronic document automatic archiving method and system based on intelligent algorithm


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination