CN112329548A - Document chapter segmentation method and device and storage medium - Google Patents

Document chapter segmentation method and device and storage medium

Info

Publication number
CN112329548A
Authority
CN
China
Prior art keywords
information
chapter
picture
electronic document
column
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011106303.3A
Other languages
Chinese (zh)
Inventor
薛晗庆
潘红九
李昊星
陈超
窦小明
施卫科
雷净
李萌萌
杨飞
尹琼
底亚峰
皮彬睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Near Space Vehicles System Engineering
Original Assignee
Beijing Institute of Near Space Vehicles System Engineering
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Near Space Vehicles System Engineering filed Critical Beijing Institute of Near Space Vehicles System Engineering
Priority to CN202011106303.3A priority Critical patent/CN112329548A/en
Publication of CN112329548A publication Critical patent/CN112329548A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40: Document-oriented image-based pattern recognition
    • G06V30/41: Analysis of document content
    • G06V30/414: Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/14: Image acquisition
    • G06V30/148: Segmentation of character regions
    • G06V30/153: Segmentation of character regions using recognition of characters or words

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a document chapter segmentation method, a document chapter segmentation device and a storage medium, which are used for improving the speed and accuracy of chapter content segmentation of a picture electronic document. The document chapter segmentation method provided by the application comprises the following steps: reading a picture electronic document, together with the tree-shaped directory structure information and in-page column information of the picture electronic document; performing column segmentation on the picture electronic document according to the in-page column information to obtain a unit to be identified; identifying the character information in the unit to be identified to obtain a text to be processed; performing regular matching on the text to be processed and the tree-shaped directory structure information, and determining chapter content according to the matching result; and determining the chapters of the picture electronic document according to the chapter content and the tree-shaped directory structure information. The application also provides a document chapter segmentation device and a storage medium.

Description

Document chapter segmentation method and device and storage medium
Technical Field
The present application relates to the field of information processing, and in particular, to a method and an apparatus for segmenting a document chapter, and a storage medium.
Background
With the continuous development of information technology, electronic books and documents are used ever more widely. A picture electronic document is an electronic document stored in a picture format, produced by photographing or scanning paper books, manuscripts, documents and the like. Because a picture electronic document usually consists of individual pictures, it is difficult for its user to see the overall structure of the document, and it is especially inconvenient to look up the content under each level of chapter title. This makes tasks that depend on the chapter structure of the picture electronic document (such as chapter content classification and chapter content matching) difficult to handle. To obtain the overall structure of a picture electronic document, the content under each level of chapter title needs to be segmented. In the prior art, chapter title content is identified and segmented based on the sparseness of black pixels in each line of the picture, so the accuracy of the segmentation result is low; the chapter title to which each segment belongs must then be confirmed manually, so the efficiency is also low.
Disclosure of Invention
In view of the foregoing technical problems, embodiments of the present application provide a document chapter segmentation method, apparatus, and storage medium, which are used to improve the speed and accuracy of chapter content segmentation of a picture electronic document.
In a first aspect, a document chapter segmentation method provided in an embodiment of the present application includes:
reading a picture electronic document, together with the tree-shaped directory structure information and in-page column information of the picture electronic document;
according to the in-page column information, performing column segmentation on the picture electronic document to obtain a unit to be identified;
identifying the character information in the unit to be identified to obtain a text to be processed;
performing regular matching on the text to be processed and the tree-shaped directory structure information, and determining chapter content according to a matching result;
and determining the chapters of the picture electronic document according to the chapter content and the tree directory structure information.
Further, reading the picture electronic document and the tree-shaped directory structure information and in-page column information of the picture electronic document includes:
reading the picture electronic document, and determining the number of pages and the page number information;
reading tree directory structure information corresponding to the picture electronic document, wherein the tree directory structure information comprises hierarchy information, chapter title information corresponding to the hierarchy information, and separators between chapter titles of the same level;
judging whether the separators are correct, and prompting to re-input the tree directory structure information if they are incorrect;
reading column information corresponding to the picture electronic document;
and judging whether the column information is correct, and prompting to re-input the column information if it is incorrect.
Further, performing column segmentation on the picture electronic document according to the in-page column information to obtain a unit to be identified includes:
for each picture in the picture electronic document, performing the following operations:
reading column information corresponding to the current picture;
and if the column information is less than 1, skipping column segmentation; otherwise, performing column segmentation.
Preferably, the column segmentation includes:
carrying out image binarization processing on the current picture to obtain a first picture;
determining longitudinal black pixel distribution information and column number information in the first picture, and determining position information of the column separator;
and segmenting the current picture according to the position information of the column separator to obtain a unit to be identified.
Further, recognizing the character information in the unit to be identified to obtain the text to be processed includes:
executing the following operations on all units to be identified of the picture electronic document:
identifying the text content in the unit to be identified, and determining the coordinate position of the text content;
storing the coordinate location and the text content.
Further, the performing regular matching on the text to be processed and the tree-shaped directory structure information, and determining the chapter content according to the matching result includes:
judging whether the regular matching is successful, if so, recording chapter title information, the level of the chapter title and the corresponding page position;
recording the content of the chapter according to the chapter title information, the hierarchy of the chapter title and the corresponding page position, and carrying out segmentation operation on the picture corresponding to the chapter;
storing a segmentation result, wherein the segmentation result comprises chapter title information, the content contained in the chapter, the starting page number of the chapter and the ending page number of the chapter.
Preferably, the determining the chapter of the picture electronic document according to the chapter content and the tree directory structure information includes:
sorting all chapters in ascending order of starting page number;
and determining the contents of different chapters according to the tree directory structure information and the arrangement sequence of the initial page numbers.
With the document chapter segmentation method provided by the invention, the text content of the picture electronic document is recognized and matched against the tree-shaped directory structure of the document using regular expressions, which improves the accuracy of chapter content segmentation. The segmentation process requires no manual participation, which improves the efficiency of segmenting the chapters of a picture electronic document.
In a second aspect, an embodiment of the present application further provides a document chapter dividing apparatus, including:
the user input module is used for inputting a picture electronic document, together with the tree-shaped directory structure information and in-page column information corresponding to the picture electronic document;
the page column dividing module is used for dividing the picture electronic document according to the page column information to obtain a unit to be identified;
the optical character recognition module is used for recognizing the character information in the unit to be recognized to obtain a text to be processed and determining the coordinate position of the character information;
the chapter and title matching and dividing module is used for performing regular matching on the text to be processed and the tree-shaped directory structure information and determining chapter contents according to a matching result;
and the segmentation result organization module is used for determining the chapters of the picture electronic document according to the chapter content and the page number information determined by the chapter title matching segmentation module.
In a third aspect, an embodiment of the present application further provides a document chapter dividing apparatus, including: a memory, a processor, and a user interface;
the memory for storing a computer program;
the user interface is used for realizing interaction with a user;
the processor is used for reading the computer program in the memory, and when the processor executes the computer program, the document chapter division method provided by the invention is realized.
In a fourth aspect, an embodiment of the present invention further provides a processor-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the processor implements the document chapter division method provided by the present invention.
By the document chapter division method, the document chapter division device and the storage medium, accuracy and efficiency of picture electronic document chapter division can be improved.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram illustrating a document chapter segmentation method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an information input process provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a column segmentation process according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a text recognition process provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of a regular matching process of text information and tree directory structure information provided in an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating a segmentation result output process provided in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a document segmentation apparatus according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of another document segmentation apparatus provided in an embodiment of the present application;
FIG. 9 is a schematic diagram of a coordinate position of text content according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Some of the words that appear in the text are explained below:
1. The term "and/or" in the embodiments of the present invention describes an association relationship between associated objects and indicates that three relationships may exist. For example, "A and/or B" may indicate: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the objects before and after it are in an "or" relationship.
2. In the embodiments of the present application, the term "plurality" means two or more, and other terms are similar thereto.
3. The picture electronic document refers to an electronic document which is stored in a picture format by means of shooting, scanning and the like on paper books, manuscripts, documents and the like.
4. The tree directory structure information refers to a directory organized in a tree structure, for example:
Chapter one: title 1
  1.1 title 2
    1.1.1 title 3
    1.1.2 title 4
  1.2 title 5
    1.2.1 title 6
    1.2.2 title 7
Chapter two: title 8
Chapter three: title 9
  3.1 title 10
  3.2 title 11
    3.2.1 title 12
      3.2.1.1 title 13
      3.2.1.2 title 14
A directory organized as a tree thus contains both the hierarchical relationship between entries and the title at each level.
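As a rough illustration of item 4, the sketch below builds a tree from directory entries listed in document order. It is not taken from the application: the `Chapter` class, the `build_tree` function, and the rule that an entry's level equals the number of dotted components in its number (so "Chapter one" is written as "1") are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Chapter:
    number: str                      # dotted number, e.g. "1.2.1" (assumed format)
    title: str
    children: list = field(default_factory=list)

    @property
    def level(self):
        # "1" -> level 1, "1.2" -> level 2, "1.2.1" -> level 3
        return self.number.count(".") + 1


def build_tree(entries):
    """entries: (number, title) pairs in document order -> list of top-level chapters."""
    roots, stack = [], []
    for number, title in entries:
        node = Chapter(number, title)
        # Pop until the top of the stack is a strictly shallower entry (the parent).
        while stack and stack[-1].level >= node.level:
            stack.pop()
        (stack[-1].children if stack else roots).append(node)
        stack.append(node)
    return roots
```

The stack always holds the chain of ancestors of the most recent entry, so each new entry attaches to the nearest shallower one.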
5. In-page column information: indicates that column separators exist in the page content of the picture electronic document, dividing one page of information into several columns.
6. Text content coordinate position: the pixel coordinates of the top-left corner of a text object in the image, together with its width and height, as shown in FIG. 9.
7. Illegal value: an input value that the software does not expect, such as a value of the wrong type or an abnormal number. For example, the number of wheels on a car is generally the integer 4, not a decimal such as 4.2 or a character such as X.
8. Legal value: the expected input values for the software.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the display sequence of the embodiment of the present application only represents the sequence of the embodiment, and does not represent the merits of the technical solutions provided by the embodiments.
Example one
Referring to FIG. 1, which is a schematic diagram of a document chapter segmentation method provided in an embodiment of the present application, the method includes steps S101 to S105:
S101, reading a picture electronic document, together with the tree-shaped directory structure information and in-page column information of the picture electronic document;
According to the document chapter segmentation method provided by the invention, the required input comprises the picture electronic document to be processed and the tree-shaped directory structure information and in-page column information corresponding to it. A picture electronic document may be composed of multiple pictures.
It should be noted that the in-page column information may be 0, which means that the page is not divided into columns. Each picture has its own corresponding in-page column information.
As a preferred example, the input flow for the picture electronic document and its tree directory structure information and in-page column information is shown in FIG. 2 and includes:
S201, reading the picture electronic document, and determining the number of pages and the page number information;
S202, reading tree directory structure information corresponding to the picture electronic document, wherein the tree directory structure information comprises hierarchy information, chapter title information corresponding to the hierarchy information, and separators between chapter titles of the same level;
S203, judging whether the separators are correct, and if not, prompting to input the tree directory structure information again;
S204, reading column information corresponding to the picture electronic document;
S205, judging whether the column information is correct, and prompting to re-input the column information if it is incorrect.
It should be noted that whether the column information is correct can be determined by checking whether the value read is an illegal value: if it is, the column information is incorrect and the user is prompted to input it again. A preferred example is given below in connection with the specific steps:
A1, prompting for entry of the picture electronic document, reading it, and determining its basic information. The basic information includes the page numbers and the like.
A2, prompting for entry of the directory structure information and reading it. The directory structure information is divided by level, and the chapter title information must be entered separately for each level. Chapter titles of the same level are separated by a preset separator;
A3, judging whether the separators of the same-level directory entries input by the user are correct, and if so, continuing to execute step A5;
A4, if the separators of the same-level directory entries are judged to be incorrect in step A3, returning to step A2 and prompting to re-input the directory structure information;
A5, prompting for entry of the column information and reading it. The column information is an integer greater than or equal to zero;
A6, judging whether the input column information is a legal value, and if so, continuing to execute step A8;
A7, if the input column information is judged to be an illegal value in step A6, returning to step A5;
A8, storing the input picture electronic document, the directory structure information and the column information.
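The two checks in steps A3 and A6 can be sketched as follows. This is only an illustrative sketch: the function names, the default separator ";", and the rule that a separator is "correct" when splitting on it leaves no empty titles are assumptions, since the application does not define them. The legality check follows the stated requirement that the column information be an integer greater than or equal to zero.

```python
def parse_same_level_titles(raw, separator=";"):
    """Split same-level chapter titles; return None if the separators look wrong.

    Assumed rule: the input is rejected when splitting produces an empty title,
    e.g. a doubled or trailing separator.
    """
    titles = [t.strip() for t in raw.split(separator)]
    if any(not t for t in titles):
        return None
    return titles


def parse_column_info(raw):
    """Return the column information as a non-negative int, or None if illegal."""
    try:
        value = int(raw.strip())   # rejects "4.2", "X", "" and similar
    except ValueError:
        return None
    return value if value >= 0 else None
```

A caller would loop on the prompt until each function returns something other than `None`, mirroring the return to steps A2 and A5.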
S102, performing column segmentation on the picture electronic document according to the in-page column information to obtain a unit to be identified;
It should be noted that column segmentation needs to be performed separately for each picture. As a preferred example, the column segmentation for each picture includes steps S301 to S306, as shown in FIG. 3:
S301, reading the column information corresponding to the current picture;
S302, judging whether the column information is smaller than 1; if so, executing step S306, otherwise executing step S303;
S303, carrying out image binarization processing on the current picture to obtain a first picture;
S304, determining longitudinal black pixel distribution information and column number information in the first picture, and determining position information of the column separator;
S305, segmenting the current picture according to the position information of the column separator to obtain a unit to be identified.
S306, ending.
As a preferred example, the column segmentation process for the whole picture electronic document is given below:
B1, reading the stored column information;
B2, judging whether the number of columns is not less than 1, and if so, continuing to execute step B4;
B3, if the number of columns is judged in step B2 to be less than 1, ending the process. It should be noted that a number of columns less than 1 indicates that the page is not divided into columns, so no segmentation operation is required;
B4, carrying out image binarization processing on the page of the picture electronic document. Image binarization sets the gray value of each pixel on the page image to either 0 or 255, so that the whole page image shows a distinct black-and-white effect;
B5, reading a binarized page image, and counting the distribution of longitudinal black pixels on the page;
B6, judging the position of the column separator on the current page image from the longitudinal black pixel distribution obtained in step B5 together with the column number information;
B7, performing column segmentation on the current page image according to the position of the column separator obtained in step B6, wherein each segmented image is a unit to be processed in the subsequent steps;
B8, storing the units to be processed obtained in step B7;
B9, judging whether all page images of the picture electronic document have undergone column segmentation, and if not, continuing to execute step B11;
B10, if it is judged in step B9 that all page images have been processed, ending the segmentation operation;
B11, updating the page image that currently needs column segmentation, and returning to step B5.
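The binarization and longitudinal black-pixel counting described in steps B4 to B7 can be sketched as below. This is a sketch under stated assumptions, not the application's implementation: the page is a list of rows of gray values, the threshold of 128 is arbitrary, and a column separator is assumed to appear as a blank gutter (a run of pixel columns containing no black pixels) rather than a ruled line. All function names are hypothetical.

```python
def binarize(gray, threshold=128):
    """Set every gray value to 0 (black) or 255 (white)."""
    return [[0 if v < threshold else 255 for v in row] for row in gray]


def column_black_counts(binary):
    """Count black pixels in each pixel column (longitudinal distribution)."""
    return [sum(1 for v in col if v == 0) for col in zip(*binary)]


def find_separators(counts, n_separators):
    """Return the centres of the widest blank runs, excluding page margins."""
    runs, start = [], None
    for x, c in enumerate(counts + [1]):      # sentinel closes a trailing run
        if c == 0 and start is None:
            start = x
        elif c != 0 and start is not None:
            runs.append((start, x))
            start = None
    # Runs touching the left or right edge are margins, not gutters.
    inner = [(s, e) for s, e in runs if s > 0 and e < len(counts)]
    widest = sorted(inner, key=lambda r: r[1] - r[0], reverse=True)[:n_separators]
    return sorted((s + e) // 2 for s, e in widest)


def split_columns(gray, n_separators):
    """Cut the page vertically at each detected separator position."""
    if n_separators < 1:                      # no columns: keep the page whole
        return [gray]
    cuts = find_separators(column_black_counts(binarize(gray)), n_separators)
    bounds = [0] + cuts + [len(gray[0])]
    return [[row[a:b] for row in gray] for a, b in zip(bounds, bounds[1:])]
```

Using the column number information to cap how many gutters are accepted is one way to read the "comprehensive" judgment in step B6; a page whose separators are drawn as vertical lines would need peaks rather than valleys in the counts.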
S103, recognizing the character information in the unit to be identified to obtain a text to be processed;
In this step, the character information in the unit to be identified may be recognized by an end-to-end optical character recognition technique based on a convolutional recurrent neural network (CRNN) and an attention mechanism, or by other techniques; this embodiment imposes no limitation.
As a preferred example, the recognition process is shown in FIG. 4 and includes steps S401 to S405:
S401, reading a unit to be identified;
S402, carrying out optical character recognition on the unit to be identified; in this step, the coordinate position of the text content is also determined.
S403, storing the text content recognition result; in this step, the stored content includes the coordinate position and the text content.
S404, judging whether all units to be identified have been processed; if so, executing S405, otherwise executing S401.
S405, ending.
After this step, the text content and coordinate position information of all units produced by column segmentation have been recognized.
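The loop of steps S401 to S404 amounts to running a recognizer over every unit and storing each piece of text with its coordinate position. The sketch below shows only that bookkeeping. The recognizer itself is passed in as `ocr_engine`, a hypothetical callable standing in for whatever OCR technique is used; the `TextItem` record and its field names are likewise assumptions.

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    page: int      # page the unit came from
    x: int         # pixel coordinates of the text's top-left corner
    y: int
    width: int
    height: int
    text: str


def recognize_units(units, ocr_engine):
    """units: iterable of (page, image) pairs.

    ocr_engine(image) is assumed to yield (x, y, width, height, text) tuples.
    """
    results = []
    for page, image in units:
        for x, y, w, h, text in ocr_engine(image):
            results.append(TextItem(page, x, y, w, h, text))
    return results
```

Note that coordinates returned for a column unit are relative to that unit, so a real pipeline would likely also record the unit's offset within the page.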
S104, performing regular matching on the text to be processed and the tree-shaped directory structure information, and determining chapter contents according to the matching result;
As a preferred example, the regular matching process may include steps S501 to S503, as shown in FIG. 5:
S501, judging whether the regular matching is successful, and if so, recording the chapter title information, the level of the chapter title and the corresponding page position;
S502, recording the content of the chapter according to the chapter title information, the level of the chapter title and the corresponding page position, and carrying out a segmentation operation on the picture corresponding to the chapter; as a preferred example, this segmentation cuts the image horizontally at a given row coordinate, whereas column segmentation cuts the image vertically.
S503, storing a segmentation result, wherein the segmentation result comprises chapter title information, the content contained in the chapter, the starting page number of the chapter and the ending page number of the chapter.
One specific example is given below:
C1, reading the stored directory structure information;
C2, constructing a matching regular expression for each chapter title at each level, according to the chapter title hierarchy information and chapter title content information contained in the directory structure information;
C3, reading the text content information of a unit to be processed;
C4, executing regular matching in the current unit according to the regular expressions obtained in step C2;
C5, judging whether the regular expression matching in the current unit to be processed succeeds; if so, executing step C8, and if not, executing step C6. A successful match indicates that a chapter title has been found in the current unit; a failed match indicates that none has been found.
C6, judging whether all units to be processed have undergone the regular-expression search for chapter titles, and if they all have, ending;
C7, if it is judged in step C6 that not all units have been matched, units remain that require chapter title matching and segmentation, and step C3 is executed again;
C8, if a chapter title is found in the current unit to be processed in step C5, recording the chapter title information, the chapter level and the page position of the title;
C9, judging, from the chapter title information, chapter level and page position recorded in step C8, whether the content of chapters at some levels can now be determined, and if so, continuing to execute step C11;
C10, if it is judged in step C9 that no chapter content at any level can be determined, returning to step C6;
C11, recording the chapter content determined in step C9, and executing the page picture unit segmentation operation;
C12, storing the segmentation result, which comprises the chapter title, the content contained under the title, the starting page number, the ending page number and the like, and returning to step C6.
After the regular matching, the title search and matching process for all units to be processed is complete.
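Steps C2 to C8 can be sketched with Python's `re` module as follows. The application does not give the form of its regular expressions, so the pattern used here is an assumption: the chapter number followed by the title words, tolerant of varying whitespace, and required to fill the whole recognized line so that a mention of a title inside body text is not mistaken for the heading. The function names and the tuple layouts are hypothetical.

```python
import re


def build_title_patterns(directory):
    """directory: list of (level, number, title) -> list of (level, label, regex)."""
    patterns = []
    for level, number, title in directory:
        words = r"\s*".join(re.escape(w) for w in title.split())
        regex = re.compile(r"^\s*" + re.escape(number) + r"\s*" + words + r"\s*$")
        patterns.append((level, f"{number} {title}", regex))
    return patterns


def find_titles(text_lines, patterns):
    """text_lines: list of (page, y, text) in reading order -> list of title hits."""
    hits = []
    for page, y, text in text_lines:
        for level, label, regex in patterns:
            if regex.match(text):
                # Record title, level and page position, as in step C8.
                hits.append({"title": label, "level": level, "page": page, "y": y})
                break
    return hits
```

Escaping the number matters: without `re.escape`, the dots in "1.1" would match any character. Real OCR output may contain recognition errors, which an exact pattern like this one would not tolerate.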
S105, determining the chapters of the picture electronic document according to the chapter content and the tree directory structure information.
In this step, the segmentation results may be reorganized according to the directory structure or by page.
Organizing the segmentation results according to the directory structure may include:
sorting all chapters in ascending order of starting page number;
and determining the contents of the different chapters according to the tree directory structure information and the order of the starting page numbers. In this embodiment, once the contents of the different chapters are determined, the output is organized according to the directory; that is, the directory information of the picture electronic document is organized and obtained from the mapping between the contents and the chapter numbers.
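One possible reading of this organizing step is sketched below: title hits are sorted by starting position, and each chapter is taken to run until the next title at the same or a shallower level. That ending rule, the `chapter_ranges` name, and the dictionary keys are assumptions for illustration; the application states only that chapters are sorted by page number and resolved against the tree directory structure.

```python
def chapter_ranges(hits, last_page):
    """hits: dicts with 'title', 'level', 'page' and optional 'y'.

    Returns one record per chapter with its starting and ending page numbers.
    """
    ordered = sorted(hits, key=lambda h: (h["page"], h.get("y", 0)))
    result = []
    for i, h in enumerate(ordered):
        end = last_page                       # default: runs to the end of the document
        for nxt in ordered[i + 1:]:
            if nxt["level"] <= h["level"]:    # next sibling or shallower title
                end = nxt["page"]
                break
        result.append({"title": h["title"],
                       "start_page": h["page"],
                       "end_page": end})
    return result
```

Here the ending page is the page on which the next title appears, because a chapter may finish partway down that page; the horizontal cut at the title's row coordinate would then separate the two.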
With this method, the picture document is first segmented into columns, the character information is then recognized and its coordinate position determined, and the character information is finally matched against the tree-shaped directory structure information using regular expressions, so that chapter content is segmented efficiently.
The method provided by the invention requires no manual participation in the segmentation process, which improves segmentation efficiency. Because the text content of the images is recognized automatically and matched strictly against the tree-shaped directory structure by regular expressions, the accuracy of chapter segmentation is also improved.
Example two
Based on the same inventive concept, an embodiment of the present invention further provides a document chapter dividing apparatus, as shown in fig. 7, the apparatus includes:
a user input module 701, configured to input a picture electronic document, together with the tree-shaped directory structure information and in-page column information corresponding to the picture electronic document;
an in-page column segmentation module 702, configured to segment the electronic picture document according to the in-page column information to obtain a unit to be identified;
the optical character recognition module 703 is configured to recognize text information in the unit to be recognized, obtain a text to be processed, and determine a coordinate position of the text information;
the section header matching and segmenting module 704 is used for performing regular matching on the text to be processed and the tree-shaped directory structure information and determining section contents according to a matching result;
and the segmentation result organizing module 705 is used for determining the chapters of the picture electronic document according to the chapter content and the page number information determined by the chapter title matching and segmenting module.
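The data flow between the five modules can be sketched as below. This is only an illustrative wiring under stated assumptions: the class name `ChapterSegmenter`, its field names, and the toy stand-in stages in the usage lines are hypothetical, and a real system would plug in actual image splitting and OCR in place of the string operations used here.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass
class ChapterSegmenter:
    """Hypothetical wiring of modules 702 to 705; every stage is injected."""
    split_columns: Callable[[Any, int], List[Any]]   # module 702
    recognize: Callable[[Any], str]                  # module 703 (OCR)
    match_titles: Callable[[List[Tuple[int, str]], Sequence[str]], List[Any]]  # module 704
    organize: Callable[[List[Any]], List[Any]]       # module 705

    def run(self, pages, titles, columns_per_page):
        """pages, titles and columns_per_page stand in for the module 701 inputs."""
        texts: List[Tuple[int, str]] = []
        for page_no, (page, n_cols) in enumerate(zip(pages, columns_per_page), start=1):
            for unit in self.split_columns(page, n_cols):
                texts.append((page_no, self.recognize(unit)))
        hits = self.match_titles(texts, titles)
        return self.organize(hits)


# Toy stand-ins: a "page" is a string and "|" marks a column boundary.
segmenter = ChapterSegmenter(
    split_columns=lambda page, n: page.split("|") if n > 1 else [page],
    recognize=lambda unit: unit.strip(),
    match_titles=lambda texts, titles: [
        (t, p) for p, text in texts for t in titles if text.startswith(t)
    ],
    organize=lambda hits: sorted(hits, key=lambda h: h[1]),
)
result = segmenter.run(["1 Intro | body", "2 Method"], ["1 Intro", "2 Method"], [2, 1])
```

Injecting each stage as a callable mirrors the note that the module division is only a logical one: any stage can be swapped without touching the others.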
It should be noted that the user input module 701 provided in this embodiment can implement the information input process shown in fig. 2, solves the same technical problem, and achieves the same technical effect, so it is not described again here.
Correspondingly, the in-page column segmentation module 702 provided in this embodiment can implement all the functions of in-page column segmentation shown in fig. 3, solves the same technical problem, and achieves the same technical effect, so it is not described again here.
Correspondingly, the optical character recognition module 703 provided in this embodiment can implement all the functions of optical character recognition shown in fig. 4, solves the same technical problem, and achieves the same technical effect, so it is not described again here.
Correspondingly, the chapter title matching and segmentation module 704 provided in this embodiment can implement the regular-expression matching process shown in fig. 5, solves the same technical problem, and achieves the same technical effect, so it is not described again here.
Correspondingly, the segmentation result organization module 705 provided in this embodiment can implement all the functions of segmentation result organization shown in fig. 6, solves the same technical problem, and achieves the same technical effect, so it is not described again here.
It should be noted that the apparatus provided in the second embodiment and the method provided in the first embodiment belong to the same inventive concept, solve the same technical problem, and achieve the same technical effect, and the apparatus provided in the second embodiment can implement all the methods of the first embodiment, and the same parts are not described again.
Example three
Based on the same inventive concept, an embodiment of the present invention further provides a document chapter division apparatus. As shown in fig. 8, the apparatus includes a memory 802, a processor 801, and a user interface 803;
the memory 802 is used for storing a computer program;
the user interface 803 is used for realizing interaction with a user;
the processor 801 is configured to read the computer program in the memory 802, and when executing the computer program, the processor 801 implements:
reading a picture electronic document, together with the tree directory structure information and in-page column information of the picture electronic document;
according to the in-page column information, performing column segmentation on the picture electronic document to obtain a unit to be identified;
identifying the character information in the unit to be identified to obtain a text to be processed;
performing regular-expression matching between the text to be processed and the tree directory structure information, and determining chapter content according to the matching result;
and determining the chapters of the picture electronic document according to the chapter content and the tree directory structure information.
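The regular-expression matching step can be illustrated with a short sketch. The function names, the `(number, title, level)` directory tuples, the `(page, text)` line tuples, and the tolerated separator characters (whitespace, ".", "、") are assumptions chosen for the example; the source only states that recognized text is matched against the tree directory structure information.

```python
import re
from typing import Dict, List, Tuple


def build_title_pattern(number: str, title: str) -> re.Pattern:
    """Build a whole-line pattern tolerant of spacing differences left by OCR."""
    words = r"\s*".join(re.escape(word) for word in title.split())
    # Assumed separators between chapter number and title: spaces, "." or "、".
    return re.compile(r"^\s*" + re.escape(number) + r"[\s.、]*" + words + r"\s*$")


def match_chapter_titles(
    lines: List[Tuple[int, str]],
    directory: List[Tuple[str, str, int]],
) -> List[Dict[str, object]]:
    """Return one record per directory entry whose title line was found."""
    hits = []
    for number, title, level in directory:
        pattern = build_title_pattern(number, title)
        for page, text in lines:
            if pattern.match(text):
                hits.append({"number": number, "title": title, "level": level, "page": page})
                break  # keep only the first occurrence of each title
    return hits


ocr_lines = [(1, "1  Overview"), (1, "This system ..."), (2, "1.1 Scope"), (3, "2 Design")]
directory = [("1", "Overview", 1), ("1.1", "Scope", 2), ("2", "Design", 1), ("3", "Missing", 1)]
hits = match_chapter_titles(ocr_lines, directory)
```

Anchoring the pattern to the whole line keeps body text that merely mentions a chapter title from being mistaken for the title itself; entries with no matching line are simply absent from the result.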
In fig. 8, the bus architecture may include any number of interconnected buses and bridges, linking together various circuits including one or more processors, represented by the processor 801, and memory, represented by the memory 802. The bus architecture may also link together various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art and are therefore not described further here. The bus interface provides an interface. The processor 801 is responsible for managing the bus architecture and general processing, and the memory 802 may store data used by the processor 801 in performing operations.
The processor 801 may be a CPU, ASIC, FPGA or CPLD, and the processor 801 may also employ a multi-core architecture.
The processor 801, when executing the computer program stored in the memory 802, implements any of the document chapter division methods shown in fig. 1 to 6.
It should be noted that the apparatus provided in the third embodiment and the method provided in the first embodiment belong to the same inventive concept, solve the same technical problem, and achieve the same technical effect, and the apparatus provided in the third embodiment can implement all the methods of the first embodiment, and the same parts are not described again.
The present application also proposes a processor-readable storage medium. The processor-readable storage medium stores a computer program, and when a processor executes the computer program, it implements any one of the document chapter division methods shown in fig. 1 to 6.
It should be noted that the division into units in the embodiments of the present application is schematic and is only a division by logical function; other divisions are possible in an actual implementation. In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit can be realized in the form of hardware or in the form of a software functional unit.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A document chapter division method, comprising:
reading a picture electronic document, together with the tree directory structure information and in-page column information of the picture electronic document;
according to the in-page column information, performing column segmentation on the picture electronic document to obtain a unit to be identified;
identifying the character information in the unit to be identified to obtain a text to be processed;
performing regular-expression matching between the text to be processed and the tree directory structure information, and determining chapter content according to the matching result;
and determining the chapters of the picture electronic document according to the chapter content and the tree directory structure information.
2. The method of claim 1, wherein the reading of the picture electronic document, together with the tree directory structure information and in-page column information of the picture electronic document, comprises:
reading the picture electronic document, and determining the number of pages and the page number information;
reading the tree directory structure information corresponding to the picture electronic document, wherein the tree directory structure information comprises hierarchy information, chapter title information corresponding to the hierarchy information, and separators between chapter titles of the same level;
judging whether the separator is correct or not, and prompting to re-input the tree directory structure information if the separator is incorrect;
reading column information corresponding to the picture electronic document;
and judging whether the column information is correct or not, and prompting to re-input the column information if the column information is incorrect.
3. The method of claim 1, wherein the performing column segmentation on the picture electronic document according to the in-page column information to obtain a unit to be identified comprises:
for each picture in the picture electronic document, performing the following operations:
reading column information corresponding to a current picture;
and if the number of the columns of the column information is less than 1, not executing column segmentation, otherwise, executing column segmentation.
4. The method of claim 3, wherein the column segmentation comprises:
carrying out image binarization processing on the current picture to obtain a first picture;
determining the longitudinal black-pixel distribution information and column number information in the first picture, and determining position information of a column separator;
and segmenting the current picture according to the position information of the column separator to obtain a unit to be identified.
5. The method of claim 1, wherein the identifying the character information in the unit to be identified to obtain a text to be processed comprises:
executing the following operations on all units to be identified of the picture electronic document:
identifying the text content in the unit to be identified, and determining the coordinate position of the text content;
storing the coordinate location and the text content.
6. The method according to claim 1, wherein the performing regular-expression matching between the text to be processed and the tree directory structure information, and determining chapter content according to the matching result, comprises:
judging whether the regular-expression matching is successful, and if so, recording the chapter title information, the level of the chapter title, and the corresponding page position;
recording the content of the chapter according to the chapter title information, the level of the chapter title, and the corresponding page position, and performing a segmentation operation on the pictures corresponding to the chapter;
storing a segmentation result, wherein the segmentation result comprises the chapter title information, the extracted content of the chapter, the starting page number of the chapter, and the ending page number of the chapter.
7. The method of claim 6, wherein the determining the chapters of the picture electronic document according to the chapter content and the tree directory structure information comprises:
sorting all chapters in ascending order of starting page number;
and determining the contents of the different chapters according to the tree directory structure information and the order of the starting page numbers.
8. A document chapter division apparatus, comprising:
a user input module, configured to input a picture electronic document, together with the tree directory structure information and in-page column information corresponding to the picture electronic document;
an in-page column segmentation module, configured to segment the picture electronic document according to the in-page column information to obtain a unit to be identified;
an optical character recognition module, configured to recognize the character information in the unit to be identified, obtain a text to be processed, and determine the coordinate position of the character information;
a chapter title matching and segmentation module, configured to perform regular-expression matching between the text to be processed and the tree directory structure information, and to determine chapter content according to the matching result;
and a segmentation result organization module, configured to determine the chapters of the picture electronic document according to the chapter content and page number information determined by the chapter title matching and segmentation module.
9. A document chapter segmentation apparatus comprising a memory, a processor, and a user interface;
the memory for storing a computer program;
the user interface is used for realizing interaction with a user;
the processor, configured to read the computer program in the memory, and when the processor executes the computer program, implement the document chapter division method according to one of claims 1 to 8.
10. A processor-readable storage medium, characterized in that the processor-readable storage medium stores a computer program which, when executed by a processor, implements a document chapter division method according to one of claims 1 to 8.
CN202011106303.3A 2020-10-16 2020-10-16 Document chapter segmentation method and device and storage medium Pending CN112329548A (en)

Publications (1)

Publication Number Publication Date
CN112329548A true CN112329548A (en) 2021-02-05

Family

ID=74313851

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination