CN112329548A - Document chapter segmentation method and device and storage medium

Info

Publication number: CN112329548A
Application number: CN202011106303.3A
Authority: CN
Inventors: 薛晗庆; 潘红九; 李昊星; 陈超; 窦小明; 施卫科; 雷净; 李萌萌; 杨飞; 尹琼; 底亚峰; 皮彬睿
Original assignee: Beijing Institute of Near Space Vehicles System Engineering
Current assignee: Beijing Institute of Near Space Vehicles System Engineering
Priority date: 2020-10-16
Filing date: 2020-10-16
Publication date: 2021-02-05

Abstract

The application discloses a document chapter segmentation method, a document chapter segmentation device and a storage medium, which are used for improving the speed and accuracy of chapter content segmentation of a picture electronic document. The document chapter segmentation method provided by the application comprises the following steps: reading a picture electronic document, together with the tree directory structure information and the in-page column information of the picture electronic document; performing column segmentation on the picture electronic document according to the in-page column information to obtain units to be identified; identifying the character information in the units to be identified to obtain a text to be processed; performing regular-expression matching between the text to be processed and the tree directory structure information, and determining chapter content according to the matching result; and determining the chapters of the picture electronic document according to the chapter content and the tree directory structure information. The application also provides a document chapter segmentation device and a storage medium.

Description

Document chapter segmentation method and device and storage medium

Technical Field

The present application relates to the field of information processing, and in particular, to a method and an apparatus for segmenting a document chapter, and a storage medium.

Background

With the continuous development of information technology, electronic books and documents are used more and more widely. A picture electronic document is an electronic document stored in a picture format, obtained by photographing or scanning paper books, manuscripts, documents and the like. A picture electronic document is usually composed of individual page pictures, so it is difficult for a user to grasp the overall structure of the document, and it is especially inconvenient to look up the content under each level of chapter title. This makes tasks that depend on the chapter structure of the picture electronic document (such as chapter content classification and chapter content matching) difficult to handle. To obtain the overall structure of the picture electronic document, the content under each level of chapter title needs to be segmented. In the prior art, chapter title content is determined and segmented based on how sparse each row of black pixels is in the picture content; the accuracy of the segmentation result is low, and the chapter title to which each segmentation result belongs has to be confirmed manually after segmentation, so efficiency is low.

Disclosure of Invention

In view of the foregoing technical problems, embodiments of the present application provide a document chapter segmentation method, apparatus, and storage medium, which are used for improving the speed and accuracy of chapter content segmentation of a picture electronic document.

In a first aspect, a document chapter segmentation method provided in an embodiment of the present application includes:

reading a picture electronic document, together with the tree directory structure information and the in-page column information of the picture electronic document;

according to the in-page column information, performing column segmentation on the picture electronic document to obtain a unit to be identified;

identifying the character information in the unit to be identified to obtain a text to be processed;

performing regular matching on the text to be processed and the tree-shaped directory structure information, and determining chapter content according to a matching result;

and determining the chapters of the picture electronic document according to the chapter content and the tree directory structure information.

Further, the reading of the picture electronic document, and of the tree directory structure information and the in-page column information of the picture electronic document, includes:

reading the picture electronic document, and determining the number of pages and page number information;

reading tree directory structure information corresponding to the picture electronic document, wherein the tree directory structure information comprises hierarchy information, chapter title information corresponding to the hierarchy information, and separators between chapter titles of the same level;

judging whether the separator is correct or not, and prompting to re-input the tree directory structure information if the separator is incorrect;

reading column information corresponding to the picture electronic document;

and judging whether the column information is correct or not, and prompting to re-input the column information if the column information is incorrect.

Further, the performing, according to the in-page column information, column segmentation on the picture electronic document to obtain a unit to be recognized includes:

for each picture in the picture electronic document, performing the following operations:

reading column information corresponding to a current picture;

and if the number of the columns of the column information is less than 1, not executing column segmentation, otherwise, executing column segmentation.

Preferably, the column segmentation includes:

carrying out image binarization processing on the current picture to obtain a first picture;

determining longitudinal black pixel distribution information and column number information in the first picture, and determining position information of the column separator;

and segmenting the current picture according to the position information of the column separator to obtain a unit to be identified.

Further, the recognizing the text information in the unit to be recognized to obtain the text to be processed includes:

executing the following operations on all units to be identified of the picture electronic document:

identifying the text content in the unit to be identified, and determining the coordinate position of the text content;

storing the coordinate location and the text content.

Further, the performing regular matching on the text to be processed and the tree-shaped directory structure information, and determining the chapter content according to the matching result includes:

judging whether the regular matching is successful, if so, recording chapter title information, the level of the chapter title and the corresponding page position;

recording the content of the chapter according to the chapter title information, the hierarchy of the chapter title and the corresponding page position, and carrying out segmentation operation on the picture corresponding to the chapter;

storing a segmentation result, wherein the segmentation result comprises the chapter title information, the content contained in the chapter, the starting page number of the chapter and the ending page number of the chapter.

Preferably, the determining the chapter of the picture electronic document according to the chapter content and the tree directory structure information includes:

sorting all chapters in ascending order according to the page number;

and determining the contents of different chapters according to the tree directory structure information and the arrangement sequence of the initial page numbers.

By using the document chapter segmentation method provided by the invention, the text content information in the electronic picture document can be identified, and the text content information is regularly matched with the tree-shaped directory structure of the electronic picture document, so that the accuracy of chapter content segmentation is improved. By using the document chapter division method provided by the invention, the division process does not need manual participation, and the efficiency of dividing the electronic picture document chapter is improved.

In a second aspect, an embodiment of the present application further provides a document chapter dividing apparatus, including:

a user input module, configured to input a picture electronic document, and the tree directory structure information and in-page column information corresponding to the picture electronic document;

the page column dividing module is used for dividing the picture electronic document according to the page column information to obtain a unit to be identified;

the optical character recognition module is used for recognizing the character information in the unit to be recognized to obtain a text to be processed and determining the coordinate position of the character information;

the chapter and title matching and dividing module is used for performing regular matching on the text to be processed and the tree-shaped directory structure information and determining chapter contents according to a matching result;

and the segmentation result organization module is used for determining the chapters of the picture electronic document according to the chapter content and the page number information determined by the chapter title matching segmentation module.

In a third aspect, an embodiment of the present application further provides a document chapter segmentation apparatus, including: a memory, a processor, and a user interface;

the memory for storing a computer program;

the user interface is used for realizing interaction with a user;

the processor is used for reading the computer program in the memory, and when the processor executes the computer program, the document chapter division method provided by the invention is realized.

In a fourth aspect, an embodiment of the present invention further provides a processor-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the processor implements the document chapter division method provided by the present invention.

By the document chapter division method, the document chapter division device and the storage medium, accuracy and efficiency of picture electronic document chapter division can be improved.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram illustrating a document chapter segmentation method according to an embodiment of the present application;

FIG. 2 is a schematic diagram of an information input process provided in an embodiment of the present application;

FIG. 3 is a schematic diagram of a column segmentation process according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a text recognition process provided in an embodiment of the present application;

FIG. 5 is a schematic diagram of a regular matching process between text information and tree directory structure information provided in an embodiment of the present application;

FIG. 6 is a schematic diagram illustrating a segmentation result output process provided in an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a document segmentation apparatus according to an embodiment of the present application;

FIG. 8 is a schematic structural diagram of another document segmentation apparatus provided in an embodiment of the present application;

FIG. 9 is a schematic diagram of a coordinate position of text content according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Some of the words that appear in the text are explained below:

1. The term "and/or" in the embodiments of the present invention describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may indicate: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.

2. In the embodiments of the present application, the term "plurality" means two or more, and other terms are similar thereto.

3. The picture electronic document refers to an electronic document which is stored in a picture format by means of shooting, scanning and the like on paper books, manuscripts, documents and the like.

4. The tree directory structure information refers to a directory organized in a tree structure, for example:

Chapter one: title 1
  1.1 title 2
    1.1.1 title 3
    1.1.2 title 4
  1.2 title 5
    1.2.1 title 6
    1.2.2 title 7
Chapter two: title 8
Chapter three: title 9
  3.1 title 10
  3.2 title 11
    3.2.1 title 12
      3.2.1.1 title 13
      3.2.1.2 title 14

A directory organized in this tree structure contains both the hierarchical relationship between directory entries and the title at each level.
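As an illustration only, a directory of this kind could be held in memory as a small tree. The sketch below assumes each entry is supplied as a (level, title) pair in reading order; the `DirNode` class and the `build_tree` function are hypothetical names introduced here, not part of the application.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DirNode:
    """One chapter title in the tree directory (hypothetical structure)."""
    level: int                      # 1 = top-level chapter, 2 = section, ...
    title: str
    children: List["DirNode"] = field(default_factory=list)


def build_tree(entries: List[Tuple[int, str]]) -> List[DirNode]:
    """Build a forest of DirNode objects from (level, title) pairs in reading order."""
    roots: List[DirNode] = []
    stack: List[DirNode] = []       # path from a root down to the most recent node
    for level, title in entries:
        node = DirNode(level, title)
        # Pop until the top of the stack is a shallower level, i.e. the parent.
        while stack and stack[-1].level >= level:
            stack.pop()
        if stack:
            stack[-1].children.append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots


entries = [
    (1, "Chapter one: title 1"),
    (2, "1.1 title 2"),
    (3, "1.1.1 title 3"),
    (3, "1.1.2 title 4"),
    (2, "1.2 title 5"),
    (1, "Chapter two: title 8"),
]
tree = build_tree(entries)
```

Keeping the level on every node makes it straightforward to later decide which titles close which chapters.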

5. Column information: indicates that column separators exist in the page content of the picture electronic document and divide one page of content into a plurality of columns.

6. Text content coordinate position: the pixel coordinates of the top-left corner of a character object appearing in the image, together with the width and height of the character, as shown in FIG. 9.

7. Illegal value: an input value that the software does not expect, such as a value of the wrong type or an abnormal number. For example, the number of wheels of a car is generally the integer 4, not a decimal such as 4.2 or a character such as X.

8. Legal value: the expected input values for the software.

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that the display sequence of the embodiment of the present application only represents the sequence of the embodiment, and does not represent the merits of the technical solutions provided by the embodiments.

Example one

Referring to FIG. 1, which is a schematic diagram of a document chapter segmentation method provided in an embodiment of the present application, the method includes steps S101 to S105:

S101, reading a picture electronic document, together with the tree directory structure information and the in-page column information of the picture electronic document;

According to the document chapter segmentation method provided by the invention, the required input information comprises the picture electronic document to be processed, and the tree directory structure information and the in-page column information corresponding to the picture electronic document. A picture electronic document may be composed of multiple pictures.

It should be noted that the in-page column information may be 0, which means that the page is not divided into columns. Each picture has corresponding in-page column information.

As a preferred example, the input flow of the picture electronic document, the tree directory structure information and the column information in the page of the picture electronic document is shown in fig. 2, and includes:

S201, reading the picture electronic document, and determining the number of pages and page number information;

S202, reading tree directory structure information corresponding to the picture electronic document, wherein the tree directory structure information comprises hierarchy information, chapter title information corresponding to the hierarchy information, and separators between chapter titles of the same level;

S203, judging whether the separators are correct, and if not, prompting to re-input the tree directory structure information;

S204, reading column information corresponding to the picture electronic document;

S205, judging whether the column information is correct, and if not, prompting to re-input the column information.

It should be noted that whether the column information is correct can be determined by checking whether the read value is an illegal value: if the read value is an illegal value, the column information is incorrect, and a prompt is given to input the column information again. A preferred example is given below in connection with the specific steps:

A1, prompting to enter the picture electronic document, reading the picture electronic document, and determining the basic information of the picture electronic document. The basic information includes the number of pages and the like;

A2, prompting to enter the directory structure information and reading the directory structure information. The directory structure information is divided by level, and the corresponding chapter title information needs to be entered level by level. Chapter titles of the same level are separated by a preset separator;

A3, judging whether the separator of the same-level directory input by the user is correct, and if so, continuing to execute step A5;

A4, if the separator of the same-level directory is judged to be incorrect in step A3, returning to step A2 and prompting to re-input the directory structure information;

A5, prompting to enter the column information and reading the column information. The column information is an integer value greater than or equal to zero;

A6, judging whether the input column information is a legal value, and if so, continuing to execute step A8;

A7, if the input column information is judged to be an illegal value in step A6, returning to step A5;

A8, storing the input picture electronic document, the directory structure information and the column information.
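The checks in steps A3 and A6 can be sketched as two small validation functions. The text does not say how separator correctness is judged, so the rule used here (no empty title between separators) and the `;` separator are assumptions, as are the function names `parse_level_titles` and `parse_column_info`.

```python
from typing import List

SEPARATOR = ";"  # assumed preset separator between same-level chapter titles


def parse_level_titles(raw: str, separator: str = SEPARATOR) -> List[str]:
    """Split one level's input into titles; raise ValueError if the separator looks wrong."""
    titles = [t.strip() for t in raw.split(separator)]
    # Assumed rule: an empty piece means a doubled, leading or trailing separator.
    if any(not t for t in titles):
        raise ValueError("incorrect separator; please re-enter the directory structure")
    return titles


def parse_column_info(raw: str) -> int:
    """Return the column information as an integer >= 0; raise ValueError for an illegal value."""
    text = raw.strip()
    # Rejects values such as "4.2", "X", "-1" and empty input.
    if not (text.isascii() and text.isdigit()):
        raise ValueError("illegal column value; please re-enter the column information")
    return int(text)


level_two = parse_level_titles("1.1 title 2; 1.2 title 5")
columns = parse_column_info("2")
```

A caller would catch `ValueError` and prompt again, which corresponds to returning to step A2 or step A5.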

S102, performing column segmentation on the picture electronic document according to the in-page column information to obtain a unit to be identified;

It should be noted that the column segmentation needs to be performed separately for each picture. As a preferred example, the column segmentation for each picture includes steps S301 to S306 as shown in FIG. 3:

S301, reading column information corresponding to the current picture;

S302, judging whether the column information is smaller than 1, if so, executing step S306, otherwise, executing step S303;

S303, carrying out image binarization processing on the current picture to obtain a first picture;

S304, determining longitudinal black pixel distribution information and column number information in the first picture, and determining position information of the column separator;

S305, segmenting the current picture according to the position information of the column separator to obtain a unit to be identified;

S306, ending.

As a preferred example, the following gives the field segmentation process of the whole picture electronic document:

B1, reading the stored column information;

B2, judging whether the number of columns is not less than 1, and if so, continuing to execute step B4;

B3, if the number of columns is judged to be less than 1 in step B2, ending the process. It should be noted that a number of columns less than 1 indicates that the page is not divided into columns, so no segmentation operation is required;

B4, carrying out image binarization processing on the page of the picture electronic document. Image binarization sets the gray value of each pixel on the page image to 0 or 255, so that the whole page image becomes purely black and white;

B5, reading the binarized page image, and counting the distribution of longitudinal black pixels on the page;

B6, determining the position information of the column separator on the current page image by jointly considering the longitudinal black pixel distribution obtained in step B5 and the column number information;

B7, performing column segmentation on the current page image according to the position information of the column separator obtained in step B6, wherein each segmented image is a unit to be processed in the subsequent steps;

B8, storing the units to be processed obtained in step B7;

B9, judging whether all page images of the picture electronic document have been subjected to column segmentation, and if not, continuing to execute step B11;

B10, if it is judged in step B9 that all page images have been processed, ending the segmentation operation;

B11, updating the page image currently to be processed by column segmentation, and returning to step B5.
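The per-page part of this flow (binarize, count black pixels per pixel column, locate the separator, cut) can be sketched in plain Python. Several points are assumptions rather than statements from the text: the page is a list of rows of gray values, the column separator is a near-full-height vertical rule, the threshold is 128, and the column information is treated as the maximum number of separators to keep. All function names are hypothetical.

```python
from typing import List

Image = List[List[int]]  # rows of gray values in 0..255


def binarize(img: Image, threshold: int = 128) -> Image:
    """Set every pixel to 0 (black) or 255 (white)."""
    return [[0 if p < threshold else 255 for p in row] for row in img]


def black_per_column(img: Image) -> List[int]:
    """Count the black pixels in each vertical pixel column."""
    return [sum(1 for row in img if row[x] == 0) for x in range(len(img[0]))]


def find_separators(counts: List[int], height: int,
                    max_separators: int, min_fill: float = 0.9) -> List[int]:
    """Return the centre x of each run of pixel columns that is almost entirely black."""
    runs = []
    start = None
    for x, count in enumerate(counts):
        dense = count >= min_fill * height
        if dense and start is None:
            start = x
        elif not dense and start is not None:
            runs.append((start, x - 1))
            start = None
    if start is not None:
        runs.append((start, len(counts) - 1))
    return [(a + b) // 2 for a, b in runs][:max_separators]


def split_columns(img: Image, separators: List[int]) -> List[Image]:
    """Cut the picture vertically, dropping the pixel column at each separator centre."""
    units = []
    left = 0
    for x in separators + [len(img[0])]:
        if x > left:
            units.append([row[left:x] for row in img])
        left = x + 1
    return units


def segment_page(img: Image, column_info: int) -> List[Image]:
    """Column information below 1 means no segmentation; otherwise binarize, locate, cut."""
    if column_info < 1:
        return [img]
    first = binarize(img)
    separators = find_separators(black_per_column(first), len(first), column_info)
    return split_columns(img, separators)


# A 4 x 7 page with a vertical rule at x = 3 and one dark text pixel.
page = [[255, 255, 255, 0, 255, 255, 255] for _ in range(4)]
page[1][1] = 40
units = segment_page(page, 1)
```

On real scans a library routine for thresholding and projection would normally replace these loops; the pure-Python form is used only to keep the sketch self-contained.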

S103, recognizing the character information in the unit to be recognized to obtain a text to be processed;

In this step, the text information in the unit to be recognized may be recognized by an end-to-end optical character recognition technique based on a Convolutional Recurrent Neural Network (CRNN) and an attention mechanism, or by other techniques; this embodiment is not limited in this respect.

As a preferred example, the recognition process is shown in fig. 4, and includes steps S401 to S405:

S401, reading a unit to be identified;

S402, carrying out optical character recognition on the unit to be identified; in this step, the coordinate position of the text content is also determined;

S403, storing the text content recognition result; in this step, the stored content includes the coordinate position and the text content;

S404, judging whether all the units to be identified have been processed; if so, executing S405, and otherwise, returning to S401;

S405, ending.

After the processing of the step, the character content and the coordinate position information of all the units to be processed after the column segmentation are identified.
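The loop of steps S401 to S405 and the stored result (text plus top-left coordinates, width and height) could be organized as below. This is a sketch under stated assumptions: the recognizer is passed in as a callable, and `fake_recognizer` is a stand-in used only to make the example runnable, not a real OCR engine. `TextBox`, `UnitResult` and `recognize_all` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class TextBox:
    """Recognized text plus its coordinate position within the unit."""
    text: str
    x: int        # pixel column of the top-left corner
    y: int        # pixel row of the top-left corner
    width: int
    height: int


@dataclass
class UnitResult:
    unit_index: int
    boxes: List[TextBox]


def recognize_all(units: Sequence[Sequence[str]],
                  recognizer: Callable[[Sequence[str]], List[TextBox]]) -> List[UnitResult]:
    """Run the recognizer on every unit and store text together with coordinates."""
    results = []
    for index, unit in enumerate(units):          # S401 / S404: until all units are processed
        boxes = recognizer(unit)                  # S402: recognition, including coordinates
        results.append(UnitResult(index, boxes))  # S403: store text content and coordinates
    return results


def fake_recognizer(unit: Sequence[str]) -> List[TextBox]:
    """Stand-in for OCR: pretends the unit is already a list of text lines."""
    return [TextBox(text=line, x=0, y=20 * i, width=10 * len(line), height=18)
            for i, line in enumerate(unit)]


results = recognize_all([["1.1 title 2", "body text"], ["more text"]], fake_recognizer)
```

Passing the recognizer as a parameter reflects the statement that the embodiment is not limited to one recognition technique.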

S104, performing regular matching on the text to be processed and the tree-shaped directory structure information, and determining chapter contents according to a matching result;

As a preferred example, the regular matching process may include steps S501 to S503 as shown in FIG. 5:

S501, judging whether the regular matching is successful, and if so, recording the chapter title information, the level of the chapter title and the corresponding page position;

S502, recording the content of the chapter according to the chapter title information, the level of the chapter title and the corresponding page position, and carrying out a segmentation operation on the picture corresponding to the chapter; as a preferred example, this segmentation cuts the image horizontally at a given row coordinate position, whereas the column segmentation cuts the image vertically;

S503, storing a segmentation result, wherein the segmentation result comprises the chapter title information, the content contained in the chapter, the starting page number of the chapter and the ending page number of the chapter.
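The difference between the two cutting directions can be shown with a toy image. This is only an illustration, assuming the image is a list of pixel rows; `cut_at_row` and `cut_at_column` are hypothetical helper names.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of gray values


def cut_at_row(img: Image, row: int) -> Tuple[Image, Image]:
    """Horizontal cut: rows above `row` form the first part, the rest the second."""
    return img[:row], img[row:]


def cut_at_column(img: Image, col: int) -> Tuple[Image, Image]:
    """Vertical cut: pixel columns left of `col` form the first part, the rest the second."""
    return [r[:col] for r in img], [r[col:] for r in img]


page = [[y * 10 + x for x in range(4)] for y in range(3)]   # 3 rows x 4 columns
top, bottom = cut_at_row(page, 1)
left, right = cut_at_column(page, 2)
```

In the chapter case the row coordinate would come from the recorded position of the matched chapter title.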

One specific example is given below:

C1, reading the stored directory structure information;

C2, constructing a matching regular expression for each chapter title at each level according to the chapter title hierarchy information and the chapter title content information contained in the directory structure information;

C3, reading the text content information of a unit to be processed;

C4, executing regular matching in the current unit to be processed according to the regular expressions obtained in step C2;

C5, judging whether any regular expression is successfully matched in the current unit to be processed; if so, executing step C8, and if not, executing step C6. A successful match indicates that a chapter title has been found in the current unit to be processed; a failed match indicates that no chapter title has been found in it;

C6, judging whether all the units to be processed have been subjected to regular matching for chapter titles, and if so, ending;

C7, if it is judged in step C6 that not all units to be processed have been subjected to regular matching, units remain that need chapter title matching and segmentation, and the process returns to step C3;

C8, if a chapter title is found in the current unit to be processed in step C5, recording the chapter title information, the chapter level and the page position of the chapter title;

C9, judging whether the chapter content of any level can be determined according to the chapter title information, chapter level and page position recorded in step C8, and if so, continuing to execute step C11;

C10, if no chapter content of any level can be determined in step C9, returning to step C6;

C11, recording the chapter content determined in step C9, and executing the page picture unit segmentation operation;

C12, storing the segmentation result, wherein the segmentation result comprises the chapter title, the content contained under the title, the starting page number, the ending page number and the like, and returning to step C6.

After the regular matching, the title search and matching process is complete for all the units to be processed.
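Steps C2 to C8 can be sketched with Python's `re` module. The text does not give the form of the regular expressions, so the pattern used here is an assumption: each title is escaped, whitespace between its tokens is made optional to tolerate recognition spacing differences, and the whole line must match. The line number within a unit stands in for the page position. `TitleHit`, `title_pattern` and `match_units` are hypothetical names.

```python
import re
from typing import List, NamedTuple, Tuple


class TitleHit(NamedTuple):
    title: str
    level: int
    unit_index: int   # which unit to be processed the title was found in
    line_no: int      # line within that unit, standing in for the page position


def title_pattern(title: str):
    """C2: build a whole-line pattern for one chapter title (assumed form)."""
    tokens = [re.escape(tok) for tok in title.split()]
    return re.compile(r"^\s*" + r"\s*".join(tokens) + r"\s*$")


def match_units(units_text: List[str], directory: List[Tuple[int, str]]) -> List[TitleHit]:
    """C3 to C8: scan every unit and record title, level and position for each match."""
    patterns = [(level, title, title_pattern(title)) for level, title in directory]
    hits = []
    for unit_index, text in enumerate(units_text):
        for line_no, line in enumerate(text.splitlines()):
            for level, title, pattern in patterns:
                if pattern.match(line):
                    hits.append(TitleHit(title, level, unit_index, line_no))
                    break
    return hits


directory = [(1, "Chapter one: title 1"), (2, "1.1 title 2")]
units_text = [
    "Chapter one:  title 1\nsome body text\n1.1 title 2\nmore text",
    "continued text",
]
hits = match_units(units_text, directory)
```

Anchoring the pattern to the whole line keeps a title such as "1.1 title 2" from matching body text that merely contains it.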

S105, determining the chapters of the picture electronic document according to the chapter content and the tree directory structure information.

In this step, the segmentation results may be reorganized according to the directory structure or by page.

Organizing the segmentation results according to the directory structure may include:

sorting all chapters in ascending order of starting page number;

and determining the contents of different chapters according to the tree directory structure information and the order of the starting page numbers. In this embodiment, once the contents of the different chapters are determined, the output is organized according to the directory; that is, the directory information of the picture electronic document is organized and obtained according to the mapping relationship between the contents and the chapter numbers.
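One way to read this organization step is sketched below. The end-page rule is an assumption, since the text does not spell it out: a chapter is taken to run until the next title at the same or a shallower level starts, and because two chapters may share a page, that title's starting page is used as the ending page. `Chapter` and `organize` are hypothetical names.

```python
from typing import List, NamedTuple, Tuple


class Chapter(NamedTuple):
    title: str
    level: int
    start_page: int
    end_page: int


def organize(hits: List[Tuple[str, int, int]], last_page: int) -> List[Chapter]:
    """Sort (title, level, start_page) records by start page and derive end pages."""
    ordered = sorted(hits, key=lambda h: h[2])   # stable: same-page titles keep input order
    chapters = []
    for i, (title, level, start) in enumerate(ordered):
        end = last_page
        # Assumed rule: the chapter ends where the next same-or-shallower title begins.
        for _, next_level, next_start in ordered[i + 1:]:
            if next_level <= level:
                end = next_start
                break
        chapters.append(Chapter(title, level, start, end))
    return chapters


hits = [
    ("Chapter two: title 8", 1, 9),
    ("Chapter one: title 1", 1, 1),
    ("1.1 title 2", 2, 1),
    ("1.2 title 5", 2, 5),
]
chapters = organize(hits, last_page=12)
```

With this rule a parent chapter automatically spans all of its subsections, which matches the tree directory structure.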

By the method, after the column segmentation is carried out on the picture document, the character information is identified, the coordinate position of the character information is determined, and then the character information is regularly matched with the tree-shaped directory structure information, so that the efficient chapter content segmentation is realized.

The method provided by the invention has the advantages that the manual participation is not needed in the segmentation process, and the segmentation efficiency is improved. The image character content is strictly matched with the tree-shaped directory structure through automatic identification and regular matching, so that the accuracy of chapter segmentation is improved.

Example two

Based on the same inventive concept, an embodiment of the present invention further provides a document chapter dividing apparatus, as shown in fig. 7, the apparatus includes:

a user input module 701, configured to input a picture electronic document, and the tree directory structure information and in-page column information corresponding to the picture electronic document;

an in-page column segmentation module 702, configured to segment the electronic picture document according to the in-page column information to obtain a unit to be identified;

the optical character recognition module 703 is configured to recognize text information in the unit to be recognized, obtain a text to be processed, and determine a coordinate position of the text information;

the section header matching and segmenting module 704 is used for performing regular matching on the text to be processed and the tree-shaped directory structure information and determining section contents according to a matching result;

and the segmentation result organizing module 705 is used for determining the chapters of the picture electronic document according to the chapter content and the page number information determined by the chapter title matching and segmenting module.

It should be noted that the user input module 701 provided in this embodiment can implement the information input process shown in FIG. 2, solving the same technical problem and achieving the same technical effect, which is not described herein again.

Correspondingly, the in-page column segmentation module 702 provided in this embodiment can implement all the functions of in-page column segmentation shown in FIG. 3, solving the same technical problem and achieving the same technical effect, which is not described herein again.

Correspondingly, the optical character recognition module 703 provided in this embodiment can implement all the functions of optical character recognition shown in FIG. 4, solving the same technical problem and achieving the same technical effect, which is not described herein again.

Correspondingly, the chapter title matching and segmenting module 704 provided in this embodiment can implement the regular matching process shown in FIG. 5, solving the same technical problem and achieving the same technical effect, which is not described herein again.

Correspondingly, the segmentation result organizing module 705 provided in this embodiment can implement all the functions of segmentation result organization shown in FIG. 6, solving the same technical problem and achieving the same technical effect, which is not described herein again.

It should be noted that the apparatus provided in the second embodiment and the method provided in the first embodiment belong to the same inventive concept, solve the same technical problem, and achieve the same technical effect, and the apparatus provided in the second embodiment can implement all the methods of the first embodiment, and the same parts are not described again.

EXAMPLE III

Based on the same inventive concept, an embodiment of the present invention further provides a document chapter division apparatus. As shown in fig. 8, the apparatus includes:

a memory 802, a processor 801, and a user interface 803;

the memory 802 is used for storing a computer program;

the user interface 803 is used for realizing interaction with a user;

the processor 801 is configured to read the computer program in the memory 802, and when the processor 801 executes the computer program, the processor 801 implements:

In fig. 8, the bus architecture may include any number of interconnected buses and bridges, which link together various circuits, including one or more processors represented by the processor 801 and memory represented by the memory 802. The bus architecture may also link together various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore are not described further here. The bus interface provides an interface. The processor 801 is responsible for managing the bus architecture and general processing, and the memory 802 may store data used by the processor 801 in performing operations.

The processor 801 may be a CPU, ASIC, FPGA or CPLD, and the processor 801 may also employ a multi-core architecture.

The processor 801, when executing the computer program stored in the memory 802, implements any of the document chapter division methods shown in fig. 1 to 6.

It should be noted that the apparatus provided in the third embodiment and the method provided in the first embodiment belong to the same inventive concept, solve the same technical problem, and achieve the same technical effect, and the apparatus provided in the third embodiment can implement all the methods of the first embodiment, and the same parts are not described again.

The present application also proposes a processor-readable storage medium. The processor-readable storage medium stores a computer program, and the processor, when executing the computer program, implements any one of the document chapter division methods shown in fig. 1 to 6.

It should be noted that the division of units in the embodiments of the present application is illustrative and is only a logical functional division; other divisions are possible in actual implementation. In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit can be implemented in the form of hardware or in the form of a software functional unit.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims

1. A document chapter division method, comprising:

2. The method of claim 1, wherein inputting the picture electronic document, the tree directory structure information and the in-page column information of the picture electronic document comprises:

reading column information corresponding to the picture electronic document;

3. The method of claim 1, wherein performing column segmentation on the picture electronic document according to the in-page column information to obtain a unit to be recognized comprises:

reading column information corresponding to a current picture;

4. The method of claim 3, wherein the column segmentation comprises:

5. The method of claim 1, wherein the recognizing the text information in the unit to be recognized to obtain the text to be processed comprises:

storing the coordinate location and the text content.

6. The method according to claim 1, wherein performing regular expression matching between the text to be processed and the tree directory structure information, and determining the chapter content according to the matching result, comprises:

7. The method of claim 6, wherein determining the chapters of the picture electronic document according to the chapter content and the tree directory structure information comprises:

sorting all chapters in ascending order according to the page number;

8. A document chapter division apparatus, comprising:

9. A document chapter segmentation apparatus comprising a memory, a processor, and a user interface;

the memory for storing a computer program;

the user interface is used for realizing interaction with a user;

the processor, configured to read the computer program in the memory, and when the processor executes the computer program, implement the document chapter division method according to one of claims 1 to 8.

10. A processor-readable storage medium, characterized in that the processor-readable storage medium stores a computer program which, when executed by a processor, implements a document chapter division method according to one of claims 1 to 8.