CN110688842B - Analysis method, device and server for document title level - Google Patents


Info

Publication number
CN110688842B
Authority
CN
China
Prior art keywords
title
header
document
titles
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910972519.9A
Other languages
Chinese (zh)
Other versions
CN110688842A (en)
Inventor
任宁
晋耀红
李德彦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dingfu Intelligent Technology Co ltd
Original Assignee
Dingfu Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dingfu Intelligent Technology Co ltd
Priority to CN201910972519.9A
Publication of CN110688842A
Application granted
Publication of CN110688842B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the application provide a method, a device and a server for analyzing the title hierarchy of a document. The method comprises the following steps: assigning a title ID to each title of the document, the title IDs increasing in the order in which the titles appear in the document; determining the category of each title according to the character features of the title, and determining the arrangement number of each title within its category, the arrangement numbers increasing in the order of the titles within that category; determining the upper title ID of each title according to its title ID, category and arrangement number, the upper title ID being the title ID of the upper title of the title; determining the subordination relationships between the titles according to the upper title IDs; and determining the level of each title according to the subordination relationships between the titles. The embodiments thus determine the title hierarchy by analyzing features such as the position of each title in the document and its character features, without additional rules, which gives good generality and higher accuracy.

Description

Analysis method, device and server for document title level
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, and a server for analyzing a document title level.
Background
Electronic documents, such as PDF documents, Word documents, RTF (Rich Text Format) documents and HTML (HyperText Markup Language) documents, are the primary medium for carrying information in all kinds of computer systems and are widely used. Extracting valuable information from electronic documents has therefore become a research hotspot in natural language processing in recent years.
Take the extraction of titles from an electronic document and the determination of the title hierarchy as an example. At present, rule-based title identification is commonly used: extraction rules are formulated from the differences between the text style of titles and that of the body text, the titles are extracted from the electronic document with these rules, and the title levels are then determined. However, such rule-based methods place high demands on rule design, and the rules easily conflict with one another, which makes it difficult to improve title recognition accuracy. Rule-based methods also generalize poorly: when different electronic documents use different text styles, separate extraction rules must be formulated for each, so the development cost is high. In addition, some electronic documents are not in a standardized format (e.g., PDF documents produced by scanning or photocopying, or Word documents converted by certain tools), which further reduces the accuracy of current rule-based methods.
Disclosure of Invention
The embodiment of the application provides a method, a device and a server for analyzing a document title level, which are used for solving the problems of poor universality and low accuracy of the title level extraction based on rules in the prior art.
In a first aspect, an embodiment of the present application provides a method for analyzing a document title hierarchy, including: assigning a title ID to each title of a document, the title IDs increasing in the order in which the titles appear in the document; determining the category of each title according to the character features of the title, and determining the arrangement number of each title within its category, the arrangement numbers increasing in the order of the titles within that category; determining the upper title ID of each title according to the title ID, the category and the arrangement number of the title, the upper title ID being the title ID of the upper title of the title; determining the subordination relationships between the titles according to the upper title IDs; and determining the level of each title according to the subordination relationships between the titles.
In a second aspect, an embodiment of the present application provides a device for analyzing a document title hierarchy, including: a title ID generation module, configured to assign a title ID to each title of a document, the title IDs increasing in the order in which the titles appear in the document; an arrangement number generation module, configured to determine the category of each title according to the character features of the title and to determine the arrangement number of each title within its category, the arrangement numbers increasing in the order of the titles within that category; an upper title ID generation module, configured to determine the upper title ID of each title according to the title ID, the category and the arrangement number of the title, the upper title ID being the title ID of the upper title of the title; a title relationship determining module, configured to determine the subordination relationships between the titles according to the upper title IDs; and a document title generation module, configured to determine the level of each title according to the subordination relationships between the titles.
In a third aspect, an embodiment of the present application provides a server, including: a memory and a processor, the memory storing program instructions that, when executed by the processor, cause the server to perform the method of any of the above aspects.
According to the above technical solutions, a title ID is assigned to each title of the document, the title IDs increasing in the order in which the titles appear in the document; the category of each title is determined according to the character features of the title, and the arrangement number of each title within its category is determined, the arrangement numbers increasing in the order of the titles within that category; the upper title ID of each title is determined according to its title ID, category and arrangement number, the upper title ID being the title ID of the upper title of the title; the subordination relationships between the titles are determined according to the upper title IDs; and the level of each title is determined according to the subordination relationships between the titles. The embodiments thus determine the title hierarchy by analyzing features such as the position of each title in the document and its character features, without additional rules, which gives good generality and higher accuracy.
Drawings
In order to more clearly illustrate the technical solutions of the present application, the drawings that are needed in the embodiments will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a flow chart for extracting titles from documents provided by an embodiment of the present application;
FIG. 2 is a flowchart of step S103 of extracting a title from a document provided in an embodiment of the present application;
FIG. 3 is a flowchart of a method for analyzing a document title hierarchy according to an embodiment of the present application;
FIG. 4 is a flowchart of the first stage of determining an upper title ID for each title provided in an embodiment of the present application;
FIG. 5 is a flowchart of step S304 of the document title hierarchy analysis method provided in an embodiment of the present application;
FIG. 6 is a flowchart of step S305 of the document title hierarchy analysis method provided in an embodiment of the present application;
FIG. 7 is a title topology diagram generated from Table 4 in accordance with an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a device for analyzing a document title hierarchy according to an embodiment of the present application.
Detailed Description
In order to better understand the technical solutions in the present application, the following description will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, shall fall within the scope of the present application.
Because a title typically summarizes the subject matter of the one or more paragraphs that follow it, extracting and analyzing titles is one of the main ways of obtaining information from a document. In general, the extraction and analysis of titles in a document includes two stages: in the first stage, titles are identified and extracted from the document; in the second stage, the extracted titles are further analyzed to determine the subordination relationships between the titles and the level of each title.
The technical solutions of the embodiments of the present application are described below following these two stages.
In the first stage, titles are identified and extracted from documents.
FIG. 1 is a flow chart for extracting titles from documents provided by an embodiment of the present application. As shown in fig. 1, the first stage of the embodiment of the present application, extracting the title from the document, includes the following steps:
Step S101, extracting a title feature set and a text feature set from a known document corpus.
The known document corpus is a corpus in which it is already known which content is title and which is body text. It can be collected from earlier document title extraction tasks, or obtained by annotating an unlabeled document corpus. The title feature set contains the title content of the known corpus, and the text feature set contains its body text content; whenever a title and body text are identified in a document corpus, the identified title may be added to the title feature set and the identified body text to the text feature set.
For example, consider the annotated document corpus below (hereinafter, corpus 1):
(V) Investment situation analysis /title/
1. Overall analysis of external equity investment /title/
Not applicable
(1) Major equity investment /title/
Not applicable
(2) Major non-equity investments /title/
Not applicable
(3) Financial assets /title/
Not applicable
(VI) Major asset and equity sales /title/
Not applicable
On October 31, 2018, the company signed a cooperation framework agreement regarding XXX with XXX Investment (Shanghai) Limited (hereinafter "XXX"). For details, see the announcement on the intent to sell assets, published on November 1, 2018.
In corpus 1, with paragraphs as labeling units, paragraphs belonging to a title are labeled "/title/", the contents of the paragraphs labeled "/title/" are added to the title feature set, and the contents of the remaining paragraphs are added to the text feature set.
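The labeling convention above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `split_corpus`, the list-based feature sets and the trailing-marker check are assumptions, and the sketch stores only paragraph text, whereas the feature sets described here may also carry style features such as font size.

```python
def split_corpus(paragraphs, marker="/title/"):
    """Split labeled paragraphs into a title feature set and a text feature set.

    Assumption: a paragraph is a title if it ends with the marker "/title/".
    """
    title_set, text_set = [], []
    for paragraph in paragraphs:
        stripped = paragraph.strip()
        if stripped.endswith(marker):
            # Drop the label itself and keep only the title content.
            title_set.append(stripped[: -len(marker)].strip())
        elif stripped:
            text_set.append(stripped)
    return title_set, text_set


# A few paragraphs in the style of corpus 1 (abbreviated).
corpus_1 = [
    "(V) Investment situation analysis /title/",
    "1. Overall analysis of external equity investment /title/",
    "Not applicable",
    "(1) Major equity investment /title/",
    "Not applicable",
]
titles, texts = split_corpus(corpus_1)
```

Keeping the two sets separate mirrors the two classes that the classifier in step S102 is trained on.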
Step S102, training a parsing model based on a machine learning classification algorithm by using the title feature set and the text feature set.
The parsing model may be, for example, a support vector machine (SVM) model, which learns to classify unknown samples after being trained on corpora of multiple classes. For example, once trained on the title feature set and the text feature set, the model can identify titles and body text in an unknown document. Specifically, the title feature set and the text feature set used for training may contain features of the corresponding titles and body text such as font style, font size, character height and character width, so that the model learns these features during training and can use them to distinguish titles from body text in an unknown document.
Step S103, extracting the titles from the document by using the parsing model.
Before the parsing model can extract titles, the document must be converted into a form that the model accepts. As shown in fig. 2, step S103 may therefore include the following steps:
Step S201, acquiring the coordinate position of each character in the document.
Specifically, the whole document may be analyzed character by character to obtain character features such as the position of each character (for example, its X-axis and Y-axis coordinates on the document page), character size and character style, where the X-axis coordinate runs along the width direction of the document page and the Y-axis coordinate along the height direction.
To acquire these features, a two-dimensional coordinate system with an X axis in the page width direction and a Y axis in the page height direction may be established on the document page. Once the coordinate system is fixed, the X-axis coordinate, Y-axis coordinate, character size and so on of each character in the document can be determined. For font style recognition, the characters in the document may be matched against a font library: for example, font feature data can be obtained from the coordinates covered by the strokes of a character, and this feature data can then be matched in the font library to obtain the character's font style. In this embodiment, the font style includes the font name and whether the font is bold, italic, underlined, and so on.
Step S202, extracting document contents in line units according to the coordinate positions.
As described in step S201, if the two-dimensional coordinate system has an X axis in the page width direction and a Y axis in the page height direction, characters on the same line of the document share the same Y-axis coordinate. Therefore, in step S202, characters with the same Y-axis coordinate can be treated as belonging to the same line, so the document content can be extracted line by line.
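As a rough illustration of step S202, the Python sketch below groups characters into lines by their Y-axis coordinate. The `(character, x, y)` tuple layout, the function name `group_into_lines` and the assumption that Y grows down the page are all illustrative choices, not details given in the text.

```python
from collections import defaultdict


def group_into_lines(chars):
    """Group (character, x, y) tuples into text lines.

    Characters sharing a Y coordinate form one line; within a line they are
    ordered by X. Assumption: Y increases from the top of the page downward.
    """
    rows = defaultdict(list)
    for ch, x, y in chars:
        rows[y].append((x, ch))
    lines = []
    for y in sorted(rows):
        # Sorting the (x, ch) pairs restores left-to-right reading order.
        lines.append("".join(ch for _, ch in sorted(rows[y])))
    return lines


sample_chars = [("b", 20, 10), ("a", 10, 10), ("c", 10, 30)]
sample_lines = group_into_lines(sample_chars)
```

Real documents rarely give identical Y values for every character of a line, so a practical version would probably bucket coordinates within a small tolerance; that refinement is left out here.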
Step S203, the extracted document content is input to the parsing model to extract the title.
Since the parsing model is trained in advance and a large number of features of titles, such as font style, font size, etc., are learned, the content extracted in step S202 is taken as an input of the parsing model in units of lines, and it is possible to determine whether the input document content is a title or a text through the classification capability of the parsing model.
Further, as corpus 1 shows, the titles in a document may have a variety of character features. For example, the sequence-number formats of the following three titles differ:
(V) Investment situation analysis
1. Overall analysis of external equity investment
(1) Major equity investment
These differences arise because the titles in a document stand in hierarchical relationships to one another. In the embodiments of the application, the hierarchical relationships include subordinate relationships and parallel relationships between titles. A subordinate relationship means that one title (the upper title) logically summarizes the content covered by another title (the lower title); a parallel relationship means that the content summarized by two titles is logically side by side in the document.
Because the titles in a document may have various character features, titles with different character features can be given different labels when the corpus is annotated, so that the parsing model can distinguish titles with different character features when extracting them.
With steps S101-S103 and S201-S203, all titles can be extracted from a document and, at the same time, grouped according to their character features. All titles in a group share the same character features; such a group is referred to as a title category in the embodiments of the present application.
On the basis of extracting the title, the embodiment of the application provides a document title level analysis method.
Fig. 3 is a flowchart of a method for analyzing a document title hierarchy according to an embodiment of the present application. As shown in fig. 3, the method may include the steps of:
Step S301, assigning a title ID to each title of a document, wherein the title IDs increase according to the order of the titles in the document.
Specifically, all extracted titles may be sorted by their order in the document, and each title is then assigned a title ID in that order, from first to last. The title ID may be an Arabic numeral that increases with the position of the title in the document. For example, the first title in a document is assigned title ID 1, the second title ID 2, and so on.
Step S302, determining the category of each title according to the character characteristics of the title, and determining the arrangement number of each title in the category of the title, wherein the arrangement number is increased according to the order of the titles in the category of the title.
As described above, titles of the same category share the same character features, and titles of different categories have different character features. The titles can therefore be divided into several categories according to their character features, and the titles within each category are ordered separately to determine each title's arrangement number within its category. The arrangement number may likewise be an Arabic numeral that increases according to the order of the titles within their category.
As an example, after the present embodiment performs step S301 and step S302 on the title, the following table can be obtained:
[Table 1 is an image in the original publication and is not reproduced here.]
TABLE 1
In Table 1, title categories are represented by lowercase English letters a, b, c, d and so on; each letter denotes one title category. For example, the titles "one", "two", "three" and "four" belong to category a, and their arrangement numbers are 1, 2, 3 and 4 in order.
Note that, as shown in Table 1, the titles "1.1", "1.2", "3.1" and "3.2" all belong to category c but carry two separate runs of arrangement numbers: "1.1" and "1.2" have arrangement numbers 1 and 2, and "3.1" and "3.2" also have arrangement numbers 1 and 2. The reason is that, although these titles share a category, the sequence numbers 1.1 and 3.1 show that they belong to different chapters of the document and their contents are relatively independent. The embodiments of the application therefore assign arrangement numbers to "1.1", "1.2" and to "3.1", "3.2" separately.
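Steps S301 and S302 might be sketched in Python as below. Several things here are assumptions made only for illustration: the title list is pieced together from the titles this description names for Table 1, the regular expressions in `CATEGORY_PATTERNS` stand in for whatever character features the parsing model actually uses, and restarting the arrangement number whenever the chapter prefix of a dotted number changes is just one possible way to obtain the separate numbering of "1.1", "1.2" and "3.1", "3.2".

```python
import re

# Hypothetical patterns: each one stands for the character features of one category.
CATEGORY_PATTERNS = [
    ("a", re.compile(r"^(one|two|three|four)$")),
    ("b", re.compile(r"^\((one|two|three|four)\)$")),
    ("c", re.compile(r"^(\d+)\.\d+$")),
    ("d", re.compile(r"^\(\d+\)$")),
]

# Sequence numbers of the titles, in document order (assumed reconstruction).
TITLES = ["one", "(one)", "1.1", "1.2", "(two)", "(three)", "two",
          "three", "3.1", "3.2", "(1)", "(2)", "(3)", "four"]


def annotate(titles):
    """Return one row per title: title ID, category and arrangement number."""
    counters = {}
    rows = []
    for title_id, title in enumerate(titles, start=1):  # S301: IDs follow document order
        for category, pattern in CATEGORY_PATTERNS:     # S302: classify by character features
            match = pattern.match(title)
            if match:
                break
        else:
            raise ValueError(f"unrecognized title format: {title!r}")
        # Dotted titles are numbered per chapter prefix ("1." and "3." separately).
        group = match.group(1) if category == "c" else ""
        key = (category, group)
        counters[key] = counters.get(key, 0) + 1
        rows.append({"id": title_id, "title": title,
                     "category": category, "number": counters[key]})
    return rows


rows = annotate(TITLES)
```

With this input, "3.1" receives arrangement number 1 again even though it is the third title of category c, which matches the separate numbering described above.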
Step S303, determining the upper title ID of each title according to the title ID, the category and the arrangement number of the title, wherein the upper title ID is the title ID of the upper title of the title.
As described above, the hierarchical relationships in the embodiments of the present application include subordinate relationships and parallel relationships between titles. The concepts of upper title and lower title apply to titles in a subordinate relationship: if an earlier title logically summarizes a later title, and no other title between the two can summarize the later title, then the earlier title is the upper title of the later title, and the later title is a lower title of the earlier title.
In the embodiment of the present application, determining the upper title ID of each title may generally include two stages.
Fig. 4 is a flowchart of a first stage of determining an upper title ID of each title provided in the embodiment of the present application. As shown in fig. 4, the first stage may specifically include the following steps:
Step S401, determining the head titles of each category according to the arrangement numbers, wherein a head title is a title with the smallest arrangement number in its category.
For example, the title "one" in Table 1 has the minimum arrangement number 1, so it is a head title of category a; the title "(one)" has the minimum arrangement number 1, so it is a head title of category b; the title "1.1" has the minimum arrangement number 1, so it is a head title of category c; the title "3.1" also has arrangement number 1, so it is also a head title of category c (that is, one title category may contain several head titles); and the title "(1)" has the minimum arrangement number 1, so it is a head title of category d.
Step S402, determining the upper title ID of each head title according to the order of the titles in the document, wherein the upper title ID is the title ID of the title immediately preceding the head title.
In step S402, if the head title is found to be the first title in the document, its upper title ID is a preset start value, for example 0.
For example, the head title "one" has title ID 1, so it is the first title in the document and its upper title ID is the preset start value 0. The title preceding the head title "(one)" is "one", whose title ID is 1, so the upper title ID of "(one)" is 1. The title preceding the head title "1.1" is "(one)", whose title ID is 2, so the upper title ID of "1.1" is 2. The title preceding the head title "3.1" is "three", whose title ID is 8, so the upper title ID of "3.1" is 8. The title preceding the head title "(1)" is "3.2", whose title ID is 10, so the upper title ID of "(1)" is 10.
Updating Table 1 with the upper title IDs of the head titles gives Table 2:
[Table 2 is an image in the original publication and is not reproduced here.]
TABLE 2
Step S403, acquiring the first same-category titles of each head title, and using the upper title ID of the head title as the upper title ID of each of its first same-category titles, wherein a first same-category title is a title that is adjacent to the head title and has the same category.
For example, the head title "1.1" is adjacent to the title "1.2" of the same category, so "1.2" is a first same-category title of "1.1" and takes over its upper title ID; that is, the upper title ID of "1.2" is 2. Likewise, the head title "3.1" is adjacent to the title "3.2" of the same category, so "3.2" is a first same-category title of "3.1" and its upper title ID is 8. The head title "(1)" is adjacent to the title "(2)" of the same category, so "(2)" is a first same-category title of "(1)" and its upper title ID is 10.
In addition, in the embodiments of the present application, a title that is adjacent to an already determined first same-category title of a head title and has the same category is also treated as a first same-category title of that head title. For example, the head title "(1)" already has the first same-category title "(2)", and "(2)" is in turn adjacent to the title "(3)" of the same category, so "(3)" is also a first same-category title of "(1)" and its upper title ID is also 10.
Updating Table 2 with the upper title IDs of the first same-category titles gives Table 3:
[Table 3 is an image in the original publication and is not reproduced here.]
TABLE 3
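The first stage (steps S401 to S403) can be sketched in Python as follows. The rows are an assumed reconstruction from the titles, categories and arrangement numbers named in this description, and the names `ROWS`, `START` and `first_stage` are hypothetical.

```python
# (title ID, title, category, arrangement number), in document order.
# Assumed reconstruction from the titles named in the text.
ROWS = [
    (1, "one", "a", 1), (2, "(one)", "b", 1), (3, "1.1", "c", 1), (4, "1.2", "c", 2),
    (5, "(two)", "b", 2), (6, "(three)", "b", 3), (7, "two", "a", 2), (8, "three", "a", 3),
    (9, "3.1", "c", 1), (10, "3.2", "c", 2), (11, "(1)", "d", 1), (12, "(2)", "d", 2),
    (13, "(3)", "d", 3), (14, "four", "a", 4),
]
START = 0  # preset start value for the first title in the document


def first_stage(rows):
    """Resolve upper title IDs for head titles and their adjacent same-category runs."""
    upper = {}
    for index, (title_id, _, category, number) in enumerate(rows):
        if number == 1:
            # S401/S402: a head title takes the ID of the title just before it.
            upper[title_id] = rows[index - 1][0] if index > 0 else START
        elif index > 0:
            prev_id, _, prev_category, _ = rows[index - 1]
            # S403: a title directly following a resolved title of the same
            # category inherits that title's upper title ID.
            if prev_category == category and prev_id in upper:
                upper[title_id] = upper[prev_id]
    return upper


stage_one = first_stage(ROWS)
```

Titles such as "(two)" or "two" are left unresolved on purpose: the title directly before them belongs to a different category, so they are handled by the second stage.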
The second stage of determining the upper title ID of each title may specifically include: for the non-head titles of each category (the titles other than the head titles), obtaining the second same-category title of each non-head title in increasing order of arrangement number, and using the upper title ID of the second same-category title as the upper title ID of that non-head title, wherein the second same-category title is the nearest title that precedes the non-head title and has the same category.
For example, the nearest preceding title of the same category as the non-head title "(two)" is "(one)", so "(one)" is the second same-category title of "(two)"; its upper title ID is 1, so the upper title ID of "(two)" is 1. Similarly, the nearest preceding title of the same category as the non-head title "(three)" is "(two)", whose upper title ID is 1, so the upper title ID of "(three)" is 1. The nearest preceding title of the same category as the non-head title "two" is "one", whose upper title ID is 0, so the upper title ID of "two" is 0. The nearest preceding title of the same category as the non-head title "three" is "two", whose upper title ID is 0, so the upper title ID of "three" is 0. The nearest preceding title of the same category as the non-head title "four" is "three", whose upper title ID is 0, so the upper title ID of "four" is 0.
Updating Table 3 with the upper title IDs of the non-head titles gives Table 4:
[Table 4 is an image in the original publication and is not reproduced here.]
TABLE 4
In this way, the two-stage analysis determines the upper title IDs of all titles in the document, which provides a sufficient basis for analyzing the subordination relationships between the titles.
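The second stage can be sketched in Python in the same spirit. The inputs are assumptions: `CATEGORIES` lists the title IDs and categories reconstructed from the titles named in this description, and `FIRST_STAGE` holds the upper title IDs that the first stage yields for the head titles and their adjacent same-category titles.

```python
# (title ID, category) in document order; assumed reconstruction from the text.
CATEGORIES = [
    (1, "a"), (2, "b"), (3, "c"), (4, "c"), (5, "b"), (6, "b"), (7, "a"),
    (8, "a"), (9, "c"), (10, "c"), (11, "d"), (12, "d"), (13, "d"), (14, "a"),
]
# Upper title IDs already known after the first stage.
FIRST_STAGE = {1: 0, 2: 1, 3: 2, 4: 2, 9: 8, 10: 8, 11: 10, 12: 10, 13: 10}


def second_stage(categories, first_stage):
    """Fill in the remaining upper title IDs from the nearest preceding
    title of the same category."""
    resolved = dict(first_stage)
    last_in_category = {}
    for title_id, category in categories:
        if title_id not in resolved:
            # Inherit from the nearest earlier title with the same category.
            resolved[title_id] = resolved[last_in_category[category]]
        last_in_category[category] = title_id
    return resolved


all_upper = second_stage(CATEGORIES, FIRST_STAGE)
```

Walking the titles in document order guarantees that the nearest preceding same-category title is already resolved when it is needed, which is why a single pass is enough in this sketch.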
Step S304, determining the generic relationship between the titles according to the upper title ID.
In some embodiments, as shown in Table 5, the first lower title of "second part" is "sixth bar", but the upper title of "sixth bar" determined in step S303 is "first part". The titles "sixth bar" and "first part" are therefore easily mistaken as having a generic relationship, when in fact "second part" is the title that governs "sixth bar". To avoid this error, step S304 further determines the governing upper title of each title.
[Table 5: image in the original publication]
Fig. 5 is a flowchart of step S304 of the document title hierarchy analysis method provided in an embodiment of the present application. As shown in Fig. 5, step S304 may specifically include the following steps:
Step S501, determining, according to the upper title ID, whether any other same-level upper title exists between each title and its upper title, where a same-level upper title is a title whose upper title ID is the same as the upper title ID of the upper title of the title.
For example, in Table 5, the upper title of the title "first bar" is "first part", and no title between "first bar" and "first part" has the same upper title ID as "first part"; thus there is no same-level upper title between "first bar" and "first part". As another example, the upper title of the title "sixth bar" is "first part", and the title "second part", which has the same upper title ID as "first part", lies between "sixth bar" and "first part"; thus "second part" is a same-level upper title of "sixth bar". As a further example, the upper title of the title "ninth bar" is "first part", and the titles "second part" and "third part", which have the same upper title ID as "first part", lie between "ninth bar" and "first part"; thus "second part" and "third part" are both same-level upper titles of "ninth bar".
Step S5021, if a same-level upper title exists, the same-level upper title closest to the title is used as the governing upper title of the title.
For example, "second part" is the governing upper title of "sixth bar"; as another example, "third part" is the same-level upper title nearest to "ninth bar", so "third part" is the governing upper title of "ninth bar".
Step S5022, if no same-level upper title exists, the upper title of the title is used as the governing upper title of the title.
For example, "first part" is the governing upper title of "first bar".
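Steps S501, S5021 and S5022 can be sketched as follows. This is only an illustration under stated assumptions: the function name `governing_upper_ids`, the constant `START_ID` (taken to be 0, as in the example values above), and the sample title IDs are hypothetical and do not reproduce Table 5.

```python
START_ID = 0  # assumed preset start value for titles that have no upper title


def governing_upper_ids(upper_ids: dict[int, int]) -> dict[int, int]:
    """Map each title ID to the ID of the upper title that actually governs it."""
    governing: dict[int, int] = {}
    for title_id, upper_id in upper_ids.items():
        if upper_id == START_ID:
            governing[title_id] = START_ID  # top-level title, nothing to correct
            continue
        # Upper title ID shared by the upper title and its same-level upper titles.
        target = upper_ids[upper_id]
        # S501: same-level upper titles located between the upper title and this title.
        peers = [t for t in range(upper_id + 1, title_id) if upper_ids.get(t) == target]
        # S5021: take the nearest same-level upper title; S5022: otherwise keep the upper title.
        governing[title_id] = peers[-1] if peers else upper_id
    return governing


# Hypothetical input: title ID -> upper title ID as produced by step S303.
# 1, 4 and 7 stand for three "part" titles; the other IDs stand for "bar" titles.
upper_ids = {1: 0, 2: 1, 3: 1, 4: 0, 5: 1, 6: 1, 7: 0, 8: 1}
result = governing_upper_ids(upper_ids)
```

In this assumed input every "bar" title initially points at title 1, and the sketch reassigns titles 5 and 6 to title 4 and title 8 to title 7, which mirrors how "sixth bar" is reassigned from "first part" to "second part" in the text.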
Step S305, determining the hierarchy of each title according to the generic relation between the titles.
Fig. 6 is a flowchart of step S305 of the document title hierarchy analysis method provided in an embodiment of the present application. As shown in Fig. 6, step S305 may include the following steps:
Step S601, generating a title topology of the document according to each title and the governing upper title of that title.
Fig. 7 shows a title topology generated from Table 4 according to an embodiment of the present application. As shown in Fig. 7, the title topology may be presented in the form of a title tree, in which each title is a node and the nodes are connected by connecting lines.
Step S602, determining a hierarchy of each title according to the title topology structure.
The nodes at the two ends of a connecting line are in a subordinate relationship. For example, the title "3.1" and the title "(3)" are located at the two ends of one connecting line, so they are in a subordinate relationship, and "3.1" is the governing upper title of "(3)". Titles that share the same governing upper title are in a parallel relationship. For example, the titles "(1)", "(2)" and "(3)" share the governing upper title "3.2", so they are in a parallel relationship.
After the hierarchy of each title is determined, a technician can accurately determine the content structure of the document from its titles. For example, from the subordinate relationship between the title "3.2" and the titles "(1)", "(2)" and "(3)", it can be determined that the title "3.2" summarizes the content corresponding to the titles "(1)", "(2)" and "(3)", and that the content corresponding to the titles "(1)", "(2)" and "(3)" is logically parallel.
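Steps S601 and S602 can be sketched as building a tree from the governing upper title of each title and reading the level of each title off that tree. The names `build_tree`, `title_levels` and `START_ID`, as well as the sample IDs, are assumptions made for illustration; Fig. 7 itself is not reproduced.

```python
from collections import defaultdict

START_ID = 0  # assumed preset start value, used here as a virtual root


def build_tree(governing: dict[int, int]) -> dict[int, list[int]]:
    """S601: group titles under their governing upper title (parent -> children)."""
    tree: dict[int, list[int]] = defaultdict(list)
    for title_id in sorted(governing):
        tree[governing[title_id]].append(title_id)
    return dict(tree)


def title_levels(governing: dict[int, int]) -> dict[int, int]:
    """S602: depth of each title in the title tree (1 = top level)."""
    levels: dict[int, int] = {}

    def level_of(title_id: int) -> int:
        if title_id == START_ID:
            return 0
        if title_id not in levels:
            levels[title_id] = level_of(governing[title_id]) + 1
        return levels[title_id]

    for title_id in governing:
        level_of(title_id)
    return levels


# Hypothetical input: title 1 plays the role of "3.2"; titles 2-4 play "(1)", "(2)", "(3)".
governing = {1: 0, 2: 1, 3: 1, 4: 1}
tree = build_tree(governing)
levels = title_levels(governing)
```

Titles that appear in the same child list of `tree` are in a parallel relationship, and a parent and any of its children are in a subordinate relationship.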
As can be seen from the above technical solution, the embodiments of the present application provide a method for analyzing the document title hierarchy. The method includes: assigning a title ID to each title of the document, the title IDs increasing according to the order of the titles in the document; determining the category of each title according to the character features of the title, and determining the arrangement number of each title within its category, the arrangement numbers increasing according to the order of the titles within the category; determining the upper title ID of each title according to the title ID, category and arrangement number of the title, the upper title ID being the title ID of the upper title of the title; determining the generic relationship between the titles according to the upper title IDs; and determining the hierarchy of each title according to the generic relationship between the titles. The embodiments of the present application thus determine the title hierarchy by analyzing features such as the position of each title in the document and its character features, without requiring additional rules, which gives good generality and higher accuracy.
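The first two steps of the method (assigning title IDs, then determining categories and arrangement numbers) might look like the sketch below. The regular expressions in `CATEGORY_PATTERNS` are hypothetical stand-ins for the character features, which this passage does not enumerate, and starting the title IDs at 1 is likewise an assumption.

```python
import re
from collections import defaultdict

# Hypothetical character-feature patterns, checked in order; not taken from the source.
CATEGORY_PATTERNS = [
    ("n.n", re.compile(r"^\d+\.\d+")),
    ("(n)", re.compile(r"^\(\d+\)")),
    ("n", re.compile(r"^\d+")),
]


def categorize(text: str) -> str:
    """Return the first category whose pattern matches the start of the title."""
    for name, pattern in CATEGORY_PATTERNS:
        if pattern.match(text):
            return name
    return "other"


def number_titles(texts: list[str]) -> list[tuple[int, str, str, int]]:
    """Return (title ID, text, category, arrangement number) for each title in order."""
    counters: dict[str, int] = defaultdict(int)
    rows = []
    for title_id, text in enumerate(texts, start=1):
        category = categorize(text)
        counters[category] += 1  # arrangement number increases within the category
        rows.append((title_id, text, category, counters[category]))
    return rows


rows = number_titles(
    ["3 Design", "3.1 Input", "(1) Files", "(2) Streams", "3.2 Output", "(1) Logs"]
)
```

Under these assumptions, the title with arrangement number 1 in each category is the header title of that category, which is the starting point for determining the upper title IDs.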
The present application also provides an embodiment of a document title level analysis apparatus, which may be used to execute a method embodiment of the present application, and for technical details not disclosed in the apparatus embodiment of the present application, please refer to the method embodiment of the present application.
Fig. 8 is a schematic structural diagram of an analysis device for document title level according to an embodiment of the present application. As shown in fig. 8, the apparatus includes:
a title ID generation module 701, configured to assign a title ID to each title of a document, where the title IDs increment according to the order of the titles in the document;
an arrangement number generation module 702, configured to determine a category of each title according to a character feature of the title, and determine an arrangement number of each title in the category to which the title belongs, where the arrangement number increases according to the order of the titles in the category to which the title belongs;
an upper title ID generation module 703, configured to determine an upper title ID of each of the titles according to the title IDs, the categories, and the arrangement numbers of the titles, where the upper title ID is the title ID of the upper title of the title;
a generic relationship determining module 704, configured to determine a generic relationship between the titles according to the upper title ID;
a document title generation module 705, configured to determine a hierarchy of each title according to a generic relationship between the titles.
As can be seen from the above technical solution, the embodiments of the present application provide a document title hierarchy analysis device. The device is used for: assigning a title ID to each title of the document, the title IDs increasing according to the order of the titles in the document; determining the category of each title according to the character features of the title, and determining the arrangement number of each title within its category, the arrangement numbers increasing according to the order of the titles within the category; determining the upper title ID of each title according to the title ID, category and arrangement number of the title, the upper title ID being the title ID of the upper title of the title; determining the generic relationship between the titles according to the upper title IDs; and determining the hierarchy of each title according to the generic relationship between the titles. The embodiments of the present application thus determine the title hierarchy by analyzing features such as the position of each title in the document and its character features, without requiring additional rules, which gives good generality and higher accuracy.
The embodiments of the present application also provide a server, which includes a memory and a processor, where the memory stores program instructions that, when executed by the processor, cause the server to perform the methods of the above embodiments.
Embodiments of the present application also provide a computer storage medium comprising computer instructions which, when run on a user equipment, cause the user equipment to perform the methods of the above embodiments.
Embodiments of the present application also provide a computer program product for causing a computer to perform the method of the above embodiments when the computer program product is run on the computer.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (8)

1. A method of analyzing a document title hierarchy, comprising:
assigning a title ID to each title of a document, the title IDs being incremented according to the order of the titles in the document;
determining the category of each title according to the character characteristics of the title, and determining the arrangement number of each title in the category of the title, wherein the arrangement number is increased according to the sequence of the title in the category of the title;
determining the upper title ID of each title according to the title ID, the category and the arrangement number of the title, wherein the upper title ID is the title ID of the upper title of the title;
the determining the upper title ID of each title according to the title ID, the category and the arrangement number of the title comprises the following steps:
determining a header title of each category according to the arrangement number, wherein the header title is the title with the smallest arrangement number in each category;
determining the upper title ID of each header title according to the order of the titles in the document, wherein the upper title ID is the title ID of the title before the header title;
acquiring a first parity title of the header title, and taking the upper title ID of the header title as the upper title ID of the corresponding first parity title, wherein the first parity title is a title which is adjacent to the header title and has the same category;
for the non-header titles except for the header title in each category, respectively acquiring a second parity title of each non-header title according to the ascending order of the arrangement number, and taking the upper title ID of the second parity title as the upper title ID of the corresponding non-header title, wherein the second parity title is the title which is positioned in front of the non-header title and is nearest to the non-header title and has the same category;
determining the generic relationship between the titles according to the upper title ID;
and determining the hierarchy of each title according to the generic relation between the titles.
2. The method as recited in claim 1, further comprising:
if it is determined that the header title is the first title in the document according to the order of the titles in the document, the upper title ID of the header title is a preset start value.
3. The method of claim 1, wherein said determining the generic relationship between the titles from the upper title ID comprises:
judging, according to the upper title ID, whether any other same-level upper title exists between each title and its upper title, wherein the upper title ID of the same-level upper title is the same as the upper title ID of the upper title of the title;
if a same-level upper title exists, taking the same-level upper title closest to the title as the governing upper title of the title;
and if no same-level upper title exists, taking the upper title of the title as the governing upper title of the title.
4. A method according to claim 3, wherein said determining the hierarchy of each of said titles from the generic relationship between said titles comprises:
generating a title topological structure of the document according to the title and the upper title of the title;
and determining the hierarchy of each title according to the title topological structure.
5. The method of claim 1, wherein, before assigning a title ID to each title of a document, the title IDs being incremented according to the order of the titles in the document, the method further comprises:
extracting a title feature set and a text feature set from a known document corpus;
training on the title feature set and the text feature set, based on a machine learning classification algorithm, to obtain a parsing model;
extracting the titles from the document using the parsing model.
6. The method of claim 5, wherein extracting the titles from the document using the parsing model comprises:
acquiring the coordinate position of each character in a document;
extracting document content in units of rows according to the coordinate positions;
inputting the extracted document content into the parsing model to extract the titles.
7. An apparatus for analyzing a document title hierarchy, comprising:
a title ID generation module, configured to assign a title ID to each title of a document, where the title IDs increment according to the order of the titles in the document;
the arrangement number generation module is used for determining the category of each title according to the character characteristics of the title and determining the arrangement number of each title in the category of the title, wherein the arrangement number is increased according to the sequence of the title in the category of the title;
the upper title ID generation module is used for determining the upper title ID of each title according to the title ID, the category and the arrangement number of the title, wherein the upper title ID is the title ID of the upper title of the title;
the upper title ID generation module is specifically configured to: determine a header title of each category according to the arrangement number, wherein the header title is the title with the smallest arrangement number in each category;
determining the upper title ID of each header title according to the order of the titles in the document, wherein the upper title ID is the title ID of the title before the header title;
acquiring a first parity title of the header title, and taking the upper title ID of the header title as the upper title ID of the corresponding first parity title, wherein the first parity title is a title which is adjacent to the header title and has the same category;
for the non-header titles except for the header title in each category, respectively acquiring a second parity title of each non-header title according to the ascending order of the arrangement number, and taking the upper title ID of the second parity title as the upper title ID of the corresponding non-header title, wherein the second parity title is the title which is positioned in front of the non-header title and is nearest to the non-header title and has the same category;
the generic relationship determining module is used for determining the generic relationship between the titles according to the upper title ID;
and the document title generation module is used for determining the hierarchy of each title according to the generic relationship between the titles.
8. A server comprising a memory and a processor, the memory storing program instructions that, when executed by the processor, cause the server to perform the method of any of claims 1-6.
CN201910972519.9A 2019-10-14 2019-10-14 Analysis method, device and server for document title level Active CN110688842B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910972519.9A CN110688842B (en) 2019-10-14 2019-10-14 Analysis method, device and server for document title level

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910972519.9A CN110688842B (en) 2019-10-14 2019-10-14 Analysis method, device and server for document title level

Publications (2)

Publication Number Publication Date
CN110688842A CN110688842A (en) 2020-01-14
CN110688842B true CN110688842B (en) 2023-06-09

Family

ID=69112391

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910972519.9A Active CN110688842B (en) 2019-10-14 2019-10-14 Analysis method, device and server for document title level

Country Status (1)

Country Link
CN (1) CN110688842B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723551A (en) * 2020-06-16 2020-09-29 北京双泽维度信息技术有限公司 Document title structure tree generation method, device and system
CN112380873B (en) * 2020-12-04 2024-04-26 鼎富智能科技有限公司 Method and device for determining selected items in specification document

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006338178A (en) * 2005-05-31 2006-12-14 Sony Corp Hierarchical-structure menu displaying method, hierarchical-structure menu displaying device, and hierarchical-structure menu displaying program
JP2007226453A (en) * 2006-02-22 2007-09-06 Toshiba Corp Structured document processor, structured document processing method and structured document processing program
CN106469143A (en) * 2015-08-21 2017-03-01 国际商业机器公司 The estimation of file structure
CN107291677A (en) * 2017-07-14 2017-10-24 北京神州泰岳软件股份有限公司 A kind of PDF document header syntax tree generation method, device, terminal and system
CN109670162A (en) * 2017-10-13 2019-04-23 北大方正集团有限公司 The determination method, apparatus and terminal device of title

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6978264B2 (en) * 2002-01-03 2005-12-20 Microsoft Corporation System and method for performing a search and a browse on a query
JP5417471B2 (en) * 2012-03-14 2014-02-12 株式会社東芝 Structured document management apparatus and structured document search method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
学术文本的结构功能识别――功能框架及基于章节标题的识别;陆伟等;《情报学报》;20140924(第09期) *

Also Published As

Publication number Publication date
CN110688842A (en) 2020-01-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 230000 zone B, 19th floor, building A1, 3333 Xiyou Road, hi tech Zone, Hefei City, Anhui Province

Applicant after: Dingfu Intelligent Technology Co.,Ltd.

Address before: Room 630, 6th floor, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing

Applicant before: DINFO (BEIJING) SCIENCE DEVELOPMENT Co.,Ltd.

GR01 Patent grant