CN110688842A - Document title level analysis method and device and server - Google Patents


Info

Publication number
CN110688842A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910972519.9A
Other languages
Chinese (zh)
Other versions
CN110688842B (en)
Inventor
任宁
晋耀红
李德彦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Dingfu (beijing) Science And Technology Development Co Ltd
Original Assignee
Zhongke Dingfu (beijing) Science And Technology Development Co Ltd

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the application provide a method and a device for analyzing a document title hierarchy, and a server. The method comprises the following steps: assigning a title ID to each title of the document, the title IDs increasing according to the order of the titles in the document; determining the category of each title according to the character features of the titles, and determining the arrangement number of each title in the category to which it belongs, wherein the arrangement numbers increase according to the order of the titles within that category; determining the upper title ID of each title according to the title ID, the category and the arrangement number of the title, wherein the upper title ID is the title ID of the upper title of the title; determining the leading relationship between the titles according to the upper title IDs; and determining the hierarchy of each title according to the leading relationship between the titles. The embodiments of the application thus determine the title hierarchy by analyzing features such as the position of each title in the document and its character features, without requiring additional rules, which gives good universality and higher accuracy.

Description

Document title level analysis method and device and server
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a method and an apparatus for analyzing a document title hierarchy, and a server.
Background
Electronic documents, such as PDF documents, Word documents, RTF (Rich Text Format) documents, HTML (HyperText Markup Language) documents, and the like, are the main media forms for carrying information in various computer systems, and are widely used. Therefore, extracting valuable information from electronic documents has become a research hotspot in the field of natural language processing in recent years.
Take the extraction of titles from an electronic document and the determination of the title hierarchy as an example. At present, rule-based title identification is commonly used: extraction rules are formulated according to the differences between the text style of titles and that of the body text, and the rules are then used to extract the titles from the electronic document and determine their levels. However, a rule-based method places high demands on rule formulation, and conflicts between rules are likely to occur, so the accuracy of title recognition is difficult to improve. In addition, a rule-based method lacks universality: when the text styles of different electronic documents vary, extraction rules must be formulated for each of them, so the development cost is high. Furthermore, the non-standard format of some electronic documents (e.g., PDF documents produced by scanning or photocopying, Word documents converted by certain tools, etc.) also reduces the accuracy of current rule-based methods.
Disclosure of Invention
The embodiment of the application provides a method, a device and a server for analyzing a document title level, and aims to solve the problems of poor universality and low accuracy of rule-based title level extraction in the prior art.
In a first aspect, an embodiment of the present application provides a method for analyzing a document title hierarchy, including: assigning a title ID to each title of a document, the title IDs increasing according to the order of the titles in the document; determining the category of each title according to the character features of the titles, and determining the arrangement number of each title in the category to which it belongs, wherein the arrangement numbers increase according to the order of the titles within that category; determining an upper title ID of each title according to the title ID, the category and the arrangement number of the title, wherein the upper title ID is the title ID of the upper title of the title; determining the leading relationship among the titles according to the upper title IDs; and determining the hierarchy of each title according to the leading relationship among the titles.
In a second aspect, an embodiment of the present application provides an apparatus for analyzing a document title hierarchy, including: the title ID generation module is used for distributing title IDs for all titles of the documents, and the title IDs are increased progressively according to the sequence of the titles in the documents; the arrangement number generating module is used for determining the category of each title according to the character features of the titles and determining the arrangement number of each title in the category to which the title belongs, wherein the arrangement numbers are increased progressively according to the sequence of the titles in the category to which the titles belong; an upper title ID generation module for determining an upper title ID of each title according to the title ID, the category and the arrangement number of the title, wherein the upper title ID is the title ID of the upper title of the title; a leading relationship determining module, configured to determine a leading relationship between the titles according to the upper title ID; and the document title generating module is used for determining the hierarchy of each title according to the leading relationship among the titles.
In a third aspect, an embodiment of the present application provides a server, including: a memory and a processor, the memory storing program instructions that, when executed by the processor, cause the server to perform the method of any of the above aspects.
According to the above technical solution, a title ID can be assigned to each title of the document, the title IDs increasing according to the order of the titles in the document; the category of each title is determined according to the character features of the titles, and the arrangement number of each title in the category to which it belongs is determined, wherein the arrangement numbers increase according to the order of the titles within that category; the upper title ID of each title is determined according to the title ID, the category and the arrangement number of the title, wherein the upper title ID is the title ID of the upper title of the title; the leading relationship between the titles is determined according to the upper title IDs; and the hierarchy of each title is determined according to the leading relationship between the titles. The embodiments of the application thus determine the title hierarchy by analyzing features such as the position of each title in the document and its character features, without requiring additional rules, which gives good universality and higher accuracy.
Drawings
In order to more clearly explain the technical solution of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious to those skilled in the art that other drawings can be obtained according to the drawings without any creative effort.
FIG. 1 is a flow chart of extracting a title from a document provided by an embodiment of the present application;
FIG. 2 is a flowchart of step S103 of extracting a title from a document according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for analyzing a document title hierarchy according to an embodiment of the present disclosure;
FIG. 4 is a flowchart of a first stage of determining an upper title ID for each title provided by an embodiment of the present application;
FIG. 5 is a flowchart of step S304 of a method for analyzing a document title hierarchy according to an embodiment of the present application;
FIG. 6 is a flowchart of step S305 of a method for analyzing a document title hierarchy according to an embodiment of the present application;
FIG. 7 is a title topology diagram generated according to Table 4 in an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an analysis apparatus for a document title hierarchy according to an embodiment of the present application.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In a document, a title typically summarizes the topic of the one or more paragraphs that follow it, so extracting and analyzing titles is one of the main ways of obtaining information from a document. In general, the extraction and analysis of titles in a document may include two stages: in the first stage, the titles are identified and extracted from the document; in the second stage, the extracted titles are further analyzed to determine the leading relationship between titles and the hierarchy of the titles.
The technical solutions of the embodiments of the present application are explained below for each of the two stages listed above.
In the first stage, a title is identified and extracted from a document.
FIG. 1 is a flowchart of extracting a title from a document according to an embodiment of the present application. As shown in fig. 1, the first stage of the embodiment of the present application for extracting the title from the document includes the following steps:
Step S101, extracting a title feature set and a text feature set from a known document corpus.
The known document corpus refers to a document corpus in which it is known which contents are titles and which contents are body text. It can be collected from historical document title extraction tasks, or obtained by annotating an unlabeled document corpus. The title feature set comprises the title contents of the known corpus, and the text feature set comprises the body text contents of the known corpus.
Illustratively, for the document corpus noted below (hereinafter referred to as corpus 1):
(V) Analysis of investment status /title/
1. Overall analysis of external equity investment /title/
□ Applicable √ Not applicable
(1) Significant equity investment /title/
□ Applicable √ Not applicable
(2) Significant non-equity investment /title/
□ Applicable √ Not applicable
(3) Financial assets measured at fair value /title/
□ Applicable √ Not applicable
(VI) Sale of major assets and equity /title/
√ Applicable □ Not applicable
On October 31, 2018, the company signed a cooperation framework agreement on XXX with XXX Investment (Shanghai) Co., Ltd. (hereinafter "XXX"). For details, see the announcement on the sale of assets disclosed on November 1, 2018.
In corpus 1, paragraphs are the labeling units: each paragraph that is a title is labeled "/title/", the contents of the paragraphs labeled "/title/" are added to the title feature set, and the contents of the remaining paragraphs are added to the text feature set.
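The labeling convention described above can be sketched in Python as follows. This is only an illustrative sketch: the function name `split_corpus`, the assumption that the "/title/" label sits at the end of a paragraph, and the two sample paragraphs are not taken from the application.

```python
TITLE_MARK = "/title/"  # label named in the text; its position at the paragraph end is an assumption


def split_corpus(paragraphs):
    """Split labeled paragraphs into a title feature set and a text feature set."""
    title_set, text_set = [], []
    for paragraph in paragraphs:
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # skip empty paragraphs
        if paragraph.endswith(TITLE_MARK):
            # Drop the label and keep only the title content.
            title_set.append(paragraph[: -len(TITLE_MARK)].rstrip())
        else:
            text_set.append(paragraph)
    return title_set, text_set


corpus = [
    "(1) Significant equity investment /title/",
    "□ Applicable √ Not applicable",
]
titles, texts = split_corpus(corpus)
print(titles)  # ['(1) Significant equity investment']
print(texts)   # ['□ Applicable √ Not applicable']
```

In practice each entry would also carry the style features mentioned in step S102 (font style, font size and so on); the sketch keeps only the text to show the split itself.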
Step S102, training with the title feature set and the text feature set to obtain an analysis model based on a machine learning classification algorithm.
The analysis model may be, for example, a Support Vector Machine (SVM) model, which acquires the ability to classify unknown samples by training on corpora of multiple categories. For example, when the model is trained with the title feature set and the text feature set, it acquires the ability to distinguish titles from body text in an unknown document. Specifically, the title feature set and the text feature set used as training inputs may include features such as the font style, font size, font height and font width of the corresponding titles and body text, so that the model learns these features during training and can identify titles and body text in an unknown document according to them.
Step S103, extracting the titles from the document by using the analysis model.
Before the analysis model is used to extract titles from a document, the document needs to be processed into a format that the analysis model accepts. As shown in FIG. 2, step S103 may specifically include the following steps:
Step S201, acquiring the coordinate position of each character in the document.
Specifically, character parsing may be performed on the whole document in units of characters to obtain character features such as the position of each character (for example, the X-axis and Y-axis coordinates of the character in the document page), the character size, the character style, and the like, wherein the X-axis coordinate may be the coordinate along the width direction of the document page, and the Y-axis coordinate may be the coordinate along the height direction of the document page.
To obtain the above features, a two-dimensional coordinate system with an X axis in the page width direction and a Y axis in the page height direction may be established on the document page. Once the coordinate system is determined, the X-axis coordinate, Y-axis coordinate, character size, etc. of each character in the document can be determined accordingly. Additionally, to recognize font styles, the characters in the document may be matched against a font library. For example, by identifying the coordinates covered by the strokes of a character, font feature data of the character can be obtained, and the font feature data can be matched against the font library to obtain the font style of the character. In this embodiment, the font style includes the font name and whether the font is bold, italic or underlined.
Step S202, extracting the document content in a row unit according to the coordinate position.
Following step S201, if the two-dimensional coordinate system has an X axis in the page width direction and a Y axis in the page height direction, characters on the same line of the document have the same Y-axis coordinate. Therefore, in step S202, characters having the same Y-axis coordinate can be treated as one line, and the document content can be extracted line by line.
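A minimal Python sketch of this line grouping is given below. The function name `group_into_lines`, the `(char, x, y)` tuple layout and the assumption that the Y coordinate grows downward are illustrative choices, not details from the application.

```python
from collections import defaultdict


def group_into_lines(chars):
    """Group (char, x, y) tuples into text lines keyed by identical Y coordinate."""
    rows = defaultdict(list)
    for char, x, y in chars:
        rows[y].append((x, char))
    lines = []
    # Assumption: Y grows downward, so ascending Y yields top-to-bottom order.
    for y in sorted(rows):
        ordered = sorted(rows[y])  # left to right by X coordinate
        lines.append("".join(char for _, char in ordered))
    return lines


chars = [("b", 20, 100), ("a", 10, 100), ("c", 10, 120)]
print(group_into_lines(chars))  # ['ab', 'c']
```

The sketch follows the text in requiring exactly equal Y coordinates. Coordinates extracted from real documents may differ slightly within one line, in which case some tolerance would be needed; that would be an extension beyond what the text describes.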
Step S203, inputting the extracted document content into the analysis model to extract the titles.
Since the analysis model has been trained in advance and has learned the features of a large number of titles, such as font style and font size, the content extracted in step S202 can be input to the analysis model line by line, and the classification capability of the model determines whether each input line is a title or body text.
Further, referring to corpus 1, the titles in a document may have various character features, such as the following three titles with different numbering formats:
(V) Analysis of investment status
1. Overall analysis of external equity investment
(1) Significant equity investment
The numbering formats differ because there is a hierarchical relationship between the titles in the document. The hierarchical relationship in the embodiments of the present application may include a subordinate relationship, a parallel relationship, and the like between titles. A subordinate relationship means that one title (the upper title) logically summarizes the content corresponding to another title (the lower title); a parallel relationship means that the contents summarized by two titles are logically parallel in the document.
Accordingly, since the titles in a document may have multiple character features, titles with different character features can be given different labels when the corpus is annotated, so that the analysis model can distinguish titles with different character features when extracting them.
By using the methods of steps S101-S103 and S201-S203 above, all titles can be extracted from the document and, at the same time, classified by their character features. The titles in each class share the same character features; such a class is referred to as a title category in the embodiments of the present application.
On the basis of title extraction, the embodiment of the application provides a method for analyzing a document title hierarchy.
Fig. 3 is a flowchart of a method for analyzing a document title hierarchy according to an embodiment of the present disclosure. As shown in fig. 3, the method may include the steps of:
Step S301, assigning a title ID to each title of the document, wherein the title IDs increase according to the order of the titles in the document.
Specifically, all the extracted titles may be sorted according to the order of the titles in the document, and then a title ID is assigned to each title in turn from the beginning to the end according to the sorting result of the titles. The title ID may be an arabic number that increments in ascending order with the position of the title in the document. For example, the first title in the document may be assigned a title ID of 1, and the second title may be assigned a title ID of 2, which are sequentially incremented.
Step S302, determining the category of each title according to the character features of the titles, and determining the arrangement number of each title in the category to which the title belongs, wherein the arrangement number is increased progressively according to the sequence of the title in the category to which the title belongs.
As described above for title categories, titles of the same category have the same character features and titles of different categories have different character features. The titles can therefore be divided into several categories according to their character features, and the titles within each category are ordered separately to determine the arrangement number of each title in its category. For example, the arrangement number may also be an Arabic numeral that increases according to the order of the titles within their category.
By way of example, after the embodiment of the present application performs step S301 and step S302 on the title, the following table can be obtained:
TABLE 1 (table image not reproduced in this text; it lists each title with its title ID, category and arrangement number)
In table 1 above, the title categories are represented by lower case english letters a, b, c, d, etc., i.e., each english letter refers to one title category. For example, the titles "one", "two", "three" and "four" belong to the category a, and the arrangement numbers are 1, 2, 3 and 4 in sequence.
It should be noted that, as shown in Table 1, the titles "1.1", "1.2", "3.1" and "3.2" all belong to category c but carry two separate sets of arrangement numbers: the titles "1.1" and "1.2" have arrangement numbers 1 and 2, and the titles "3.1" and "3.2" also have arrangement numbers 1 and 2. The reason is that, although these titles have the same category, their numbering ("1.1" versus "3.1") shows that they belong to different chapters of the document and that their contents are relatively independent. The embodiments of the present application therefore assign arrangement numbers to "1.1", "1.2" and to "3.1", "3.2" separately.
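Steps S301 and S302 can be illustrated with the following Python sketch. It rests on several assumptions that are not stated in the application: the numbering prefix of each title has already been extracted, the category is derived from the "shape" of that prefix (in the application the category comes from the character features recognized by the analysis model), the arrangement number is read from the last ordinal in the prefix, and only single-character Chinese numerals are handled. The names `shape_and_number` and `annotate` and the sample prefixes are hypothetical.

```python
import re

CN_DIGITS = "一二三四五六七八九十"  # single-character Chinese numerals only (assumption)


def shape_and_number(numbering):
    """Return (shape, last ordinal) for a numbering prefix such as '1.2' or '(3)'."""
    tokens = re.findall(rf"\d+|[{CN_DIGITS}]", numbering)
    if not tokens:
        raise ValueError(f"no ordinal found in {numbering!r}")
    last = tokens[-1]
    number = CN_DIGITS.index(last) + 1 if last in CN_DIGITS else int(last)
    # Replace every ordinal by a placeholder so that '1.1' and '3.2' share one shape.
    shape = re.sub(r"\d+", "N", numbering)
    shape = re.sub(rf"[{CN_DIGITS}]", "C", shape)
    return shape, number


def annotate(numberings):
    """Assign title ID (S301), category letter and arrangement number (S302)."""
    letters = {}  # shape -> category letter, in order of first appearance
    rows = []
    for title_id, numbering in enumerate(numberings, start=1):
        shape, number = shape_and_number(numbering)
        category = letters.setdefault(shape, chr(ord("a") + len(letters)))
        rows.append((title_id, numbering, category, number))
    return rows


rows = annotate(["一、", "（一）", "1.1", "1.2", "（二）", "二、", "3.1", "(1)"])
for row in rows:
    print(row)
```

Because the arrangement number is taken from the numbering itself, "3.1" receives the arrangement number 1 again, which matches the separate numbering of "1.1", "1.2" and "3.1", "3.2" described above.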
Step S303, determining a top title ID of each title according to the title ID, the category, and the arrangement number of the title, where the top title ID is the title ID of the top title of the title.
As described above, the hierarchical relationship in the embodiments of the present application may include subordinate and parallel relationships between titles. The concepts of upper title and lower title apply to two or more titles in a subordinate relationship. Specifically, if one title logically summarizes another title, and no other title between the two also summarizes the latter, then the former is the upper title of the latter, and the latter is a lower title of the former.
In the embodiment of the present application, determining the upper title ID of each title may generally include two stages.
FIG. 4 is a flowchart of the first stage of determining the upper title ID of each title according to an embodiment of the present application. As shown in FIG. 4, the first stage may specifically include the following steps:
Step S401, determining the head titles of each category according to the arrangement numbers, wherein a head title is a title with the minimum arrangement number in its category.
For example, in Table 1 the title "one" has the minimum arrangement number 1 and is therefore the head title of category a; the title "(one)" has the minimum arrangement number 1 and is therefore the head title of category b; the title "1.1" has the minimum arrangement number 1 and is therefore a head title of category c; the title "3.1" also has the minimum arrangement number 1 and is therefore also a head title of category c (that is, one title category may contain several head titles); and the title "(1)" has the minimum arrangement number 1 and is therefore the head title of category d.
Step S402, determining the upper title ID of each head title according to the order of the titles in the document, wherein the upper title ID of a head title is the title ID of the title immediately preceding it.
In step S402, if the head title is the first title in the document, its upper title ID is a preset starting value, for example 0.
For example, the head title "one" has title ID 1 and is therefore the first title in the document, so its upper title ID is the preset starting value 0. The title immediately preceding the head title "(one)" is the title "one", whose title ID is 1, so the upper title ID of "(one)" is 1. The title immediately preceding the head title "1.1" is the title "(one)", whose title ID is 2, so the upper title ID of "1.1" is 2. The title immediately preceding the head title "3.1" is the title "three", whose title ID is 8, so the upper title ID of "3.1" is 8. The title immediately preceding the head title "(1)" is the title "3.2", whose title ID is 10, so the upper title ID of "(1)" is 10.
Adding the upper title IDs of the head titles to Table 1 yields Table 2:
TABLE 2 (table image not reproduced in this text; it extends Table 1 with the upper title IDs of the head titles)
Step S403, acquiring the first co-located titles of each head title, and using the upper title ID of the head title as the upper title ID of each corresponding first co-located title, wherein a first co-located title is a title that is adjacent to the head title and has the same category as the head title.
For example, the title "1.2" is adjacent to the head title "1.1" and has the same category, so "1.2" is a first co-located title of "1.1" and takes its upper title ID; that is, the upper title ID of "1.2" is 2. Likewise, the title "3.2" is adjacent to the head title "3.1" and has the same category, so "3.2" is a first co-located title of "3.1", and the upper title ID of "3.2" is 8. The title "(2)" is adjacent to the head title "(1)" and has the same category, so "(2)" is a first co-located title of "(1)", and the upper title ID of "(2)" is 10.
In the embodiments of the present application, a further title that is adjacent to an already determined first co-located title and has the same category is also determined to be a first co-located title of the head title. For example, the head title "(1)" already has the first co-located title "(2)", and "(2)" in turn is adjacent to the title "(3)" of the same category, so "(3)" is also a first co-located title of "(1)", and the upper title ID of "(3)" is also 10.
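The first stage (steps S401 to S403) can be sketched in Python as follows. The title list is hypothetical: it was chosen to agree with the title IDs quoted in the text, but the positions of titles 5 to 7 and 14 are assumptions, since the tables themselves are not reproduced here. Treating "arrangement number equals 1" as the test for a head title is likewise an assumption, as is the function name `first_stage`.

```python
START_VALUE = 0  # preset starting value used for the first title in the document


def first_stage(titles):
    """Return {title ID: upper title ID} for head titles and their first co-located titles.

    titles: list of (title ID, category, arrangement number) in document order.
    """
    upper = {}
    for i, (title_id, category, number) in enumerate(titles):
        if number != 1:  # assumption: a head title has arrangement number 1
            continue
        # S402: a head title takes the ID of the title immediately preceding it.
        upper[title_id] = titles[i - 1][0] if i > 0 else START_VALUE
        # S403: directly following titles of the same category inherit that ID.
        j = i + 1
        while j < len(titles) and titles[j][1] == category and titles[j][2] != 1:
            upper[titles[j][0]] = upper[title_id]
            j += 1
    return upper


# Hypothetical document: "one", "(one)", "1.1", "1.2", "(two)", "two", "(three)",
# "three", "3.1", "3.2", "(1)", "(2)", "(3)", "four"
titles = [
    (1, "a", 1), (2, "b", 1), (3, "c", 1), (4, "c", 2), (5, "b", 2),
    (6, "a", 2), (7, "b", 3), (8, "a", 3), (9, "c", 1), (10, "c", 2),
    (11, "d", 1), (12, "d", 2), (13, "d", 3), (14, "a", 4),
]
stage_one = first_stage(titles)
print(stage_one)
```

With this input the head titles 1, 2, 3, 9 and 11 receive the upper title IDs 0, 1, 2, 8 and 10, and the co-located titles 4, 10, 12 and 13 inherit 2, 8, 10 and 10, in line with the examples in the text. The remaining titles are left for the second stage.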
Adding the upper title IDs of the first co-located titles to Table 2 yields Table 3:
TABLE 3 (table image not reproduced in this text; it extends Table 2 with the upper title IDs of the first co-located titles)
The second stage of determining the upper title ID of each title may specifically include: for the non-head titles in each category, acquiring the second co-located title of each non-head title in ascending order of arrangement number, and using the upper title ID of the second co-located title as the upper title ID of the corresponding non-head title, wherein the second co-located title is the title that precedes the non-head title, is nearest to it, and has the same category.
For example, the nearest preceding title of the same category as the non-head title "(two)" is "(one)", so "(one)" is the second co-located title of "(two)"; its upper title ID is 1, so the upper title ID of "(two)" is 1. As another example, the nearest preceding title of the same category as the non-head title "(three)" is "(two)", so "(two)" is the second co-located title of "(three)"; its upper title ID is 1, so the upper title ID of "(three)" is 1. As another example, the nearest preceding title of the same category as the non-head title "two" is "one", so "one" is the second co-located title of "two"; its upper title ID is 0, so the upper title ID of "two" is 0. As another example, the nearest preceding title of the same category as the non-head title "three" is "two", so "two" is the second co-located title of "three"; its upper title ID is 0, so the upper title ID of "three" is 0. As another example, the nearest preceding title of the same category as the non-head title "four" is "three", so "three" is the second co-located title of "four"; its upper title ID is 0, so the upper title ID of "four" is 0.
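The second stage can be sketched in Python as follows. The input data are hypothetical values consistent with the IDs quoted in the text, and the function name `second_stage` is illustrative. The sketch walks the titles in document order rather than per category in ascending arrangement number; within one category the two orders coincide, so every second co-located title is resolved before it is needed.

```python
def second_stage(titles, upper):
    """Fill in upper title IDs for the non-head titles left over from the first stage.

    titles: (title ID, category) pairs in document order.
    upper: {title ID: upper title ID} already known from the first stage.
    """
    result = dict(upper)
    last_in_category = {}  # category -> ID of the nearest preceding title of that category
    for title_id, category in titles:
        if title_id not in result:
            # Second co-located title: nearest preceding title of the same category.
            # Assumes every non-head title is preceded by a title of its category.
            co_located = last_in_category[category]
            result[title_id] = result[co_located]
        last_in_category[category] = title_id
    return result


# Hypothetical document order and first-stage result (assumed, see lead-in).
titles = [
    (1, "a"), (2, "b"), (3, "c"), (4, "c"), (5, "b"), (6, "a"), (7, "b"),
    (8, "a"), (9, "c"), (10, "c"), (11, "d"), (12, "d"), (13, "d"), (14, "a"),
]
stage_one = {1: 0, 2: 1, 3: 2, 4: 2, 9: 8, 10: 8, 11: 10, 12: 10, 13: 10}
all_upper = second_stage(titles, stage_one)
print(all_upper[5], all_upper[6], all_upper[7], all_upper[8], all_upper[14])  # 1 0 1 0 0
```

Here titles 5 and 7 stand for "(two)" and "(three)" and receive the upper title ID 1, while titles 6, 8 and 14 stand for "two", "three" and "four" and receive 0, as in the examples above.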
By updating the upper title ID of the non-top title to table 3, the following table 4 can be obtained:
Figure BDA0002232557660000082
Figure BDA0002232557660000091
TABLE 4
Thus, the embodiment of the present application determines the upper title IDs of all titles in the document through a two-stage analysis, which provides a sufficient basis for analyzing the leading relationships among the titles.
Step S304, determining the leading relationship among the titles according to the upper title ID.
In some embodiments, as shown in Table 5, the first lower title under "second part" is "sixth bar", whereas the upper title of "sixth bar" determined in step S303 is "first part". The title "sixth bar" could therefore be mistaken for belonging to the title "first part", when in fact the title "second part" is the upper title to which "sixth bar" belongs (its leading upper title). To avoid this error, step S304 further determines the leading upper title of each title.
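As a non-limiting illustration, the second stage described above can be sketched in Python. The `Title` class, the category labels, and the sample title list are assumptions made for this sketch and do not reproduce Table 3; the head titles are assumed to already carry the upper title IDs assigned in the first stage.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Title:
    title_id: int                   # increases with position in the document
    text: str
    category: str                   # hypothetical label derived from character features
    upper_id: Optional[int] = None  # upper title ID; None until determined


def fill_non_head_upper_ids(titles: List[Title]) -> None:
    """Second stage: a title that still lacks an upper title ID takes the one of
    the nearest preceding title in the same category (its second co-located title)."""
    last_in_category: Dict[str, Title] = {}
    # Document order visits each category in ascending arrangement number.
    for title in sorted(titles, key=lambda t: t.title_id):
        previous = last_in_category.get(title.category)
        if title.upper_id is None and previous is not None:
            title.upper_id = previous.upper_id
        last_in_category[title.category] = title


# Hypothetical input: only the two head titles have an upper title ID so far.
titles = [
    Title(1, "one", "numeral", upper_id=0),
    Title(2, "(one)", "parenthesized", upper_id=1),
    Title(3, "(two)", "parenthesized"),
    Title(4, "two", "numeral"),
    Title(5, "(three)", "parenthesized"),
    Title(6, "three", "numeral"),
    Title(7, "four", "numeral"),
]
fill_non_head_upper_ids(titles)
print([(t.text, t.upper_id) for t in titles])
```

Because arrangement numbers increase with document order inside a category, a single pass in title ID order is enough: by the time a non-head title is reached, its second co-located title has already been resolved.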
Figure BDA0002232557660000092
TABLE 5
Fig. 5 is a flowchart of step S304 of the document title hierarchy analysis method provided in an embodiment of the present application. As shown in fig. 5, step S304 may specifically include the following steps:
Step S501, determining, according to the upper title ID, whether a same-level upper title exists between each title and its upper title, where a same-level upper title is a title whose upper title ID is the same as the upper title ID of the upper title of the title.
For example, in Table 5, the upper title of the title "first bar" is "first part", and no title between "first bar" and "first part" has the same upper title ID as "first part"; therefore, no same-level upper title exists between "first bar" and "first part". As another example, the upper title of the title "sixth bar" is "first part", and the title "second part", which has the same upper title ID as "first part", exists between "sixth bar" and "first part"; therefore, the title "second part" is a same-level upper title of the title "sixth bar". As another example, the upper title of the title "ninth bar" is "first part", and the titles "second part" and "third part", which have the same upper title ID as "first part", exist between "ninth bar" and "first part"; therefore, the titles "second part" and "third part" are both same-level upper titles of the title "ninth bar".
Step S5021, if a same-level upper title exists, using the same-level upper title closest to the title as the leading upper title of the title.
For example, the title "second part" is the leading upper title of the title "sixth bar". As another example, the title "third part" is the same-level upper title closest to the title "ninth bar"; therefore, the title "third part" is the leading upper title of the title "ninth bar".
Step S5022, if no same-level upper title exists, using the upper title of the title as the leading upper title of the title.
For example, the title "first part" is the leading upper title of the title "first bar".
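As a non-limiting illustration, steps S501, S5021, and S5022 can be sketched in Python. The function name, the `(title_id, upper_title_id)` pair representation, and the six-row sample (an abbreviated stand-in for Table 5, not the table itself) are assumptions made for this sketch.

```python
from typing import Dict, List, Tuple

INITIAL_ID = 0  # preset initial value meaning "no upper title"


def leading_upper_ids(titles: List[Tuple[int, int]]) -> Dict[int, int]:
    """Map each title ID to the ID of its leading upper title.

    `titles` holds (title_id, upper_title_id) pairs in document order.
    """
    upper_of = dict(titles)
    result: Dict[int, int] = {}
    for title_id, upper_id in titles:
        if upper_id == INITIAL_ID:
            result[title_id] = INITIAL_ID
            continue
        # Upper title ID of the upper title: defines what "same level" means.
        level = upper_of[upper_id]
        # S501: same-level upper titles lying between the upper title and the title.
        same_level = [tid for tid, uid in titles
                      if upper_id < tid < title_id and uid == level]
        # S5021: take the closest one; S5022: otherwise keep the upper title.
        result[title_id] = max(same_level) if same_level else upper_id
    return result


# Hypothetical sample: every "bar" inherited upper title ID 1 ("first part").
names = {1: "first part", 2: "first bar", 3: "second part",
         4: "sixth bar", 5: "third part", 6: "ninth bar"}
sample = [(1, 0), (2, 1), (3, 0), (4, 1), (5, 0), (6, 1)]
leading = leading_upper_ids(sample)
for tid, parent in leading.items():
    print(names[tid], "->", names.get(parent, "(document root)"))
```

In this sample, "sixth bar" is reassigned from "first part" to "second part", and "ninth bar" to "third part", while "first bar" keeps "first part", which mirrors the three examples in the text.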
Step S305, determining the hierarchy of each title according to the leading relationship among the titles.
Fig. 6 is a flowchart of step S305 of the document title hierarchy analysis method provided in an embodiment of the present application. As shown in fig. 6, step S305 may include the following steps:
Step S601, generating a title topology of the document according to each title and its leading upper title.
Fig. 7 is a title topology generated according to table 4 in the embodiment of the present application, and as shown in fig. 7, the title topology may be presented in the form of a title tree. In the title tree, each title is used as a node in the title tree, and the nodes are connected through a connecting line.
Step S602, determining a hierarchy of each title according to the title topology.
Nodes at the two ends of a connecting line are in a subordinate relationship. For example, the title "3.1" and the title "(3)" are located at the two ends of a connecting line, so the two titles are in a subordinate relationship, and the title "3.1" is the leading upper title of the title "(3)". Titles that share the same leading upper title are in a parallel relationship. For example, the titles "(1)", "(2)", and "(3)" share the leading upper title "3.2", so they are parallel to one another.
After the hierarchy of each title is determined, a technician can accurately determine the content structure of the document from its titles. For example, from the subordinate relationship between the title "3.2" and the titles "(1)", "(2)", and "(3)", it can be determined that the title "3.2" summarizes the content corresponding to the titles "(1)", "(2)", and "(3)", and that the content corresponding to those three titles is logically parallel.
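As a non-limiting illustration, steps S601 and S602 can be sketched in Python: the title tree is stored as a mapping from each leading upper title to its lower titles, and the level of a title is its depth in that tree. The function names and the four-title sample are assumptions made for this sketch and do not reproduce Fig. 7.

```python
from collections import defaultdict
from typing import Dict, List

INITIAL_ID = 0  # preset initial value meaning "no upper title"


def build_title_tree(leading_upper: Dict[int, int]) -> Dict[int, List[int]]:
    """S601: map each leading upper title ID to the IDs of its lower titles."""
    children: Dict[int, List[int]] = defaultdict(list)
    for title_id in sorted(leading_upper):
        children[leading_upper[title_id]].append(title_id)
    return dict(children)


def title_levels(leading_upper: Dict[int, int]) -> Dict[int, int]:
    """S602: level 1 for titles without an upper title, otherwise parent level + 1."""
    levels: Dict[int, int] = {}
    # A leading upper title always precedes its lower titles, so ascending
    # title ID order guarantees the parent level is known first.
    for title_id in sorted(leading_upper):
        parent = leading_upper[title_id]
        levels[title_id] = 1 if parent == INITIAL_ID else levels[parent] + 1
    return levels


# Hypothetical sample: title 1 is "3.2"; titles 2-4 are "(1)", "(2)", "(3)".
leading_upper = {1: 0, 2: 1, 3: 1, 4: 1}
tree = build_title_tree(leading_upper)
levels = title_levels(leading_upper)
print(tree)
print(levels)
```

Titles listed under the same key of `tree` are parallel, and a key together with any entry in its list is a subordinate pair, which corresponds to the two relationships read off the connecting lines of the title tree.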
As can be seen from the above technical solution, the embodiment of the present application provides a method for analyzing a document title hierarchy. The method includes: assigning a title ID to each title of the document, the title IDs increasing according to the order of the titles in the document; determining the category of each title according to the character features of the titles, and determining the arrangement number of each title within the category to which it belongs, the arrangement numbers increasing according to the order of the titles within the category; determining the upper title ID of each title according to the title ID, the category, and the arrangement number of the title, the upper title ID being the title ID of the upper title of the title; determining the leading relationship among the titles according to the upper title ID; and determining the hierarchy of each title according to the leading relationship among the titles. The embodiment of the present application therefore determines the title hierarchy by analyzing features such as the position of each title in the document and its character features, without requiring additional rules, so that the method has good generality and higher accuracy.
The present application further provides an embodiment of a document title hierarchy analysis apparatus, which may be used to perform the method embodiments of the present application. For technical details not disclosed in the apparatus embodiment, refer to the method embodiments of the present application.
Fig. 8 is a schematic structural diagram of an analysis apparatus for a document title hierarchy according to an embodiment of the present application. As shown in fig. 8, the apparatus includes:
a title ID generation module 701, configured to assign a title ID to each title of a document, where the title IDs increase according to the order of the titles in the document;
an arrangement number generating module 702, configured to determine a category of each title according to the character features of the titles, and determine an arrangement number of each title in the category to which the title belongs, where the arrangement numbers increase according to the order of the titles in the category to which they belong;
an upper title ID generation module 703, configured to determine an upper title ID of each title according to the title ID, the category, and the arrangement number of the title, where the upper title ID is the title ID of the upper title of the title;
a leading relationship determining module 704, configured to determine a leading relationship among the titles according to the upper title ID;
a document title generating module 705, configured to determine a hierarchy of each title according to the leading relationship among the titles.
As can be seen from the above technical solution, the embodiment of the present application provides a document title hierarchy analysis apparatus. The apparatus is configured to: assign a title ID to each title of the document, the title IDs increasing according to the order of the titles in the document; determine the category of each title according to the character features of the titles, and determine the arrangement number of each title within the category to which it belongs, the arrangement numbers increasing according to the order of the titles within the category; determine the upper title ID of each title according to the title ID, the category, and the arrangement number of the title, the upper title ID being the title ID of the upper title of the title; determine the leading relationship among the titles according to the upper title ID; and determine the hierarchy of each title according to the leading relationship among the titles. The embodiment of the present application therefore determines the title hierarchy by analyzing features such as the position of each title in the document and its character features, without requiring additional rules, so that the apparatus has good generality and higher accuracy.
Embodiments of the present application further provide a server, where the server includes a memory and a processor, where the memory stores program instructions, and when the program instructions are executed by the processor, the server performs the methods of the foregoing embodiments.
The embodiment of the present application further provides a computer storage medium, where the computer storage medium includes computer instructions, and when the computer instructions are run on a user equipment, the user equipment is caused to execute the methods of the foregoing embodiments.
The embodiments of the present application also provide a computer program product, which when running on a computer, causes the computer to execute the methods of the above embodiments.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A method for analyzing a document title hierarchy, comprising:
assigning a title ID to each title of a document, the title ID increasing according to the order of the title in the document;
determining the category of each title according to the character features of the titles, and determining the arrangement number of each title in the category to which the title belongs, wherein the arrangement numbers are increased according to the sequence of the titles in the category to which the title belongs;
determining an upper title ID of each title according to the title ID, the category and the arrangement number of the title, wherein the upper title ID is the title ID of the upper title of the title;
determining the leading relationship among the titles according to the upper title ID;
and determining the hierarchy of each title according to the leading relationship among the titles.
2. The method of claim 1, wherein determining the upper title ID of each title according to the title ID, the category, and the arrangement number of the title comprises:
determining a head title of each category according to the arrangement number, wherein the head title is the title with the minimum arrangement number in each category;
determining the upper title ID of each head title according to the order of the head titles in the document, wherein the upper title ID is the title ID of the title immediately preceding the head title;
and acquiring a first co-located title of the head title, and using the upper title ID of the head title as the upper title ID of the corresponding first co-located title, wherein the first co-located title is a title which is adjacent to the head title in position and has the same category.
3. The method of claim 2, further comprising:
and if the head title is determined to be the first head title in the document according to the order of the head titles in the document, the upper title ID of the head title is a preset initial value.
4. The method of claim 2 or 3, further comprising:
for the non-head titles in each category other than the head title, respectively acquiring a second co-located title of each non-head title in ascending order of the arrangement number, and using the upper title ID of the second co-located title as the upper title ID of the corresponding non-head title, wherein the second co-located title is the title which precedes the non-head title, is nearest to the non-head title and has the same category.
5. The method according to claim 4, wherein the determining the leading relationship between the titles according to the upper title IDs comprises:
judging, according to the upper title ID, whether a same-level upper title exists between each title and its upper title, wherein the upper title ID of the same-level upper title is the same as the upper title ID of the upper title of the title;
if a same-level upper title exists, using the same-level upper title closest to the title as the leading upper title of the title;
and if no same-level upper title exists, using the upper title of the title as the leading upper title of the title.
6. The method of claim 5, wherein determining the hierarchy of each title according to the leading relationship between the titles comprises:
generating a title topology of the document according to each title and its leading upper title;
determining a hierarchy for each of the titles according to the title topology.
7. The method of claim 1, wherein, before assigning a title ID to each title of the document, the title ID increasing according to the order of the title in the document, the method further comprises:
extracting a title feature set and a text feature set from a known document corpus;
training with the title feature set and the text feature set to obtain a parsing model based on a machine learning classification algorithm;
extracting the titles from the document using the parsing model.
8. The method of claim 7, wherein extracting the title from the document using the parsing model comprises:
acquiring the coordinate position of each character in the document;
extracting document contents in a unit of a line according to the coordinate position;
inputting the extracted document content into the parsing model to extract the titles.
9. An apparatus for analyzing a document title hierarchy, comprising:
a title ID generation module, configured to assign a title ID to each title of a document, wherein the title IDs increase according to the order of the titles in the document;
an arrangement number generating module, configured to determine the category of each title according to the character features of the titles and determine the arrangement number of each title in the category to which the title belongs, wherein the arrangement numbers increase according to the order of the titles in the category to which the titles belong;
an upper title ID generation module, configured to determine an upper title ID of each title according to the title ID, the category and the arrangement number of the title, wherein the upper title ID is the title ID of the upper title of the title;
a leading relationship determining module, configured to determine a leading relationship among the titles according to the upper title ID;
and a document title generating module, configured to determine the hierarchy of each title according to the leading relationship among the titles.
10. A server, characterized in that the server comprises a memory and a processor, the memory storing program instructions which, when executed by the processor, cause the server to perform the method of any one of claims 1-8.
CN201910972519.9A 2019-10-14 2019-10-14 Analysis method, device and server for document title level Active CN110688842B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910972519.9A CN110688842B (en) 2019-10-14 2019-10-14 Analysis method, device and server for document title level

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910972519.9A CN110688842B (en) 2019-10-14 2019-10-14 Analysis method, device and server for document title level

Publications (2)

Publication Number Publication Date
CN110688842A true CN110688842A (en) 2020-01-14
CN110688842B CN110688842B (en) 2023-06-09

Family

ID=69112391

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910972519.9A Active CN110688842B (en) 2019-10-14 2019-10-14 Analysis method, device and server for document title level

Country Status (1)

Country Link
CN (1) CN110688842B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 230000 zone B, 19th floor, building A1, 3333 Xiyou Road, hi tech Zone, Hefei City, Anhui Province
Applicant after: Dingfu Intelligent Technology Co.,Ltd.
Address before: Room 630, 6th floor, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing
Applicant before: DINFO (BEIJING) SCIENCE DEVELOPMENT Co.,Ltd.
GR01 Patent grant