CN109977366B - Catalog generation method and device - Google Patents

Catalog generation method and device

Info

Publication number
CN109977366B
Authority
CN
China
Prior art keywords
paragraph
paragraphs
format
catalog
title
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711450681.1A
Other languages
Chinese (zh)
Other versions
CN109977366A (en)
Inventor
辛洋
蒙燕玲
皮霞林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Office Software Inc
Zhuhai Kingsoft Office Software Co Ltd
Guangzhou Kingsoft Mobile Technology Co Ltd
Original Assignee
Beijing Kingsoft Office Software Inc
Zhuhai Kingsoft Office Software Co Ltd
Guangzhou Kingsoft Mobile Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Office Software Inc, Zhuhai Kingsoft Office Software Co Ltd, Guangzhou Kingsoft Mobile Technology Co Ltd
Priority to CN201711450681.1A
Publication of CN109977366A
Application granted
Publication of CN109977366B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/12: Use of codes for handling textual entities
    • G06F40/151: Transformation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The embodiment of the invention provides a catalog generation method, which comprises the following steps: obtaining paragraph formats, format attributes, paragraph numbers and paragraph identifiers of paragraphs of a catalog to be generated in a document; selecting paragraphs as titles from the paragraphs of the catalog to be generated according to the paragraph identifiers and the paragraph formats; obtaining the hierarchical relationship between the selected paragraphs according to the paragraph numbers and format attributes of the selected paragraphs; and generating the catalogue of the paragraphs of the catalogue to be generated according to the hierarchical relation. By applying the catalog generation method provided by the embodiment of the invention, the catalog can be automatically generated, so that the catalog generation efficiency is improved, and the user experience is improved.

Description

Catalog generation method and device
Technical Field
The present invention relates to the field of computer software applications, and in particular, to a method and apparatus for generating a catalog.
Background
The catalogue can intuitively present the structure and the hierarchy of the document for the user, and help the user to quickly locate the content in the document, thereby facilitating the understanding and the reading of the document by the user.
However, in the current method for generating the catalogue, it is necessary to manually select words as the content of the catalogue from the document, set information such as a title style, an outline level, etc. for the selected words one by one, and then generate the catalogue based on the information. Therefore, the catalog generation process is very complicated, so that the catalog generation efficiency of the user is low, and the experience for the user is poor.
Disclosure of Invention
The embodiment of the invention aims to provide a catalog generation method and device, so as to improve catalog generation efficiency and user experience.
In order to solve the above problems, an embodiment of the present invention provides a method for generating a directory, including:
obtaining paragraph formats, format attributes, paragraph numbers and paragraph identifiers of paragraphs of a catalog to be generated in a document;
selecting paragraphs as titles from the paragraphs of the catalog to be generated according to the paragraph identifiers and the paragraph formats;
obtaining the hierarchical relationship between the selected paragraphs according to the paragraph numbers and format attributes of the selected paragraphs;
and generating the catalogue of the paragraphs of the catalogue to be generated according to the hierarchical relation.
Preferably, the selecting, according to the paragraph identifier and the paragraph format, the paragraph as the title from the paragraphs in the catalog to be generated includes:
determining paragraphs of which paragraph identifiers do not belong to preset non-title paragraph identifiers in paragraphs of the catalog to be generated;
according to the paragraph format, a paragraph is selected as a title from the determined paragraphs.
Preferably, the selecting a paragraph as the title from the determined paragraphs according to the paragraph format includes:
calculating, for each determined paragraph, a predicted value of the paragraph being a title according to the paragraph format;
selecting a paragraph as a title from the determined paragraphs based on the predicted value of each determined paragraph.
Preferably, the paragraph format of a paragraph includes: numbering format, word size, last character of text and text length;
the calculating, for each determined paragraph, of a predicted value of the paragraph being a title according to the paragraph format includes:
according to the word sizes of the texts in the paragraphs, calculating the word size difference between each determined paragraph and the preset title word size;
obtaining a predicted value corresponding to the determined predicted element of each paragraph according to the following expression, wherein the predicted element of one paragraph comprises: the paragraph numbering format, the word size difference, the last character of the text in the paragraph and the length of the text in the paragraph:
predicted value corresponding to a predicted element = preset weight of the predicted element + preset offset of the predicted element;
calculating, based on the obtained predicted values, the predicted value of each determined paragraph being a title.
Preferably, the non-title paragraph identification includes:
paragraph identifiers representing sub-documents, paragraph identifiers representing tables, paragraph identifiers representing directory fields, paragraph identifiers representing pictures, and paragraph identifiers identifying blank paragraphs.
Preferably, the obtaining the hierarchical relationship between the selected paragraphs according to the paragraph numbers and the format attribute of the selected paragraphs includes:
dividing the selected paragraphs into paragraph groups according to format attributes of the paragraphs;
determining the management interval of each paragraph in each paragraph group according to the paragraph number and the following expression:
when a next adjacent paragraph exists in the paragraph group to which the paragraph belongs, the management interval of the paragraph is: [paragraph number of the paragraph, paragraph number of the next adjacent paragraph in the paragraph group to which the paragraph belongs - 1]; when no such next adjacent paragraph exists in the paragraph group, the management interval of the paragraph is: [paragraph number of the paragraph, paragraph number of the paragraph];
obtaining the hierarchical relationship between the selected paragraphs in paragraph-number order, according to the management intervals of the selected paragraphs and the format attributes of the selected paragraphs.
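The interval rule above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the rule's condition refers to whether a next adjacent paragraph exists in the same paragraph group, and the function name `management_intervals` and the group labels "A", "B", "C" are hypothetical.

```python
def management_intervals(groups):
    """Map each paragraph number to its (start, end) management interval.

    groups: dict mapping a format attribute to the paragraph numbers of the
    selected paragraphs that share it (hypothetical representation).
    """
    intervals = {}
    for numbers in groups.values():
        ordered = sorted(numbers)
        for i, n in enumerate(ordered):
            if i + 1 < len(ordered):
                # A next paragraph exists in the same group: the interval
                # ends just before that paragraph.
                intervals[n] = (n, ordered[i + 1] - 1)
            else:
                # Last paragraph of its group: the interval is the
                # paragraph itself.
                intervals[n] = (n, n)
    return intervals


groups = {"A": [1, 9], "B": [3, 6], "C": [11]}
intervals = management_intervals(groups)
```

Here paragraph 1 gets the interval (1, 8) because the next paragraph in its group is paragraph 9, while paragraphs 9, 6 and 11 are the last in their groups and get single-number intervals.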
Preferably, the step of obtaining the hierarchical relationship between the selected paragraphs according to the management interval of the selected paragraphs and the format attribute of the selected paragraphs includes:
obtaining the hierarchical relationship between each two adjacent paragraphs among the selected paragraphs, in paragraph-number order, in the following manner:
determining an interval relation between the management interval of a first paragraph and the management interval of a second paragraph, wherein the first paragraph and the second paragraph are two paragraphs that are adjacent in paragraph-number order among the selected paragraphs, the second paragraph following the first paragraph;
when the interval relation is a separation relation, judging whether the format attribute of the first paragraph is the same as the format attribute of the second paragraph;
if they are the same, determining that the hierarchical relationship between the first paragraph and the second paragraph is: peer paragraphs;
if they are not the same, searching for a similar paragraph, wherein the similar paragraph is: a selected paragraph that precedes the first paragraph in paragraph-number order and has the same format attribute as the second paragraph; if the similar paragraph exists, determining that the hierarchical relationship between the second paragraph and the similar paragraph is: peer paragraphs; if the similar paragraph does not exist, determining that the hierarchical relationship between the first paragraph and the second paragraph is: the paragraph with the smaller paragraph number is the upper-level paragraph of the paragraph with the larger paragraph number;
and when the interval relation is a non-separation relation, executing the step of searching similar paragraphs.
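The pairwise procedure above can be illustrated with a short sketch. Several points are assumptions rather than statements of the source: hierarchy is represented as an integer level (1 for the top of the catalog), the first selected paragraph is placed at level 1, the nearest preceding match is taken as the similar paragraph, and the relation in which the smaller-numbered paragraph comes "before" the larger-numbered one is read as a parent (upper-level) relation. The names `separated` and `assign_levels` are hypothetical.

```python
def separated(a, b):
    """True when two closed intervals share no paragraph number."""
    return a[1] < b[0] or b[1] < a[0]


def assign_levels(titles, intervals):
    """titles: list of (paragraph_number, format_attribute), sorted by number.
    intervals: dict mapping paragraph number to its management interval.
    Returns a dict mapping paragraph number to its level in the catalog."""
    levels = {}
    for i, (num, fmt) in enumerate(titles):
        if i == 0:
            levels[num] = 1  # assumption: first title starts at the top level
            continue
        prev_num, prev_fmt = titles[i - 1]
        if separated(intervals[prev_num], intervals[num]) and prev_fmt == fmt:
            levels[num] = levels[prev_num]  # peers
            continue
        # Search for a similar paragraph: an earlier title, before the first
        # paragraph of the pair, with the second paragraph's format attribute.
        similar = next(
            (n for n, f in reversed(titles[: i - 1]) if f == fmt), None
        )
        if similar is not None:
            levels[num] = levels[similar]  # peer of the similar paragraph
        else:
            levels[num] = levels[prev_num] + 1  # first paragraph is the parent
    return levels


titles = [(1, "A"), (3, "B"), (6, "B"), (9, "A"), (11, "C")]
intervals = {1: (1, 8), 3: (3, 5), 6: (6, 6), 9: (9, 9), 11: (11, 11)}
levels = assign_levels(titles, intervals)
```

In this example paragraph 3 lies inside the interval of paragraph 1 and no earlier paragraph shares its format, so it becomes a child of paragraph 1; paragraph 9 finds paragraph 1 as its similar paragraph and returns to the top level.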
Preferably, the determining whether the format attribute of the first paragraph is the same as the format attribute of the second paragraph includes:
judging whether the first paragraph and the second paragraph are numbered;
if both the first paragraph and the second paragraph are numbered, judging whether the format attribute of the first paragraph is the same as the format attribute of the second paragraph according to the number format of the first paragraph and the number format of the second paragraph;
if neither the first paragraph nor the second paragraph is numbered, judging whether the format attribute of the first paragraph is the same as the format attribute of the second paragraph according to the text setting of the first paragraph and the text setting of the second paragraph.
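A minimal sketch of this comparison is given below. The dictionary keys `numbering_format` and `text_settings`, the format label "decimal_dot" and the (centered, bold) tuple are hypothetical representations chosen for illustration. The text does not say what happens when only one of the two paragraphs is numbered; the sketch treats that case as "different", which is an assumption.

```python
def same_format_attribute(first, second):
    """Compare the format attributes of two paragraphs.

    Each paragraph is a dict with:
      "numbering_format": a label for the numbering pattern, or None if the
                          paragraph is not numbered (hypothetical encoding)
      "text_settings":    a tuple such as (centered, bold)
    """
    first_numbered = first["numbering_format"] is not None
    second_numbered = second["numbering_format"] is not None
    if first_numbered and second_numbered:
        # Both numbered: compare numbering formats only.
        return first["numbering_format"] == second["numbering_format"]
    if not first_numbered and not second_numbered:
        # Neither numbered: compare text settings.
        return first["text_settings"] == second["text_settings"]
    # Mixed case is not covered by the text; treated as different (assumption).
    return False


numbered_a = {"numbering_format": "decimal_dot", "text_settings": (False, True)}
numbered_b = {"numbering_format": "decimal_dot", "text_settings": (True, True)}
plain_a = {"numbering_format": None, "text_settings": (True, True)}
plain_b = {"numbering_format": None, "text_settings": (True, False)}
```

Note that two numbered paragraphs with the same numbering format compare as equal even when their text settings differ, which follows the rule as stated.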
The embodiment of the invention also provides a catalog generating device, which comprises:
the paragraph information acquisition module is used for acquiring paragraph formats, format attributes, paragraph numbers and paragraph identifiers of paragraphs of the catalogue to be generated in the document;
a paragraph screening module, configured to select a paragraph as a title from the paragraphs in the catalog to be generated according to a paragraph identifier and a paragraph format;
the hierarchy analysis module is used for obtaining the hierarchy relation among the selected paragraphs according to the paragraph numbers and the format attributes of the selected paragraphs;
and the catalog generation module is used for generating the catalog of the paragraph of the catalog to be generated according to the hierarchical relationship.
Preferably, the paragraph screening module includes:
a first screening submodule, configured to determine paragraphs in the paragraphs of the catalog to be generated, where the paragraph identifiers do not belong to preset non-title paragraph identifiers;
and a second filtering sub-module, configured to select a paragraph as a title from the determined paragraphs according to the paragraph format.
Preferably, the second screening submodule includes:
a predicted value calculation unit, configured to calculate, for each determined paragraph, a predicted value of the paragraph being a title according to the paragraph format;
and a title selecting unit, configured to select a paragraph as a title from the determined paragraphs based on the predicted value of each determined paragraph.
Preferably, the paragraph format of a paragraph includes: numbering format, word size, last character of text and text length:
the predicted value calculating unit is specifically configured to:
according to the word sizes of the texts in the paragraphs, calculating the word size difference between each determined paragraph and the preset title word size;
obtaining a predicted value corresponding to the determined predicted element of each paragraph according to the following expression, wherein the predicted element of one paragraph comprises: the paragraph numbering format, the word size difference, the last character of the text in the paragraph and the length of the text in the paragraph:
Predicted value corresponding to a predicted element = preset weight of the predicted element + preset offset of the predicted element;
calculating, based on the obtained predicted values, the predicted value of each determined paragraph being a title.
Preferably, the non-title paragraph identification includes:
paragraph identifiers representing sub-documents, paragraph identifiers representing tables, paragraph identifiers representing directory fields, paragraph identifiers representing pictures, and paragraph identifiers identifying blank paragraphs.
Preferably, the hierarchical analysis module includes:
a grouping sub-module for dividing the selected paragraphs into paragraph groups according to format attributes of the paragraphs;
the interval dividing sub-module is used for determining the management interval of each paragraph in each paragraph group according to the paragraph number and the following expression:
when a next adjacent paragraph exists in the paragraph group to which the paragraph belongs, the management interval of the paragraph is: [paragraph number of the paragraph, paragraph number of the next adjacent paragraph in the paragraph group to which the paragraph belongs - 1]; when no such next adjacent paragraph exists in the paragraph group, the management interval of the paragraph is: [paragraph number of the paragraph, paragraph number of the paragraph];
and the hierarchy dividing sub-module is used for obtaining the hierarchical relationship between the selected paragraphs in paragraph-number order, according to the management intervals of the selected paragraphs and the format attributes of the selected paragraphs.
Preferably:
the hierarchy dividing sub-module is specifically configured to obtain the hierarchical relationship between each two adjacent paragraphs among the selected paragraphs, in paragraph-number order, in the following manner:
determining an interval relation between the management interval of a first paragraph and the management interval of a second paragraph, wherein the first paragraph and the second paragraph are two paragraphs that are adjacent in paragraph-number order among the selected paragraphs, the second paragraph following the first paragraph;
when the interval relation is a separation relation, judging whether the format attribute of the first paragraph is the same as the format attribute of the second paragraph;
if they are the same, determining that the hierarchical relationship between the first paragraph and the second paragraph is: peer paragraphs;
if they are not the same, searching for a similar paragraph, wherein the similar paragraph is: a selected paragraph that precedes the first paragraph in paragraph-number order and has the same format attribute as the second paragraph; if the similar paragraph exists, determining that the hierarchical relationship between the second paragraph and the similar paragraph is: peer paragraphs; if the similar paragraph does not exist, determining that the hierarchical relationship between the first paragraph and the second paragraph is: the paragraph with the smaller paragraph number is the upper-level paragraph of the paragraph with the larger paragraph number;
and when the interval relation is a non-separation relation, executing the step of searching for a similar paragraph.
Preferably, the hierarchical dividing submodule determines whether the format attribute of the first paragraph is the same as the format attribute of the second paragraph, including:
judging whether the first paragraph and the second paragraph are numbered;
if both the first paragraph and the second paragraph are numbered, judging whether the format attribute of the first paragraph is the same as the format attribute of the second paragraph according to the number format of the first paragraph and the number format of the second paragraph;
if neither the first paragraph nor the second paragraph is numbered, judging whether the format attribute of the first paragraph is the same as the format attribute of the second paragraph according to the text setting of the first paragraph and the text setting of the second paragraph.
The embodiment of the invention also provides an electronic device, which comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
and a processor for implementing any of the above-described method steps when executing a program stored on the memory.
The embodiments of the present invention also provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform any of the above-described directory generation methods.
According to the catalog generation method and device provided by the embodiments of the invention, the paragraph format, format attribute, paragraph number and paragraph identifier of each paragraph of the catalog to be generated in the document are obtained, the paragraphs serving as titles are screened from the paragraphs of the catalog to be generated, the hierarchical structure of these paragraphs is divided, and the catalog is generated automatically, which improves catalog generation efficiency and user experience. Of course, it is not necessary for any one product or method of practicing the invention to achieve all of the advantages set forth above at the same time.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for generating a catalog according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating another method for generating a directory according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating another method for generating a directory according to an embodiment of the present invention;
FIG. 4 is a diagram of an exemplary catalog generated by applying the solution provided by the embodiments of the present invention;
FIG. 5 is a schematic diagram of a catalog generating apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of another catalog generating apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of another catalog generating apparatus according to an embodiment of the present invention;
FIG. 8 is a structural diagram of an electronic device.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
To solve the prior-art problem that the document catalog generation process is very complicated, which makes catalog generation by the user inefficient, the invention provides a catalog generation method and device.
The following generally describes a catalog generation method provided by an embodiment of the present invention.
In one implementation manner of the present invention, the catalog generation method includes:
obtaining paragraph formats, format attributes, paragraph numbers and paragraph identifiers of paragraphs of a catalog to be generated in a document;
selecting paragraphs as titles from the paragraphs of the catalog to be generated according to the paragraph identifiers and the paragraph formats;
obtaining the hierarchical relationship between the selected paragraphs according to the paragraph numbers and format attributes of the selected paragraphs;
and generating the catalogue of the paragraphs of the catalogue to be generated according to the hierarchical relationship.
From the above, when the scheme provided by the embodiment of the invention is applied to generate the catalogue in the document, the paragraph format, the format attribute, the paragraph number and the paragraph identifier of each paragraph of the catalogue to be generated in the document are obtained, the paragraphs serving as the titles are screened from the paragraphs of the catalogue to be generated, the hierarchical structure of the paragraphs is divided, and the catalogue is automatically generated, so that the generation efficiency of the catalogue is improved, and the user experience is improved.
The method for generating the catalogue provided by the embodiment of the invention is described in detail below through specific embodiments.
FIG. 1 is a flow chart of a catalog generation method according to an embodiment of the present invention. The method includes the following steps:
Step S101: Obtain the paragraph formats, format attributes, paragraph numbers and paragraph identifiers of the paragraphs of the catalog to be generated in the document.
In one implementation, the paragraph format of a paragraph includes the numbering format of the paragraph, the word size (font size) of the text, the last character of the text, the length of the text in the paragraph, and so on. The format attribute of a paragraph can be determined from the numbering format and the text settings of the paragraph, where the text settings include whether the paragraph is centered, whether it is bold, and the like. The paragraph number of a paragraph is its sequence number among all paragraphs of the catalog to be generated. The paragraph identifier of a paragraph represents the content of the paragraph; for example, the content may be a picture, a directory field, a sub-document, etc.
In this step, the paragraphs of the catalog to be generated may be all paragraphs in the document, may be paragraphs selected by the user, or may be all paragraphs in a specific page number in the document, which may be specifically determined by the needs of the user, and the embodiment of the present invention is not limited to this.
Step S102: Select the paragraphs serving as titles from the paragraphs of the catalog to be generated according to the paragraph identifiers and the paragraph formats.
In one implementation, the paragraphs of the catalog to be generated may be traversed in turn; each traversed paragraph is judged, and the paragraphs serving as titles are screened out, until all paragraphs of the catalog to be generated have been traversed.
Of course, the paragraphs of the catalog to be generated may also be screened without regard to order, as long as all of them are screened.
In this step, the title paragraphs in the paragraphs of the catalog to be generated are distinguished from other paragraphs by screening the paragraphs of the catalog to be generated. The title itself is a summary of the document content, so that only the hierarchical relationship between paragraphs as the title needs to be divided in the following, thereby improving the efficiency of generating the catalogue.
Step S103: a hierarchical relationship between the selected paragraphs is obtained from the paragraph numbers and format properties of the selected paragraphs.
In this step, the hierarchical relationship between the selected paragraphs is the hierarchical relationship between the titles, and the hierarchical relationship between the titles can represent the hierarchical structure between the paragraphs of the catalog to be generated.
Step S104: Generate the catalog of the paragraphs of the catalog to be generated according to the hierarchical relationship.
In one implementation, the generated catalog may be presented in the document, for example on the page before the paragraphs of the catalog to be generated or on the page after them; the embodiment of the invention is not limited thereto.
As can be seen from the above, when the scheme provided by the embodiment of the invention is applied to generate a catalog, the paragraph format, format attribute, paragraph number and paragraph identifier of each paragraph of the catalog to be generated in the document are obtained, the paragraphs serving as titles are screened from the paragraphs of the catalog to be generated, the hierarchical structure of these paragraphs is divided, and the catalog is generated automatically, which improves catalog generation efficiency and user experience.
FIG. 2 is a flow chart of another catalog generation method according to an embodiment of the present invention. The method includes the following steps:
Step S201: Obtain the paragraph format, format attribute, paragraph number and paragraph identifier of each paragraph of the catalog to be generated in the document.
Step S202: Determine, among the paragraphs of the catalog to be generated, the paragraphs whose paragraph identifiers do not belong to the preset non-title paragraph identifiers.
Paragraphs with different content have different paragraph identifiers. The paragraph identifiers that mark a paragraph as definitely not a title can therefore be determined first, and the paragraphs that may be title paragraphs are then screened out of all paragraphs of the catalog to be generated according to their paragraph identifiers.
In one implementation, the preset non-title paragraph identification includes: paragraph identifiers representing sub-documents, paragraph identifiers representing tables, paragraph identifiers representing directory fields, paragraph identifiers representing pictures, and paragraph identifiers identifying blank paragraphs, although other paragraph identifiers capable of determining that a paragraph is not a title paragraph may be included.
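The identifier-based screening of step S202 can be sketched as a simple membership test. The identifier labels below ("sub_document", "table" and so on) and the dictionary layout are hypothetical; the source names the categories of non-title identifiers but not their concrete values.

```python
# Hypothetical labels for the preset non-title paragraph identifiers.
NON_TITLE_IDENTIFIERS = frozenset(
    {"sub_document", "table", "directory_field", "picture", "blank"}
)


def candidate_paragraphs(paragraphs):
    """Keep paragraphs whose identifier is not a preset non-title identifier."""
    return [p for p in paragraphs if p["identifier"] not in NON_TITLE_IDENTIFIERS]


paragraphs = [
    {"number": 1, "identifier": "text", "text": "1 Overview"},
    {"number": 2, "identifier": "picture", "text": ""},
    {"number": 3, "identifier": "table", "text": ""},
    {"number": 4, "identifier": "text", "text": "Body text."},
]
kept = candidate_paragraphs(paragraphs)
```

Paragraph 4 survives this step even though it is body text; separating it from real titles is the job of the format-based screening that follows.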
Step S203: according to the paragraph format, a paragraph is selected as a title from the determined paragraphs.
In the previous step, the paragraphs that may be titles were screened out of the paragraphs of the catalog to be generated by using the paragraph identifiers; among these paragraphs there may still be paragraphs that are not titles, such as body-text paragraphs. Therefore, in this step, the paragraphs serving as titles are further selected from the paragraphs screened in the previous step by using the paragraph formats.
In one implementation, a predicted value of each determined paragraph being a title may be calculated based on the paragraph format, and the paragraphs serving as titles may then be selected based on the predicted values.
Specifically, the predicted value of each paragraph as a title may be calculated by:
Step 1: Calculate the word size difference between each determined paragraph and the preset title word size according to the word sizes of the texts in the paragraphs.
Step 2: Obtain the predicted value corresponding to each predicted element of each determined paragraph according to the following expression:
predicted value corresponding to a predicted element = preset weight of the predicted element + preset offset of the predicted element
The prediction elements of a paragraph include: the numbering format of the paragraph, the word size difference, the last character of the text in the paragraph, and the length of the text in the paragraph. The preset weight of each prediction element is determined by the influence of that prediction element on the prediction result; the preset offset of each prediction element is the maximum offset range allowed for that prediction element in the algorithm and reflects its confidence interval. Both are obtained in advance by training with a machine learning algorithm.
Step 3: Calculate, based on the obtained predicted values, the predicted value of each determined paragraph being a title.
In one implementation, a Sigmoid function may be applied to the predicted values of the prediction elements calculated in the previous step to obtain the predicted value of each paragraph being a title.
Specifically, when the Sigmoid function is used to calculate the predicted value of each paragraph being a title, a threshold may be set: when the predicted value of a paragraph is greater than the threshold, the paragraph is judged to be a title paragraph, and when it is less than the threshold, the paragraph is judged to be a body-text paragraph. In one implementation, the threshold may be set to 0.5.
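The scoring in steps 1 to 3 can be illustrated with the sketch below. It rests on several assumptions that the text does not confirm: each prediction element is encoded as a number, the per-element predicted value is read as weight times element value plus offset (the expression as printed mentions only the weight and the offset), and the per-element values are summed before the Sigmoid function is applied. All weights, offsets, the preset title word size and the function names are hypothetical; only the threshold 0.5 comes from the text.

```python
import math

# Hypothetical weights and offsets; the source says they are learned in advance.
WEIGHTS = {"numbered": 2.0, "size_diff": -0.5, "ends_with_punct": -3.0, "length": -0.05}
OFFSETS = {"numbered": 0.0, "size_diff": 1.0, "ends_with_punct": 0.0, "length": 1.0}
TITLE_WORD_SIZE = 16.0  # hypothetical preset title word size
THRESHOLD = 0.5         # value given in the text


def prediction_elements(text, word_size, numbered):
    """Encode the four prediction elements as numbers (assumed encoding)."""
    stripped = text.rstrip()
    return {
        "numbered": 1.0 if numbered else 0.0,
        "size_diff": abs(word_size - TITLE_WORD_SIZE),
        "ends_with_punct": 1.0 if stripped.endswith((".", "。", ";", ",")) else 0.0,
        "length": float(len(text)),
    }


def title_score(text, word_size, numbered):
    """Predicted value of the paragraph being a title, in the range (0, 1)."""
    elements = prediction_elements(text, word_size, numbered)
    # Assumed reading: weight * element value + offset, per element.
    per_element = [WEIGHTS[k] * v + OFFSETS[k] for k, v in elements.items()]
    return 1.0 / (1.0 + math.exp(-sum(per_element)))


def is_title(text, word_size, numbered):
    return title_score(text, word_size, numbered) > THRESHOLD


heading = is_title("2 Background", 16.0, True)
body = is_title("This paragraph explains the background in detail.", 10.5, False)
```

With these illustrative parameters a short numbered line at the title word size scores above the threshold, while a longer unnumbered sentence ending in a period scores below it.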
It should be noted that the paragraph format of each paragraph includes a plurality of elements, for example: numbering format, word size, last character of the text, text length, line spacing, character spacing, etc. In the embodiment of the invention, statistics are computed for each element with a machine learning algorithm, and the elements that perform best in training are selected as the basis for subsequent calculation. As a result, the predicted value of a paragraph being a title is calculated from the numbering format, word size, last character of the text and text length of the paragraph. However, this is described by way of example only, and the present invention is not limited thereto.
Step S204: a hierarchical relationship between the selected paragraphs is obtained from the paragraph numbers and format properties of the selected paragraphs.
Step S205: a catalog of the paragraphs of the catalog to be generated is generated according to the hierarchical relationship.
Step S201 is the same as step S101 in the embodiment of the invention shown in fig. 1, and steps S204 to S205 are the same as steps S103 to S104 in the embodiment of the invention shown in fig. 1, and are not described in detail herein.
As can be seen from the above, when a catalog is generated by applying the scheme provided by the embodiment of the invention, the paragraphs whose paragraph identifiers do not belong to the preset non-title paragraph identifiers are first selected according to the obtained paragraph identifier of each paragraph of the catalog to be generated; the paragraphs serving as titles are then selected according to the paragraph format of each paragraph; and the hierarchical relationship among the selected paragraphs is then obtained according to the paragraph numbers and format attributes of the selected paragraphs. The catalog is thus generated automatically, which improves the efficiency of catalog generation and the user experience.
As shown in fig. 3, a flow chart of another directory generating method according to an embodiment of the present invention includes the following steps:
Step S301: the paragraph format, format attribute, paragraph number and paragraph identifier of each paragraph of the catalog to be generated in the document are obtained.
Step S302: paragraphs serving as titles are selected from the paragraphs of the catalog to be generated according to the paragraph identifiers and the paragraph formats.
Step S303: the selected paragraph is divided into paragraph groups according to the format attribute of the paragraph.
In one implementation, paragraphs of the same format attribute are divided into a set, thereby dividing paragraphs of the catalog to be generated into different paragraph groups.
Step S304: the management interval of each paragraph in each paragraph group is determined according to the paragraph number and the following expression.
Specifically, when a next adjacent paragraph exists for a paragraph in its paragraph group, the management interval of the paragraph is: [paragraph number of this paragraph, paragraph number of the next adjacent paragraph in the paragraph group to which this paragraph belongs - 1]; when no next adjacent paragraph exists for the paragraph in its paragraph group, the management interval of the paragraph is: [paragraph number of this paragraph, paragraph number of this paragraph].
For example, if the paragraph number of a paragraph is 1 and the paragraph number of the next adjacent paragraph within the same paragraph group is 4, the management interval of the paragraph is [1,3]; if the paragraph number of a paragraph is 6 and the paragraph has no next adjacent paragraph in the same paragraph group, the management interval of the paragraph is [6,6].
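The grouping of Step S303 and the interval rule of Step S304 can be sketched together as follows. The function name and the representation of a format attribute as a string key are assumptions made for illustration only.

```python
from collections import defaultdict


def management_intervals(paragraphs):
    """Compute the management interval of each selected paragraph.

    paragraphs: list of (paragraph_number, format_key) pairs, where
    format_key is a hypothetical stand-in for the format attribute.
    Returns a dict mapping paragraph_number to (start, end).
    """
    # Step S303: paragraphs with the same format attribute form one group.
    groups = defaultdict(list)
    for number, format_key in paragraphs:
        groups[format_key].append(number)

    # Step S304: the interval ends just before the next paragraph of the
    # same group, or at the paragraph itself when there is no next one.
    intervals = {}
    for numbers in groups.values():
        numbers.sort()
        for i, number in enumerate(numbers):
            if i + 1 < len(numbers):
                intervals[number] = (number, numbers[i + 1] - 1)
            else:
                intervals[number] = (number, number)
    return intervals


# Paragraphs 1 and 4 share a format attribute; paragraph 6 is alone in its group.
result = management_intervals([(1, "A"), (4, "A"), (6, "B")])
print(result[1], result[6])  # (1, 3) (6, 6)
```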
Step S305: the hierarchical relationship between the selected paragraphs is obtained in the order of the paragraph numbers of the selected paragraphs, according to the management intervals and the format attributes of the selected paragraphs.
In one implementation, the hierarchical relationship between each two adjacent paragraphs among the selected paragraphs is obtained, in the order of the paragraph numbers of the selected paragraphs, in the following manner:
Step 1: an interval relationship between the management interval of the first paragraph and the management interval of the second paragraph is determined.
Here, the first paragraph and the second paragraph are two paragraphs that are adjacent in the order of paragraph numbers, and the second paragraph comes after the first paragraph in that order. The relationship between intervals falls into three types: a separation relationship, an intersection relationship and an inclusion relationship. In this scheme, the intersection relationship and the inclusion relationship are collectively called a non-separation relationship.
For example, if the management interval of the first paragraph is [1,1] and the management interval of the second paragraph is [2,2], the two intervals have no overlapping portion, so the interval relationship between the first paragraph and the second paragraph is a separation relationship; if the management interval of the first paragraph is [1,5] and the management interval of the second paragraph is [2,2], the management interval of the second paragraph is completely contained in that of the first paragraph, so the interval relationship is an inclusion relationship, that is, a non-separation relationship; if the management interval of the first paragraph is [1,2] and the management interval of the second paragraph is [2,3], the two management intervals partially coincide, so the interval relationship is an intersection relationship, that is, a non-separation relationship.
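The three interval relationships of Step 1 can be sketched as a small classifier over closed integer intervals. The function names and the string labels are hypothetical; only the three-way distinction comes from the text.

```python
def interval_relation(first, second):
    """Classify two closed intervals given as (start, end) tuples."""
    a_start, a_end = first
    b_start, b_end = second
    # No shared paragraph number: separation relationship.
    if a_end < b_start or b_end < a_start:
        return "separation"
    # One interval lies completely inside the other: inclusion relationship.
    if (a_start <= b_start and b_end <= a_end) or (
        b_start <= a_start and a_end <= b_end
    ):
        return "inclusion"
    # Partial overlap: intersection relationship.
    return "intersection"


def is_non_separated(first, second):
    # Intersection and inclusion are both treated as non-separation.
    return interval_relation(first, second) != "separation"


print(interval_relation((1, 1), (2, 2)))  # separation
print(interval_relation((1, 5), (2, 2)))  # inclusion
print(interval_relation((1, 2), (2, 3)))  # intersection
```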
Step 2:
First case: the interval relation between the management interval of the first paragraph and the management interval of the second paragraph is a separation relation:
(1) Judging whether the format attribute of the first paragraph is the same as the format attribute of the second paragraph.
In one implementation, determining whether the format attribute of the first paragraph is the same as the format attribute of the second paragraph is accomplished by:
First, it is determined whether the first paragraph and the second paragraph are both numbered.
If both the first paragraph and the second paragraph are numbered, whether the format attribute of the first paragraph is the same as that of the second paragraph is judged according to the number format of the first paragraph and the number format of the second paragraph. If the number formats are the same, the format attributes of the first paragraph and the second paragraph are judged to be the same.
If the two paragraphs are not both numbered, that is, neither paragraph is numbered, or only one of them is numbered, whether the format attribute of the first paragraph is the same as that of the second paragraph is judged according to the text settings of the paragraphs. If the text settings of the paragraphs are the same, the format attributes of the first paragraph and the second paragraph are judged to be the same.
In one case, the text setting of a paragraph includes the word size, whether the text is centered, and whether the text is bold; that is, the text settings of two paragraphs are the same when their word size, centering and bold settings are all the same.
(2) If the format attribute of the first paragraph is the same as the format attribute of the second paragraph, the hierarchical relationship between the first paragraph and the second paragraph is determined to be: same-level paragraphs.
If not, a similar paragraph is searched for, where a similar paragraph is: a paragraph among the selected paragraphs that, in the order of paragraph numbers, comes before the first paragraph and has the same format attribute as the second paragraph.
If a similar paragraph exists, the hierarchical relationship between the second paragraph and the similar paragraph is determined to be: same level.
If no similar paragraph exists, the hierarchical relationship between the first paragraph and the second paragraph is determined to be: the paragraph with the smaller paragraph number is the upper-level paragraph of the paragraph with the larger paragraph number.
In one implementation, when searching for a similar paragraph, the preceding paragraphs are searched recursively one by one in the order of paragraph numbers, starting from the paragraph immediately before the first paragraph.
Second case:
the interval relation between the management interval of the first paragraph and the management interval of the second paragraph is a non-separation relation:
the step of searching for a similar paragraph is executed:
if a similar paragraph exists, the hierarchical relationship between the second paragraph and the similar paragraph is determined to be: same level.
If no similar paragraph exists, the hierarchical relationship between the first paragraph and the second paragraph is determined to be: the paragraph with the smaller paragraph number is the upper-level paragraph of the paragraph with the larger paragraph number.
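The two cases of Step 2 can be sketched as a single decision function over adjacent title paragraphs. This is a sketch under stated assumptions: the `Title` class, its field names, the numbering-format strings and the returned labels are all hypothetical, and the backward search is written as a loop rather than a recursion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Title:
    number: int                      # paragraph number
    interval: tuple[int, int]        # management interval (start, end)
    numbering_format: Optional[str]  # None when the paragraph is unnumbered
    text_setting: tuple              # (word size, centered, bold)


def same_format(a: Title, b: Title) -> bool:
    # Both numbered: compare numbering formats.
    if a.numbering_format is not None and b.numbering_format is not None:
        return a.numbering_format == b.numbering_format
    # Otherwise (neither or only one numbered): compare text settings.
    return a.text_setting == b.text_setting


def separated(a: Title, b: Title) -> bool:
    return a.interval[1] < b.interval[0] or b.interval[1] < a.interval[0]


def relate(titles: list[Title], i: int):
    """Relate titles[i] (first) and titles[i + 1] (second).

    titles must be sorted by paragraph number. Returns
    ("same_level", n) or ("child_of", n), where n is a paragraph number.
    """
    first, second = titles[i], titles[i + 1]
    # First case: separated intervals and identical format attributes.
    if separated(first, second) and same_format(first, second):
        return ("same_level", first.number)
    # Otherwise search backward, starting just before the first paragraph,
    # for a paragraph whose format attribute matches the second paragraph.
    for earlier in reversed(titles[:i]):
        if same_format(earlier, second):
            return ("same_level", earlier.number)
    # No similar paragraph: the first paragraph is the upper level.
    return ("child_of", first.number)


heading = (16, True, True)
item = (14, False, True)
titles = [
    Title(1, (1, 3), None, heading),
    Title(2, (2, 2), "%1.", item),
    Title(3, (3, 3), "%1.", item),
    Title(4, (4, 4), None, heading),
]
print([relate(titles, i) for i in range(len(titles) - 1)])
```

In this small example, paragraph 2 becomes a child of paragraph 1, paragraph 3 is at the same level as paragraph 2, and paragraph 4 is matched back to paragraph 1 by the backward search.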
Step S306: a catalog of the paragraphs of the catalog to be generated is generated according to the hierarchical relationship.
Step S301 to step S302 are the same as step S101 to step S102 in the embodiment of the invention shown in fig. 1, and step S306 is the same as step S104 in the embodiment of the invention shown in fig. 1, and will not be described in detail here.
From the above, when the catalog is generated by applying the scheme provided by the embodiment of the invention, the paragraphs serving as titles are screened out by the obtained paragraph identifiers and paragraph formats of the paragraphs of each catalog to be generated, then the management interval is divided for each paragraph according to the paragraph number of the selected paragraph, and then the hierarchical relationship among the selected paragraphs is obtained according to the format attribute of each paragraph and the relationship between the management intervals, so that the catalog is automatically generated, the generation efficiency of the catalog is improved, and the user experience is improved.
For ease of understanding, the directory generation method shown in fig. 3 is explained below by way of a specific example.
Fig. 4 shows the catalog generated by applying the scheme provided by the embodiment of the present invention.
All the titles in the catalog shown in fig. 4 are paragraphs that are selected as titles from all the paragraphs of the catalog to be generated according to the paragraph identification and paragraph format.
1. The paragraphs are divided into paragraph groups according to their format properties.
It can be seen that "hierarchical summary", "purpose", "conclusion", "algorithm", "validation" and "notice" are a paragraph group, "1. Automatic test" and "2. Manual test" are a paragraph group, "1.1. Sample source", "1.2. Comparative data" and "1.3 scene" are a paragraph group, "2.1. Sample source", "2.2. Method" and "2.3 conclusion" are a paragraph group.
2. A management interval for each paragraph in each paragraph group is determined.
The management interval of "hierarchical summary" is [1,1]; the management interval of "purpose" is [2,2]; the management interval of "conclusion" is [3,3]; the management interval of "algorithm" is [4,4]; the management interval of "validation" is [5,13]; the management interval of "notice" is [14,14];
the management interval of "1. Automatic test" is [6,9]; the management interval of "2. Manual test" is [10,10];
the management interval of "1.1. Sample source" is [7,7]; the management interval of "1.2. Comparative data" is [8,8]; the management interval of "1.3 scene" is [9,9];
the management interval of "2.1. Sample source" is [11,11]; the management interval of "2.2. Method" is [12,12]; the management interval of "2.3 conclusion" is [13,13].
3. A hierarchical relationship between the selected paragraphs is obtained from the management section of the selected paragraphs and the format attribute of the selected paragraphs.
The management intervals of "hierarchical summary" and "purpose", of "purpose" and "conclusion", of "conclusion" and "algorithm", of "algorithm" and "validation", of "1.1. Sample source" and "1.2. Comparative data", of "1.2. Comparative data" and "1.3 scene", of "2.1. Sample source" and "2.2. Method", and of "2.2. Method" and "2.3 conclusion" are separated, and the format attributes within each pair are the same, so the paragraphs in each pair are at the same level;
the management intervals of "validation" and "1. Automatic test", and of "1. Automatic test" and "1.1. Sample source", are non-separated, so paragraphs with the same format attribute are searched for recursively. No similar paragraph precedes "1. Automatic test" or "1.1. Sample source", so "validation" is the upper level of "1. Automatic test", and "1. Automatic test" is the upper level of "1.1. Sample source";
the management intervals of "2. Manual test" and "2.1. Sample source" are non-separated, so paragraphs with the same format attribute are searched for recursively. "2.1. Sample source" and "1.3 scene" have the same format attribute, so "2.1. Sample source" and "1.3 scene" are at the same level;
the management intervals of "1.3 scene" and "2. Manual test", and of "2.3 conclusion" and "notice", are separated, and the format attributes within each pair are different, so paragraphs with the same format attribute are searched for recursively. "2. Manual test" has the same format attribute as "1. Automatic test", and "notice" has the same format attribute as "validation"; therefore "2. Manual test" is at the same level as "1. Automatic test", and "notice" is at the same level as "validation".
4. According to the hierarchical relationship, a directory of paragraphs of the directory to be generated is generated, i.e., the result as shown in fig. 4.
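The interval arithmetic of item 2 in the example above can be checked with a short sketch. The group labels are hypothetical; the paragraph numbers per group are read off the intervals listed in the example (for instance, "validation" is paragraph 5 and "notice" is paragraph 14).

```python
# Paragraph numbers of each paragraph group in the example, in document order.
# The dictionary keys are illustrative labels, not terms from the source.
groups = {
    "unnumbered": [1, 2, 3, 4, 5, 14],  # "hierarchical summary" ... "notice"
    "level-1": [6, 10],                 # "1. Automatic test", "2. Manual test"
    "1.x": [7, 8, 9],                   # "1.1. Sample source" ... "1.3 scene"
    "2.x": [11, 12, 13],                # "2.1. Sample source" ... "2.3 conclusion"
}


def intervals_for(numbers):
    """Management intervals for the paragraph numbers of one group."""
    ordered = sorted(numbers)
    result = {}
    # Pair each paragraph with its successor in the group (None for the last).
    for current, following in zip(ordered, ordered[1:] + [None]):
        end = current if following is None else following - 1
        result[current] = (current, end)
    return result


intervals = {}
for numbers in groups.values():
    intervals.update(intervals_for(numbers))

for number in sorted(intervals):
    print(number, intervals[number])
```

Running the sketch yields [5,13] for "validation" and [6,9] for "1. Automatic test", matching the intervals listed in the example.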
Corresponding to the catalog generation method described above, the embodiment of the invention also provides a catalog generating device.
Fig. 5 is a schematic structural diagram of a catalog generating device according to an embodiment of the present invention, where the device includes:
a paragraph information obtaining module 510, configured to obtain the paragraph format, format attribute, paragraph number and paragraph identifier of each paragraph of the catalog to be generated in the document.
A paragraph screening module 520, configured to select paragraphs serving as titles from the paragraphs of the catalog to be generated according to the paragraph identifiers and the paragraph formats.
A hierarchical analysis module 530, configured to obtain the hierarchical relationship between the selected paragraphs according to the paragraph numbers and format attributes of the selected paragraphs.
A catalog generation module 540, configured to generate a catalog of the paragraphs of the catalog to be generated according to the hierarchical relationship.
As can be seen from the above, in the solution provided in the embodiment of the present invention, the paragraph information obtaining module 510 obtains the paragraph format, format attribute, paragraph number and paragraph identifier of each paragraph of the catalog to be generated in the document, the paragraph screening module 520 selects the paragraphs serving as titles from the paragraphs of the catalog to be generated, the hierarchical analysis module 530 determines the hierarchical structure of the paragraphs, and finally the catalog generation module 540 automatically generates the catalog, thereby improving the efficiency of catalog generation and the user experience.
Fig. 6 is a schematic structural diagram of another catalog generating apparatus according to an embodiment of the present invention, where the apparatus includes:
the paragraph information obtaining module 610 is configured to obtain a paragraph format, a format attribute, a paragraph number, and a paragraph identifier of each paragraph of the catalog to be generated in the document.
Paragraph screening module 620, comprising:
a first filtering sub-module 621, configured to determine, among the paragraphs of the catalog to be generated, the paragraphs whose paragraph identifiers do not belong to the preset non-title paragraph identifiers.
In one implementation, the preset non-title paragraph identifiers include: paragraph identifiers representing sub-documents, paragraph identifiers representing tables, paragraph identifiers representing directory fields, paragraph identifiers representing pictures, and paragraph identifiers representing blank paragraphs.
A second filtering sub-module 622 is configured to select a paragraph as a title from the determined paragraphs according to the paragraph format.
In one implementation, the second filtering sub-module 622 includes:
a predicted value calculation unit 622(a), configured to calculate, according to the paragraph format, the predicted value of each determined paragraph being a title;
the predicted value calculation unit 622(a) is specifically configured to perform the following:
in one implementation, the paragraph format of a paragraph includes: number format, word size, last character of the text, and text length.
Calculating, according to the word size of the text in each paragraph, the word size difference between each determined paragraph and the preset title word size.
Obtaining the predicted value corresponding to each prediction element of each determined paragraph according to the following expression, wherein the prediction elements of a paragraph comprise: the paragraph numbering format, the word size difference, the last character of the text in the paragraph and the length of the text in the paragraph:
predicted value corresponding to a prediction element = preset weight of the prediction element + preset offset of the prediction element.
Calculating, based on the obtained predicted values, the predicted value of each determined paragraph being a title.
The title selecting unit 622 (b) is configured to select a paragraph as a title from the determined paragraphs according to the determined predicted value of each paragraph.
The hierarchical analysis module 630 is configured to obtain a hierarchical relationship between the selected paragraphs according to the paragraph numbers and format attributes of the selected paragraphs.
The catalog generation module 640 is configured to generate a catalog of the paragraphs of the catalog to be generated according to the hierarchical relationship.
As can be seen from the above, when the scheme provided by the embodiment of the present invention is applied to generate a catalog in a document, the first filtering sub-module 621 filters out the paragraphs that do not belong to the preset non-title paragraphs, the second filtering sub-module 622 filters out the paragraphs that are the title according to the paragraph format of each paragraph, and then the hierarchical analysis module 630 obtains the hierarchical relationship between the selected paragraphs according to the paragraph number and the format attribute of the selected paragraphs, and the catalog generation module 640 automatically generates the catalog, thereby improving the generation efficiency of the catalog and improving the experience of the user.
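The first screening step performed by the first filtering sub-module 621 can be sketched as a simple membership test. The enumeration and function names are hypothetical stand-ins for the paragraph identifiers; the source does not describe how identifiers are encoded.

```python
from enum import Enum, auto


class ParagraphKind(Enum):
    """Hypothetical encoding of paragraph identifiers."""
    TEXT = auto()
    SUB_DOCUMENT = auto()
    TABLE = auto()
    DIRECTORY_FIELD = auto()
    PICTURE = auto()
    BLANK = auto()


# Preset non-title paragraph identifiers, as listed in the text.
NON_TITLE_KINDS = frozenset({
    ParagraphKind.SUB_DOCUMENT,
    ParagraphKind.TABLE,
    ParagraphKind.DIRECTORY_FIELD,
    ParagraphKind.PICTURE,
    ParagraphKind.BLANK,
})


def candidate_paragraphs(paragraphs):
    """Keep paragraphs whose identifier is not a preset non-title identifier.

    paragraphs: iterable of (paragraph_number, ParagraphKind) pairs.
    Returns the paragraph numbers that remain candidates for titles.
    """
    return [number for number, kind in paragraphs if kind not in NON_TITLE_KINDS]


document = [
    (1, ParagraphKind.TEXT),
    (2, ParagraphKind.TABLE),
    (3, ParagraphKind.BLANK),
    (4, ParagraphKind.TEXT),
    (5, ParagraphKind.PICTURE),
]
print(candidate_paragraphs(document))  # [1, 4]
```

Only the surviving candidates would then be passed to the second filtering sub-module 622 for scoring by paragraph format.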
Fig. 7 is a schematic structural diagram of another catalog generating device according to an embodiment of the present invention, where the device includes:
a paragraph information obtaining module 710, configured to obtain the paragraph format, format attribute, paragraph number and paragraph identifier of each paragraph of the catalog to be generated in the document.
A paragraph screening module 720, configured to select paragraphs serving as titles from the paragraphs of the catalog to be generated according to the paragraph identifiers and the paragraph formats.
A hierarchy analysis module 730 comprising:
a grouping sub-module 731, configured to divide the selected paragraphs into paragraph groups according to the format attributes of the paragraphs.
A section dividing sub-module 732, configured to determine the management interval of each paragraph in each paragraph group according to the paragraph number and the following expression:
when a next adjacent paragraph exists for a paragraph in its paragraph group, the management interval of the paragraph is: [paragraph number of this paragraph, paragraph number of the next adjacent paragraph in the paragraph group to which this paragraph belongs - 1]; when no next adjacent paragraph exists for the paragraph in its paragraph group, the management interval of the paragraph is: [paragraph number of this paragraph, paragraph number of this paragraph].
A hierarchy dividing sub-module 733, configured to obtain the hierarchical relationship between the selected paragraphs according to the management intervals and the format attributes of the selected paragraphs.
Specifically, an interval relation between the management interval of a first paragraph and the management interval of a second paragraph is determined, wherein the first paragraph and the second paragraph are two paragraphs among the selected paragraphs that are adjacent in the order of paragraph numbers, and the second paragraph comes after the first paragraph in that order;
when the interval relation is a separation relation, it is judged whether the format attribute of the first paragraph is the same as the format attribute of the second paragraph;
if they are the same, the hierarchical relationship between the first paragraph and the second paragraph is determined to be: same-level paragraphs;
if not, a similar paragraph is searched for, wherein a similar paragraph is: a paragraph among the selected paragraphs that, in the order of paragraph numbers, comes before the first paragraph and has the same format attribute as the second paragraph; if a similar paragraph exists, the hierarchical relationship between the second paragraph and the similar paragraph is determined to be: same level; if no similar paragraph exists, the hierarchical relationship between the first paragraph and the second paragraph is determined to be: the paragraph with the smaller paragraph number is the upper-level paragraph of the paragraph with the larger paragraph number;
and when the interval relation is a non-separation relation, the step of searching for a similar paragraph is executed.
In one implementation, the hierarchical partitioning submodule determines whether the format attribute of the first paragraph is the same as the format attribute of the second paragraph, including:
judging whether the first section and the second section are numbered or not;
if the first paragraph and the second paragraph are numbered, judging whether the format attribute of the first paragraph is the same as the format attribute of the second paragraph according to the number format of the first paragraph and the number format of the second paragraph;
if the first paragraph and the second paragraph are not both numbered, judging whether the format attribute of the first paragraph is the same as the format attribute of the second paragraph according to the text settings of the paragraphs.
The catalog generation module 740 is configured to generate a catalog of the paragraphs of the catalog to be generated according to the hierarchical relationship.
As can be seen from the above, when the scheme provided by the embodiment of the present invention is applied to generate a catalog, the paragraph information obtaining module 710 obtains the paragraph identifier and the paragraph format of each paragraph of the catalog to be generated, the paragraph screening module 720 selects the paragraphs serving as titles, the grouping sub-module 731 and the section dividing sub-module 732 determine a management interval for each paragraph according to the paragraph numbers of the selected paragraphs, the hierarchy dividing sub-module 733 obtains the hierarchical relationship between the selected paragraphs according to the format attribute of each paragraph and the relationship between the management intervals, and the catalog generation module 740 automatically generates the catalog, thereby improving the efficiency of catalog generation and the user experience.
The embodiment of the present invention further provides an electronic device, as shown in fig. 8, including a processor 801, a communication interface 802, a memory 803, and a communication bus 804, where the processor 801, the communication interface 802, and the memory 803 communicate with one another through the communication bus 804;
the memory 803 is configured to store a computer program;
the processor 801, when executing the program stored in the memory 803, implements the following steps:
obtaining paragraph formats, format attributes, paragraph numbers and paragraph identifiers of paragraphs of each catalog to be generated in the document;
selecting paragraphs serving as titles from the paragraphs of the catalog to be generated according to paragraph identifiers and paragraph formats;
obtaining the hierarchical relationship between the selected paragraphs according to the paragraph numbers and format attributes of the selected paragraphs;
and generating the catalogue of the paragraphs of the catalogue to be generated according to the hierarchical relation.
The communication bus of the electronic device mentioned above may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one bold line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include random access memory (Random Access Memory, RAM), or may include non-volatile memory (Non-Volatile Memory, NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In the scheme provided by the embodiment of the invention, the paragraph format, format attribute, paragraph number and paragraph identifier of each paragraph of the catalog to be generated in the document are obtained, the paragraphs serving as titles are selected from the paragraphs of the catalog to be generated, the hierarchical structure of the paragraphs is determined, and the catalog is generated automatically, so that the efficiency of catalog generation and the user experience are improved.
In yet another embodiment of the present invention, a computer readable storage medium is provided, in which instructions are stored, which when run on a computer, cause the computer to perform the catalog generation method of any one of the embodiments described above.
In a further embodiment of the present invention, a computer program product comprising instructions which, when run on a computer, cause the computer to perform the catalog generation method of any one of the embodiments described above is also provided.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), etc.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a related manner; for identical or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding parts of the description of the method embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (11)

1. A catalog generation method, the method comprising:
obtaining a paragraph format, a format attribute, a paragraph number and a paragraph identifier of each paragraph for which a catalog is to be generated in a document, wherein the paragraph identifier reflects the content of the paragraph;
selecting paragraphs to serve as titles from the paragraphs for which the catalog is to be generated, according to the paragraph identifiers and the paragraph formats;
obtaining a hierarchical relationship between the selected paragraphs according to the paragraph numbers and format attributes of the selected paragraphs;
generating a catalog for the paragraphs for which the catalog is to be generated, according to the hierarchical relationship;
wherein selecting paragraphs to serve as titles from the paragraphs for which the catalog is to be generated, according to the paragraph identifiers and the paragraph formats, comprises:
determining, among the paragraphs for which the catalog is to be generated, the paragraphs whose paragraph identifiers do not belong to preset non-title paragraph identifiers;
calculating, according to the paragraph format, a predicted value of each determined paragraph being a title;
selecting paragraphs to serve as titles from the determined paragraphs according to the predicted value of each determined paragraph;
wherein calculating, according to the paragraph format, the predicted value of each determined paragraph being a title comprises:
calculating, according to the font size of the text in each paragraph, the font-size difference between each determined paragraph and a preset title font size;
obtaining a predicted value corresponding to each predicted element of each determined paragraph according to the following expression, wherein the predicted elements of a paragraph comprise at least one of the following: the paragraph numbering format, the font-size difference, the last character of the text in the paragraph, and the length of the text in the paragraph:
predicted value corresponding to a predicted element = preset weight of the predicted element + preset offset of the predicted element;
calculating, based on the obtained predicted values, the predicted value of each determined paragraph being a title.
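As a non-limiting illustration of the title selection recited in claim 1, the following Python sketch first discards paragraphs that carry non-title identifiers and then scores the remaining paragraphs. The identifier strings, the values in PRESETS, the threshold, the way each predicted element is turned into a number, and the assumption that a weight multiplies the element's value and that the per-element values are summed are all hypothetical; the claim itself names only a preset weight and a preset offset for each predicted element.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical identifiers of paragraphs that are never titles.
NON_TITLE_IDS = {"sub_document", "table", "toc_field", "picture", "blank"}

# Hypothetical preset (weight, offset) pair for each predicted element.
PRESETS = {
    "numbered": (2.0, 0.0),
    "size_diff": (-0.5, 1.0),
    "ends_with_punct": (-2.0, 0.0),
    "length": (-0.02, 1.0),
}

@dataclass
class Paragraph:
    number: int                      # paragraph number in the document
    identifier: str                  # reflects the content of the paragraph
    text: str
    font_size: float
    numbering: Optional[str] = None  # e.g. "1.2"; None if unnumbered

def element_values(p, title_size):
    """Turn the predicted elements of a paragraph into numbers (assumed encoding)."""
    ends = p.text.rstrip().endswith(tuple(".;,。；，"))
    return {
        "numbered": 1.0 if p.numbering else 0.0,
        "size_diff": abs(p.font_size - title_size),
        "ends_with_punct": 1.0 if ends else 0.0,
        "length": float(len(p.text)),
    }

def title_score(p, title_size=16.0):
    """Sum weight * value + offset over all predicted elements (assumed form)."""
    values = element_values(p, title_size)
    return sum(PRESETS[k][0] * v + PRESETS[k][1] for k, v in values.items())

def select_titles(paragraphs, threshold=1.0, title_size=16.0):
    """Filter by identifier, then keep paragraphs whose score reaches the threshold."""
    candidates = [p for p in paragraphs if p.identifier not in NON_TITLE_IDS]
    return [p for p in candidates if title_score(p, title_size) >= threshold]
```

In this sketch a short, numbered paragraph set in the preset title font size scores high, while a long body paragraph ending in punctuation scores low; real weights and offsets would be preset for the document type.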
2. The method of claim 1, wherein the non-title paragraph identifiers comprise:
paragraph identifiers representing sub-documents, paragraph identifiers representing tables, paragraph identifiers representing directory fields, paragraph identifiers representing pictures, and paragraph identifiers representing blank paragraphs.
3. The method of claim 1, wherein obtaining the hierarchical relationship between the selected paragraphs according to the paragraph numbers and format attributes of the selected paragraphs comprises:
dividing the selected paragraphs into paragraph groups according to the format attributes of the paragraphs;
determining the management interval of each paragraph in each paragraph group according to the paragraph numbers and the following expression:
when a next adjacent paragraph exists in the paragraph group to which a paragraph belongs, the management interval of the paragraph is: [paragraph number of the paragraph, paragraph number of the next adjacent paragraph in the paragraph group to which the paragraph belongs - 1]; when no such paragraph exists in the paragraph group, the management interval of the paragraph is: [paragraph number of the paragraph, paragraph number of the paragraph];
obtaining the hierarchical relationship between the selected paragraphs in the paragraph-number order of the selected paragraphs, according to the management intervals of the selected paragraphs and the format attributes of the selected paragraphs.
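The grouping and management-interval steps of claim 3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the representation of a selected paragraph as a (paragraph number, format attribute) tuple, and the string labels used for format attributes are hypothetical. The last paragraph of each group receives the single-point interval given by the expression as worded in the claim.

```python
from collections import defaultdict

def management_intervals(selected):
    """Map each paragraph number to its closed management interval.

    selected: iterable of (paragraph_number, format_attribute) tuples
    (an assumed representation of the selected title paragraphs).
    """
    # Divide the selected paragraphs into groups by format attribute.
    groups = defaultdict(list)
    for number, fmt in selected:
        groups[fmt].append(number)

    intervals = {}
    for numbers in groups.values():
        numbers.sort()
        for i, n in enumerate(numbers):
            if i + 1 < len(numbers):
                # A next adjacent paragraph exists in the same group.
                intervals[n] = (n, numbers[i + 1] - 1)
            else:
                # No next paragraph in the group.
                intervals[n] = (n, n)
    return intervals
```

For example, with selected paragraphs numbered 1 and 8 in one group and 2, 5 and 9 in another, paragraph 1 manages the interval [1, 7] and paragraph 5 manages [5, 8].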
4. The method of claim 3, wherein obtaining the hierarchical relationship between the selected paragraphs in the paragraph-number order of the selected paragraphs, according to the management intervals of the selected paragraphs and the format attributes of the selected paragraphs, comprises:
obtaining, in the paragraph-number order of the selected paragraphs, the hierarchical relationship between each two adjacent paragraphs among the selected paragraphs in the following manner:
determining an interval relation between the management interval of a first paragraph and the management interval of a second paragraph, wherein the first paragraph and the second paragraph are two paragraphs that are adjacent when the selected paragraphs are arranged in paragraph-number order, and the second paragraph follows the first paragraph in that order;
when the interval relation is a separation relation, judging whether the format attribute of the first paragraph is the same as the format attribute of the second paragraph;
if they are the same, determining that the hierarchical relationship between the first paragraph and the second paragraph is: peer paragraphs;
if they are not the same, searching for a similar paragraph, wherein the similar paragraph is: a paragraph among the selected paragraphs that precedes the first paragraph in paragraph-number order and whose format attribute is the same as that of the second paragraph; if the similar paragraph exists, determining that the hierarchical relationship between the second paragraph and the similar paragraph is: peer paragraphs; if the similar paragraph does not exist, determining that the hierarchical relationship between the first paragraph and the second paragraph is: the paragraph with the smaller paragraph number is the upper-level paragraph of the paragraph with the larger paragraph number;
when the interval relation is a non-separation relation, executing the step of searching for a similar paragraph.
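The decision procedure of claim 4 for two adjacent selected paragraphs can be sketched in Python as follows. The sketch rests on several assumptions that the claim does not state: a separation relation is taken to mean that the two closed intervals share no paragraph number, the similar paragraph is taken to be the nearest preceding paragraph with a matching format attribute, and the function names, the returned labels "peer_of" and "child_of", and the dictionary inputs are hypothetical.

```python
def disjoint(a, b):
    """True when closed intervals a and b share no paragraph number."""
    return a[1] < b[0] or b[1] < a[0]

def relate(first, second, earlier, intervals, formats):
    """Classify `second` relative to the paragraphs before it.

    first, second: paragraph numbers of two adjacent selected paragraphs
    earlier: numbers of the selected paragraphs preceding `first`, ascending
    intervals: paragraph number -> (start, end) management interval
    formats: paragraph number -> format attribute
    Returns ("peer_of", n) or ("child_of", n).
    """
    separated = disjoint(intervals[first], intervals[second])
    if separated and formats[first] == formats[second]:
        return ("peer_of", first)

    # Otherwise search for a similar paragraph: an earlier selected paragraph
    # whose format attribute matches that of the second paragraph.
    for n in reversed(earlier):  # assumption: nearest match wins
        if formats[n] == formats[second]:
            return ("peer_of", n)

    # No similar paragraph: the smaller-numbered paragraph is the
    # previous-level paragraph of the larger-numbered one.
    return ("child_of", first)
```

With the intervals {1: (1, 7), 2: (2, 4), 5: (5, 8), 8: (8, 8), 9: (9, 9)} and two format attributes, paragraph 2 becomes a child of paragraph 1, paragraph 5 a peer of paragraph 2, and paragraph 8 a peer of paragraph 1.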
5. The method of claim 4, wherein judging whether the format attribute of the first paragraph is the same as the format attribute of the second paragraph comprises:
judging whether the first paragraph and the second paragraph are numbered;
if the first paragraph and the second paragraph are both numbered, judging whether the format attribute of the first paragraph is the same as the format attribute of the second paragraph according to the numbering format of the first paragraph and the numbering format of the second paragraph;
if neither the first paragraph nor the second paragraph is numbered, judging whether the format attribute of the first paragraph is the same as the format attribute of the second paragraph according to the text setting of the first paragraph and the text setting of the second paragraph.
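The comparison of claim 5 can be sketched in Python as follows. The classes, the fields chosen to represent a text setting, and the way a numbering string is reduced to a numbering format are hypothetical. The claim does not address the case in which only one of the two paragraphs is numbered; the sketch assumes that such paragraphs have different format attributes.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TextSetting:
    font: str
    size: float
    bold: bool

@dataclass
class Heading:
    text_setting: TextSetting
    numbering: Optional[str] = None  # e.g. "1.2" or "(a)"; None if unnumbered

def numbering_format(numbering):
    """Abstract a numbering string: "1.2" -> "N.N", "(a)" -> "(A)"."""
    return re.sub(
        r"\d+|[A-Za-z]+",
        lambda m: "N" if m.group().isdigit() else "A",
        numbering,
    )

def same_format_attribute(a, b):
    """Compare two title paragraphs in the manner described in claim 5."""
    if a.numbering and b.numbering:
        # Both numbered: compare numbering formats only.
        return numbering_format(a.numbering) == numbering_format(b.numbering)
    if not a.numbering and not b.numbering:
        # Neither numbered: compare text settings.
        return a.text_setting == b.text_setting
    # Assumption: a numbered and an unnumbered paragraph differ.
    return False
```

Under these assumptions, paragraphs numbered "1.1" and "2.3" share a format attribute even if their fonts differ, whereas "1.1" and "(a)" do not.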
6. A catalog generation apparatus, the apparatus comprising:
a paragraph information acquisition module, configured to obtain a paragraph format, a format attribute, a paragraph number and a paragraph identifier of each paragraph for which a catalog is to be generated in a document, wherein the paragraph identifier reflects the content of the paragraph;
a paragraph screening module, configured to select paragraphs to serve as titles from the paragraphs for which the catalog is to be generated, according to the paragraph identifiers and the paragraph formats;
a hierarchy analysis module, configured to obtain a hierarchical relationship between the selected paragraphs according to the paragraph numbers and format attributes of the selected paragraphs;
a catalog generation module, configured to generate a catalog for the paragraphs for which the catalog is to be generated, according to the hierarchical relationship;
wherein the paragraph screening module comprises:
a first screening sub-module, configured to determine, among the paragraphs for which the catalog is to be generated, the paragraphs whose paragraph identifiers do not belong to preset non-title paragraph identifiers;
a second screening sub-module, configured to calculate, according to the paragraph format, a predicted value of each determined paragraph being a title, and to select paragraphs to serve as titles from the determined paragraphs according to the predicted value of each determined paragraph;
wherein the second screening sub-module is specifically configured to:
calculate, according to the font size of the text in each paragraph, the font-size difference between each determined paragraph and a preset title font size;
obtain a predicted value corresponding to each predicted element of each determined paragraph according to the following expression, wherein the predicted elements of a paragraph comprise at least one of the following: the paragraph numbering format, the font-size difference, the last character of the text in the paragraph, and the length of the text in the paragraph:
predicted value corresponding to a predicted element = preset weight of the predicted element + preset offset of the predicted element;
calculate, based on the obtained predicted values, the predicted value of each determined paragraph being a title.
7. The apparatus of claim 6, wherein the non-title paragraph identifiers comprise:
paragraph identifiers representing sub-documents, paragraph identifiers representing tables, paragraph identifiers representing directory fields, paragraph identifiers representing pictures, and paragraph identifiers representing blank paragraphs.
8. The apparatus of claim 6, wherein the hierarchy analysis module comprises:
a grouping sub-module, configured to divide the selected paragraphs into paragraph groups according to the format attributes of the paragraphs;
an interval dividing sub-module, configured to determine the management interval of each paragraph in each paragraph group according to the paragraph numbers and the following expression:
when a next adjacent paragraph exists in the paragraph group to which a paragraph belongs, the management interval of the paragraph is: [paragraph number of the paragraph, paragraph number of the next adjacent paragraph in the paragraph group to which the paragraph belongs - 1]; when no such paragraph exists in the paragraph group, the management interval of the paragraph is: [paragraph number of the paragraph, paragraph number of the paragraph];
a hierarchy dividing sub-module, configured to obtain the hierarchical relationship between the selected paragraphs in the paragraph-number order of the selected paragraphs, according to the management intervals of the selected paragraphs and the format attributes of the selected paragraphs.
9. The apparatus of claim 8, wherein:
the hierarchy dividing sub-module is specifically configured to obtain, in the paragraph-number order of the selected paragraphs, the hierarchical relationship between each two adjacent paragraphs among the selected paragraphs in the following manner:
determining an interval relation between the management interval of a first paragraph and the management interval of a second paragraph, wherein the first paragraph and the second paragraph are two paragraphs that are adjacent when the selected paragraphs are arranged in paragraph-number order, and the second paragraph follows the first paragraph in that order;
when the interval relation is a separation relation, judging whether the format attribute of the first paragraph is the same as the format attribute of the second paragraph;
if they are the same, determining that the hierarchical relationship between the first paragraph and the second paragraph is: peer paragraphs;
if they are not the same, searching for a similar paragraph, wherein the similar paragraph is: a paragraph among the selected paragraphs that precedes the first paragraph in paragraph-number order and whose format attribute is the same as that of the second paragraph; if the similar paragraph exists, determining that the hierarchical relationship between the second paragraph and the similar paragraph is: peer paragraphs; if the similar paragraph does not exist, determining that the hierarchical relationship between the first paragraph and the second paragraph is: the paragraph with the smaller paragraph number is the upper-level paragraph of the paragraph with the larger paragraph number;
when the interval relation is a non-separation relation, executing the step of searching for a similar paragraph.
10. The apparatus of claim 9, wherein the hierarchy dividing sub-module judging whether the format attribute of the first paragraph is the same as the format attribute of the second paragraph comprises:
judging whether the first paragraph and the second paragraph are numbered;
if the first paragraph and the second paragraph are both numbered, judging whether the format attribute of the first paragraph is the same as the format attribute of the second paragraph according to the numbering format of the first paragraph and the numbering format of the second paragraph;
if neither the first paragraph nor the second paragraph is numbered, judging whether the format attribute of the first paragraph is the same as the format attribute of the second paragraph according to the text setting of the first paragraph and the text setting of the second paragraph.
11. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
the memory is configured to store a computer program;
the processor is configured to carry out the method steps of any one of claims 1-5 when executing the program stored in the memory.
CN201711450681.1A 2017-12-27 2017-12-27 Catalog generation method and device Active CN109977366B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711450681.1A CN109977366B (en) 2017-12-27 2017-12-27 Catalog generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711450681.1A CN109977366B (en) 2017-12-27 2017-12-27 Catalog generation method and device

Publications (2)

Publication Number Publication Date
CN109977366A CN109977366A (en) 2019-07-05
CN109977366B CN109977366B (en) 2023-10-31

Family

ID=67071916

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711450681.1A Active CN109977366B (en) 2017-12-27 2017-12-27 Catalog generation method and device

Country Status (1)

Country Link
CN (1) CN109977366B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110427614B (en) * 2019-07-16 2023-08-08 深圳追一科技有限公司 Construction method and device of paragraph level, electronic equipment and storage medium
CN110704573B (en) * 2019-09-04 2023-12-22 平安科技(深圳)有限公司 Catalog storage method, catalog storage device, computer equipment and storage medium
CN113642320A (en) * 2020-04-27 2021-11-12 北京庖丁科技有限公司 Method, device, equipment and medium for extracting document directory structure
CN113723078A (en) * 2021-09-07 2021-11-30 杭州叙简科技股份有限公司 Text logic information structuring method and device and electronic equipment
CN113822023B (en) * 2021-09-10 2023-08-18 厦门盈趣科技股份有限公司 Automatic standard document generation method and system
CN115995087B (en) * 2023-03-23 2023-06-20 杭州实在智能科技有限公司 Document catalog intelligent generation method and system based on fusion visual information

Citations (4)

Publication number Priority date Publication date Assignee Title
CN102375806A (en) * 2010-08-23 2012-03-14 北大方正集团有限公司 Document title extraction method and device
WO2016128310A1 (en) * 2015-02-13 2016-08-18 Valipat Method and system for automatically generating documents on the basis of an index
CN107291677A (en) * 2017-07-14 2017-10-24 北京神州泰岳软件股份有限公司 A kind of PDF document header syntax tree generation method, device, terminal and system
CN107301184A (en) * 2016-04-14 2017-10-27 珠海金山办公软件有限公司 It is a kind of to recognize the method and device that word or file generates catalogue

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
AU2002337423A1 (en) * 2001-08-27 2003-03-10 E-Base, Ltd. Method for defining and optimizing criteria used to detect a contextualy specific concept within a paragraph

Non-Patent Citations (3)

Title
A word-salad filtering algorithm;Jeong, Ok-Ran et al;《LOGIC JOURNAL OF THE IGPL》;第19卷(第5期);第666-678页 *
Word环境下论文格式模板制作;戴德宝;《电脑知识与技术》;20090305(第07期);177-178 *
文档目录轻松做;仲勇 等;《电脑迷》(第10期);第72页 *

Also Published As

Publication number Publication date
CN109977366A (en) 2019-07-05

Similar Documents

Publication Publication Date Title
CN109977366B (en) Catalog generation method and device
CN106598999B (en) Method and device for calculating text theme attribution degree
CN106959976B (en) Search processing method and device
CN109241003B (en) File management method and device
CN112651217B (en) Paper document processing method, paper document processing device, electronic equipment and storage medium
CN109743309B (en) Illegal request identification method and device and electronic equipment
CN111309970A (en) Data retrieval method and device, electronic equipment and storage medium
JP2023501010A (en) A Classification Method for Application Preference Text Based on TextRank
CN111353071A (en) Label generation method and device
CN110147223B (en) Method, device and equipment for generating component library
CN107302444B (en) Enterprise-level search application server cluster automatic capacity expansion method and device
CN111046627B (en) Chinese character display method and system
CN114398315A (en) Data storage method, system, storage medium and electronic equipment
CN112307318A (en) Content publishing method, system and device
CN111078773B (en) Data processing method and device
CN107239568B (en) Distributed index implementation method and device
CN110427496B (en) Knowledge graph expansion method and device for text processing
JP7033115B2 (en) Search processing method and device based on clipboard data
CN110717036B (en) Method and device for removing duplication of uniform resource locator and electronic equipment
CN113407678B (en) Knowledge graph construction method, device and equipment
CN110728113A (en) Information screening method and device of electronic forms and terminal equipment
JP2011175231A (en) Map data
CN107220249B (en) Classification-based full-text search
CN114490651A (en) Data storage method and device
CN109710833B (en) Method and apparatus for determining content node

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant