CN114118074A - Title identification method and device, electronic equipment and storage medium - Google Patents

Title identification method and device, electronic equipment and storage medium

Info

Publication number
CN114118074A
Authority
CN
China
Prior art keywords
title
paragraph
paragraphs
determining
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111340991.4A
Other languages
Chinese (zh)
Inventor
辛洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Office Software Inc
Zhuhai Kingsoft Office Software Co Ltd
Wuhan Kingsoft Office Software Co Ltd
Original Assignee
Beijing Kingsoft Office Software Inc
Zhuhai Kingsoft Office Software Co Ltd
Wuhan Kingsoft Office Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Office Software Inc, Zhuhai Kingsoft Office Software Co Ltd, Wuhan Kingsoft Office Software Co Ltd
Priority to CN202111340991.4A
Publication of CN114118074A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/109 Font handling; Temporal or kinetic typography

Abstract

The invention discloses a title identification method and device, an electronic device, and a storage medium. The method comprises the following steps: acquiring a document to be identified; determining one or more title paragraph sets in the document to be identified according to a first paragraph feature; merging the paragraphs belonging to the same title name in each title paragraph set to obtain one or more title name paragraph sets; and determining a title name paragraph set that meets a preset main title condition as a main title.

Description

Title identification method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer software application technologies, and in particular, to a method and an apparatus for identifying a title, an electronic device, and a storage medium.
Background
Currently, when a user edits a document with document editing software, parts of the document may need to follow a fixed typesetting format, as in administrative official documents. In an administrative official document, the title has a fixed typesetting format: for example, the title is generally set in size-2 Xiaobiao Song typeface, placed two blank lines below the red separator line, and centered on one or more lines.
The user often has to adjust the typesetting format of such fixed-format content, for example a title, by hand, which is tedious and error-prone.
Disclosure of Invention
The embodiment of the invention aims to provide a method, a device, electronic equipment and a storage medium for identifying titles, so as to solve the problem that manual operation of a user is complicated, realize the function of automatically identifying the titles in a document and conveniently and automatically adjust the typesetting format of the titles. The specific technical scheme is as follows.
The embodiment of the invention provides a method for identifying a title, which comprises the following steps:
acquiring a document to be identified;
determining one or more title paragraph sets in the document to be identified according to a first paragraph feature;
merging the paragraphs belonging to the same title name in the title paragraph set to obtain one or more title name paragraph sets;
and determining a title name paragraph set that meets a preset main title condition as a main title.
Optionally, the merging paragraphs belonging to the same title name in the title paragraph set to obtain a title name paragraph set includes:
and collecting the title paragraphs in the title paragraph set that have the same second paragraph feature and consecutive paragraph numbers as a title name paragraph set.
Optionally, the collecting, as a title name paragraph set, the title paragraphs in the title paragraph set that have the same second paragraph feature and consecutive paragraph numbers includes:
traversing the title paragraphs in the set of title paragraphs according to paragraph numbers;
collecting the currently traversed title paragraph and the next title paragraph in the same title name paragraph set in the case that the second paragraph feature of the currently traversed title paragraph is the same as that of the next title paragraph;
collecting the currently traversed title paragraph and the next title paragraph in different title name paragraph sets in the case that the second paragraph feature of the currently traversed title paragraph differs from that of the next title paragraph.
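The traversal described above can be sketched as follows. This is only an illustrative sketch: the `Paragraph` class, the `second_feature` helper, and the choice of font size plus alignment as the second paragraph feature are assumptions made for the example, not names taken from the embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paragraph:
    number: int        # paragraph number (position in the document)
    text: str
    font_size: float   # assumed part of the second paragraph feature
    alignment: str     # assumed part of the second paragraph feature


def second_feature(p: Paragraph):
    """Hypothetical second paragraph feature: (font size, alignment)."""
    return (p.font_size, p.alignment)


def merge_title_names(title_paragraph_set):
    """Split one title paragraph set into title name paragraph sets.

    Adjacent title paragraphs with the same second paragraph feature are
    collected together; a change of feature starts a new set.
    """
    ordered = sorted(title_paragraph_set, key=lambda p: p.number)
    groups = []
    for p in ordered:
        if groups and second_feature(groups[-1][-1]) == second_feature(p):
            groups[-1].append(p)   # same feature as the previous paragraph
        else:
            groups.append([p])     # feature changed: new title name set
    return groups
```

For a title set whose first two paragraphs share a feature and whose third differs, the sketch yields two title name paragraph sets.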
Optionally, the determining, as a main title, the title name paragraph set that meets the preset main title condition includes:
traversing the title name paragraph set and judging whether the title name paragraph set meets a preset main title condition;
and under the condition that the title name paragraph set meets the preset main title condition, determining the title name paragraph set as a main title.
Optionally, the determining whether the title name paragraph set meets a preset main title condition includes:
determining that the title name paragraph set meets the preset main title condition in the case that the title paragraphs contained in the title name paragraph set include the title paragraph with the smallest paragraph number;
and determining that the title name paragraph set does not meet the preset main title condition in the case that the title paragraphs contained in the title name paragraph set do not include the title paragraph with the smallest paragraph number.
Optionally, the method further comprises:
and under the condition that the title name paragraph set does not meet the preset main title condition, determining the title name paragraph set as a subtitle.
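A minimal sketch of this main title test is given below. It assumes that "the smallest paragraph number" is taken over all title name paragraph sets that came from the same title paragraph set; the function name and the `"main"`/`"sub"` labels are hypothetical.

```python
def classify_title_name_sets(title_name_sets):
    """Label each title name paragraph set as 'main' or 'sub'.

    title_name_sets: list of lists of paragraph numbers, all obtained
    from one title paragraph set (an assumption of this sketch).
    """
    smallest = min(n for numbers in title_name_sets for n in numbers)
    return ["main" if smallest in numbers else "sub"
            for numbers in title_name_sets]
```

Under this reading, the title name paragraph set that opens the title block is the main title and every later set is a subtitle.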
Optionally, after traversing the title name paragraph set, before determining whether the title name paragraph set meets a preset main title condition, the method further includes:
preprocessing the title name paragraph set to obtain a title character string;
determining the title type according to the key characters contained in the title character string under the condition that the title character string contains any key character in a key character set; the title types correspond to the title name paragraph sets one by one, and any key character in the key character set corresponds to one title type.
Optionally, the preprocessing the title name paragraph set to obtain a title character string includes:
and splicing the characters contained in the title paragraphs in the title name paragraph set, and deleting preset characters according to a splicing result to obtain a title character string.
Optionally, the preset characters include: blank characters; and/or characters between preset symbols.
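The preprocessing and title type lookup can be illustrated with the sketch below. The symbol pair, the key character table, and the decision to delete the preset symbols together with the characters between them are all assumptions for illustration; the embodiment does not fix these values here.

```python
import re

# Hypothetical preset symbol pairs; characters between them are deleted.
PRESET_SYMBOL_PAIRS = [("(", ")")]

# Hypothetical key character set; each key maps to one title type.
KEY_CHARACTERS = {"Notice": "notice", "Decision": "decision"}


def preprocess(title_paragraph_texts):
    """Splice the title paragraphs and delete the preset characters."""
    joined = "".join(title_paragraph_texts)
    for open_sym, close_sym in PRESET_SYMBOL_PAIRS:
        pattern = re.escape(open_sym) + r".*?" + re.escape(close_sym)
        joined = re.sub(pattern, "", joined)   # drop text between symbols
    return re.sub(r"\s+", "", joined)          # drop blank characters


def title_type(title_string):
    """Return the title type of the first key character found, else None."""
    for key, kind in KEY_CHARACTERS.items():
        if key in title_string:
            return kind
    return None
```

For example, the paragraphs `"Notice on "` and `"Office Safety (Trial)"` are spliced and reduced to a single string without blanks or the parenthesized part, and the key `"Notice"` then selects the title type.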
Optionally, the determining the title name paragraph set as a main title includes:
and marking the title paragraphs in the title name paragraph set according to the main title and the title types corresponding to the title name paragraph set.
Optionally, the method further comprises:
and under the condition that the title name paragraph set does not meet the preset main title condition, marking the title paragraphs in the title name paragraph set according to the subtitles and the title types corresponding to the title name paragraph set.
Optionally, the second paragraph feature comprises: font size and/or alignment.
Optionally, the determining a set of title paragraphs in the document to be identified according to the first paragraph feature includes:
determining one or more title paragraphs in the document to be identified according to the first paragraph characteristics; traversing the title paragraph;
collecting a currently traversed title paragraph and a next title paragraph in the same title paragraph set if the paragraph number of the currently traversed title paragraph is consecutive to the paragraph number of the next title paragraph; in the case where the paragraph number of the currently traversed title paragraph is not consecutive with the paragraph number of the next title paragraph, collecting the currently traversed title paragraph and the next title paragraph in different sets of title paragraphs.
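The grouping by consecutive paragraph numbers can be sketched as below. The function name is hypothetical, and paragraphs are represented only by their paragraph numbers for brevity.

```python
def group_consecutive(paragraph_numbers):
    """Group paragraph numbers into runs of consecutive numbers.

    Each run corresponds to one title paragraph set.
    """
    groups = []
    for n in sorted(paragraph_numbers):
        if groups and n == groups[-1][-1] + 1:
            groups[-1].append(n)   # consecutive with the previous paragraph
        else:
            groups.append([n])     # gap in numbering: start a new set
    return groups
```

Title paragraphs numbered 2, 3, 4, 9, 10 and 15 would fall into three title paragraph sets under this sketch.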
Optionally, the method further comprises:
determining a text paragraph set in the document to be identified according to the third paragraph characteristics;
determining one or more alternative hierarchical structures according to the text paragraph set;
determining the text hierarchical structure of the document to be identified from among the determined alternative hierarchical structures;
and determining a hierarchical title according to the text hierarchical structure of the document to be identified.
Optionally, the determining an alternative hierarchical structure from the set of body paragraphs includes:
determining paragraphs containing numbers in the text paragraph set;
determining a hierarchical relationship between paragraphs for the determined paragraphs;
determining the hierarchical relationship among the formats of the numbers contained in the paragraphs according to the hierarchical relationship among the paragraphs;
determining an alternative hierarchical structure according to the determined hierarchical relationship between the numbering formats; the determined alternative hierarchical structure includes one or more levels, and different levels correspond to different numbering formats.
Optionally, the determining a text hierarchy of the document to be recognized in the determined alternative hierarchies includes:
traversing the alternative hierarchical structures and determining the number of paragraphs corresponding to each alternative hierarchical structure; wherein the paragraphs corresponding to an alternative hierarchical structure are: the paragraphs in the text paragraph set whose contained number has the same format as any numbering format in the alternative hierarchical structure;
and after traversing, determining the text hierarchical structure of the document to be identified according to the number of paragraphs corresponding to the alternative hierarchical structure.
Optionally, the determining the text hierarchy of the document to be recognized according to the number of paragraphs corresponding to the alternative hierarchy includes:
determining the alternative hierarchical structure with the largest number of corresponding paragraphs as the text hierarchical structure of the document to be identified under the condition that the determined alternative hierarchical structure does not contain any preset hierarchical structure;
and under the condition that the determined alternative hierarchical structure comprises any preset hierarchical structure, determining the preset hierarchical structure with the maximum number of corresponding paragraphs as the text hierarchical structure of the document to be identified.
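The selection rule can be sketched as follows. A hierarchical structure is modeled here as a tuple of numbering format labels, one per level; the labels such as `"I."` and `"1."`, the function name, and the tie-breaking (the first structure with the largest count wins) are assumptions of this sketch.

```python
def choose_text_hierarchy(alternatives, paragraph_formats, presets):
    """Pick the text hierarchical structure among the alternatives.

    alternatives: list of tuples of numbering formats, one tuple per structure.
    paragraph_formats: numbering format of each numbered text paragraph.
    presets: set of preset hierarchical structures (tuples).
    """
    if not alternatives:
        return None

    def corresponding_count(structure):
        # Paragraphs whose number format appears at any level of the structure.
        return sum(1 for fmt in paragraph_formats if fmt in structure)

    preset_candidates = [s for s in alternatives if s in presets]
    pool = preset_candidates or alternatives   # prefer preset structures
    return max(pool, key=corresponding_count)
```

When no alternative is a preset structure, the alternative covering the most paragraphs is chosen; otherwise only the preset ones compete.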
Optionally, the determining a hierarchical title according to the body hierarchical structure of the document to be recognized includes:
determining the paragraphs corresponding to the text hierarchical structure; wherein the paragraphs corresponding to the text hierarchical structure are: the paragraphs in the text paragraph set whose contained number has the same format as any numbering format in the text hierarchical structure;
traversing the determined corresponding paragraph, determining the corresponding level of the numbering format contained in the currently traversed corresponding paragraph in the text hierarchy, and determining the currently traversed corresponding paragraph as a level title according to the determined level.
Optionally, the determining, according to the determined hierarchy, the corresponding paragraph of the current traversal as a hierarchy header includes:
determining the corresponding paragraph of the current traversal as a title or title body corresponding to the determined level.
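As a rough illustration, the mapping from numbering format to level might look like the sketch below. The tuple representation of the text hierarchical structure (highest level first), the format labels, and the function name are assumptions.

```python
def level_titles(text_hierarchy, numbered_formats):
    """Map each corresponding paragraph to its level in the text hierarchy.

    text_hierarchy: tuple of numbering formats, highest level first.
    numbered_formats: dict of paragraph number -> numbering format.
    Returns a dict of paragraph number -> level (1 is the highest level).
    """
    level_of = {fmt: i + 1 for i, fmt in enumerate(text_hierarchy)}
    return {
        number: level_of[fmt]
        for number, fmt in sorted(numbered_formats.items())
        if fmt in level_of   # only paragraphs corresponding to the hierarchy
    }
```

Paragraphs whose numbering format does not appear in the text hierarchical structure are simply left out of the result.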
Optionally, the determining a text paragraph set in the document to be recognized according to the third paragraph feature includes:
determining text paragraphs in the document to be identified according to the third paragraph characteristics; the text paragraphs are one or more;
traversing the text paragraphs;
under the condition that the paragraph sequence number of a currently traversed text paragraph is continuous with the paragraph sequence number of a next text paragraph, collecting the currently traversed text paragraph and the next text paragraph in the same text paragraph set;
and under the condition that the paragraph number of the currently traversed text paragraph is not continuous with the paragraph number of the next text paragraph, collecting the currently traversed text paragraph and the next text paragraph in different text paragraph sets.
Optionally, the determining a text paragraph set in the document to be recognized according to the third paragraph feature includes:
dividing the document to be identified into one or more attachment documents according to preset attachment paragraph features;
for the attachment document, collecting paragraphs with the same third paragraph characteristics in the attachment document as a prepared paragraph set; the prepared paragraph set is one or more;
for the paragraph with the smallest paragraph number and the paragraph with the largest paragraph number in the preparation paragraph set, determining all paragraphs between the paragraph with the smallest paragraph number and the paragraph with the largest paragraph number in the document to be identified, and adding the determined paragraphs into the preparation paragraph set;
performing merging processing on all the prepared paragraph sets in the attachment document, wherein the merging processing comprises merging the prepared paragraph sets with intersection; no intersection exists between different combined preparatory paragraph sets;
traversing the combined preparation paragraph set, and determining the currently traversed preparation paragraph set as a preparation title or a preparation text;
determining a sub-document according to the determined preparation title and the preparation text; one or more sub-documents are provided; and aiming at the subdocuments, collecting the paragraphs determined as the prepared text in the subdocuments to obtain a text paragraph set.
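The collection, filling, and merging of prepared paragraph sets amounts to interval merging, sketched below. The sketch assumes that paragraph numbers inside one attachment document are contiguous integers, so that "all paragraphs between" the smallest and largest number is a simple range; the function name and the feature labels are hypothetical.

```python
from collections import defaultdict


def prepared_paragraph_sets(features):
    """Build disjoint prepared paragraph sets for one attachment document.

    features: dict of paragraph number -> third paragraph feature (hashable).
    """
    by_feature = defaultdict(list)
    for number, feature in features.items():
        by_feature[feature].append(number)

    # Filling a set with everything between its smallest and largest
    # paragraph number turns it into the interval [min, max].
    intervals = sorted((min(ns), max(ns)) for ns in by_feature.values())

    merged = []
    for lo, hi in intervals:
        if merged and lo <= merged[-1][1]:     # intersects the previous set
            merged[-1][1] = max(merged[-1][1], hi)
        else:
            merged.append([lo, hi])
    return [list(range(lo, hi + 1)) for lo, hi in merged]
```

After merging, no two prepared paragraph sets share a paragraph, which is the property the later traversal relies on.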
Optionally, the determining, for the determined paragraphs, a hierarchical relationship between paragraphs includes:
adding paragraphs containing numbers with the same format in the text paragraph set to the same text paragraph subset; one or more text paragraph subsets are obtained;
traversing the paragraphs in the text paragraph subset from small to large according to paragraph numbers aiming at the text paragraph subset;
in the case that a next paragraph exists, adding the paragraphs in the text paragraph set whose paragraph numbers are greater than or equal to the paragraph number of the currently traversed paragraph and less than the paragraph number of the next paragraph to the coverage paragraph set of the currently traversed paragraph;
in the case that no next paragraph exists, adding the currently traversed paragraph to the coverage paragraph set of the currently traversed paragraph;
and determining the hierarchical relationship between paragraphs according to the coverage paragraph sets of the paragraphs containing numbers in the text paragraph set.
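A sketch of how the coverage paragraph sets could be computed is shown below. Paragraphs are represented by their paragraph numbers, and the numbering format labels are hypothetical; only the grouping and range rules come from the steps above.

```python
from collections import defaultdict


def coverage_sets(text_paragraph_numbers, numbered_formats):
    """Compute the coverage paragraph set of every numbered paragraph.

    text_paragraph_numbers: paragraph numbers in the text paragraph set.
    numbered_formats: dict of paragraph number -> numbering format,
        containing only the paragraphs that carry a number.
    """
    subsets = defaultdict(list)            # one subset per numbering format
    for number in sorted(numbered_formats):
        subsets[numbered_formats[number]].append(number)

    coverage = {}
    for numbers in subsets.values():
        for current, nxt in zip(numbers, numbers[1:]):
            # Everything from the current paragraph up to, but excluding,
            # the next paragraph that uses the same numbering format.
            coverage[current] = {
                n for n in text_paragraph_numbers if current <= n < nxt
            }
        coverage[numbers[-1]] = {numbers[-1]}   # no next paragraph exists
    return coverage
```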
Optionally, the determining the hierarchical relationship between paragraphs according to the coverage paragraph sets of the paragraphs containing numbers in the text paragraph set includes:
traversing the paragraphs containing numbers in the text paragraph set in ascending order of paragraph number;
in the case that the coverage paragraph set of the currently traversed paragraph and the coverage paragraph set of the next paragraph do not intersect: if the format of the number contained in the currently traversed paragraph is the same as the format of the number contained in the next paragraph, determining that the currently traversed paragraph and the next paragraph are at the same level; if the two formats differ, determining the hierarchical relationship according to similar paragraphs;
in the case that the coverage paragraph set of the currently traversed paragraph and the coverage paragraph set of the next paragraph intersect, determining the hierarchical relationship according to similar paragraphs;
wherein a similar paragraph is a paragraph whose contained number has the same format as the number contained in the next paragraph.
Optionally, the determining a hierarchical relationship according to similar paragraphs includes:
in the case that a similar paragraph exists among the traversed paragraphs, determining that the similar paragraph and the next paragraph are at the same level;
in the case that no similar paragraph exists among the traversed paragraphs, determining that the level of the currently traversed paragraph is one level higher than the level of the next paragraph.
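The decision rules above can be put together in the following sketch. It takes the coverage paragraph sets as given input. Two points are assumptions of the sketch rather than statements of the embodiment: when several similar paragraphs have been traversed, the most recent one is used, and a relationship is recorded as a tuple `(paragraph, paragraph, kind)` with `kind` being `"same"` or `"parent"`.

```python
def hierarchy_relations(numbered_formats, coverage):
    """Derive hierarchical relationships between numbered paragraphs.

    numbered_formats: dict of paragraph number -> numbering format.
    coverage: dict of paragraph number -> coverage paragraph set.
    Returns tuples (a, b, kind): 'same' means a and b share a level,
    'parent' means a is one level higher than b.
    """
    order = sorted(numbered_formats)
    relations = []
    for i in range(len(order) - 1):
        current, nxt = order[i], order[i + 1]
        disjoint = coverage[current].isdisjoint(coverage[nxt])
        if disjoint and numbered_formats[current] == numbered_formats[nxt]:
            relations.append((current, nxt, "same"))
            continue
        # Otherwise decide according to similar paragraphs: traversed
        # paragraphs whose numbering format matches the next paragraph.
        similar = [p for p in order[: i + 1]
                   if numbered_formats[p] == numbered_formats[nxt]]
        if similar:
            relations.append((similar[-1], nxt, "same"))   # assumed: latest
        else:
            relations.append((current, nxt, "parent"))
    return relations
```

With a chapter-style format at paragraphs 1 and 5 and an item-style format at paragraphs 2, 4 and 6, the sketch records that paragraph 1 is one level above paragraph 2 and that the remaining pairs share a level.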
Optionally, the determining a hierarchical relationship between formats of numbers contained in paragraphs according to the hierarchical relationship between paragraphs includes:
traversing the hierarchical relationship between the paragraphs;
determining paragraphs contained in the currently traversed hierarchical relationship aiming at the currently traversed hierarchical relationship;
determining the formats of the numbers contained in the determined paragraphs;
and determining the hierarchy relationship of the current traversal as the hierarchy relationship between the determined number formats.
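Lifting paragraph-level relationships to numbering formats can be sketched as below. The tuple representation `(paragraph, paragraph, kind)` and the removal of duplicate format relationships are assumptions made for this example.

```python
def format_relations(paragraph_relations, numbered_formats):
    """Translate relationships between paragraphs into relationships
    between the numbering formats those paragraphs contain.

    paragraph_relations: iterable of (paragraph_a, paragraph_b, kind).
    numbered_formats: dict of paragraph number -> numbering format.
    """
    result = []
    for a, b, kind in paragraph_relations:
        relation = (numbered_formats[a], numbered_formats[b], kind)
        if relation not in result:     # keep each format relation once
            result.append(relation)
    return result
```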
Optionally, the determining an alternative hierarchy according to the determined hierarchical relationship between the numbering formats includes:
determining one or more temporary hierarchical structures according to the determined hierarchical relationship between the numbering formats; each temporary hierarchical structure includes one or more levels, and different levels correspond to different numbering formats;
and traversing the determined temporary hierarchical structure, and determining the currently traversed temporary hierarchical structure as an alternative hierarchical structure under the condition that the currently traversed temporary hierarchical structure exists in a preset hierarchical structure set.
The embodiment of the invention also provides a device for identifying the title, which comprises:
the acquisition module is used for acquiring a document to be identified;
the paragraph set determining module is used for determining a title paragraph set in the document to be identified according to the first paragraph characteristics, wherein the title paragraph set is one or more;
a title name determining module, configured to merge paragraphs belonging to the same title name in the title paragraph set to obtain a title name paragraph set; the title name paragraph set is one or more;
and the main title determining module is used for determining the title name paragraph set meeting the preset main title condition as a main title.
Optionally, the title name determining module is configured to:
and collecting the title paragraphs in the title paragraph set that have the same second paragraph feature and consecutive paragraph numbers as a title name paragraph set.
Optionally, the title name determining module includes:
the paragraph traversing submodule is used for traversing the title paragraphs in the title paragraph set according to the paragraph serial numbers;
the collecting submodule is used for collecting the currently traversed title paragraph and the next title paragraph in the same title name paragraph set under the condition that the second paragraph feature of the currently traversed title paragraph is the same as that of the next title paragraph; in the event that the second paragraph feature of the currently traversed title paragraph is different from the next title paragraph, collecting the currently traversed title paragraph and the next title paragraph in different title name paragraph sets.
Optionally, the main title determining module includes:
the traversal submodule is used for traversing the title name paragraph set and judging whether the title name paragraph set meets the preset main title condition or not;
and the condition submodule is used for determining the title name paragraph set as a main title under the condition that the title name paragraph set meets the preset main title condition.
Optionally, the traversing sub-module includes:
a condition judgment sub-module, configured to determine that the title name paragraph set meets the preset main title condition in the case that the title paragraphs contained in the title name paragraph set include the title paragraph with the smallest paragraph number; and to determine that the title name paragraph set does not meet the preset main title condition in the case that the title paragraphs contained in the title name paragraph set do not include the title paragraph with the smallest paragraph number.
Optionally, the apparatus further comprises:
and the subtitle determining module is used for determining the title name paragraph set as a subtitle under the condition that the title name paragraph set does not accord with the preset main title condition.
Optionally, the main title determining module further includes:
the preprocessing submodule is used for preprocessing the title name paragraph set to obtain a title character string;
the title type determining submodule is used for determining the title type according to the key characters contained in the title character string under the condition that the title character string contains any key character in the key character set; the title types correspond to the title name paragraph sets one by one, and any key character in the key character set corresponds to one title type.
Optionally, the preprocessing sub-module is configured to splice characters included in the title paragraphs in the title name paragraph set, and delete preset characters according to a splicing result to obtain a title character string.
Optionally, the preset characters include: blank characters; and/or characters between preset symbols.
Optionally, the condition submodule includes:
and the main title marking submodule is used for marking the title paragraphs in the title name paragraph set according to the main title and the title types corresponding to the title name paragraph set.
Optionally, the main title determining module further includes:
and the sub-title marking sub-module is used for marking the title paragraphs in the title name paragraph set according to the sub-title and the title types corresponding to the title name paragraph set under the condition that the title name paragraph set does not accord with the preset main title condition.
Optionally, the second paragraph feature comprises: size and/or alignment.
Optionally, the paragraph set determining module includes:
the paragraph determining submodule is used for determining one or more title paragraphs in the document to be identified according to the first paragraph characteristics;
a paragraph collection submodule for traversing the title paragraphs; collecting a currently traversed title paragraph and a next title paragraph in the same title paragraph set if the paragraph number of the currently traversed title paragraph is consecutive to the paragraph number of the next title paragraph; in the case where the paragraph number of the currently traversed title paragraph is not consecutive with the paragraph number of the next title paragraph, collecting the currently traversed title paragraph and the next title paragraph in different sets of title paragraphs.
Optionally, the apparatus further comprises:
the text determining unit is used for determining a text paragraph set in the document to be identified according to the third paragraph characteristics;
an alternative structure determining unit, configured to determine an alternative hierarchical structure according to the text paragraph set; the determined alternative hierarchies are one or more;
the text structure determining unit is used for determining the text hierarchical structure of the document to be identified in the determined alternative hierarchical structure;
and the hierarchy title determining unit is used for determining a hierarchy title according to the text hierarchy of the document to be identified.
Optionally, the text determining unit includes:
the hierarchical relation determining subunit is used for determining the paragraphs containing the numbers in the text paragraph set; determining a hierarchical relationship between paragraphs for the determined paragraphs; determining the hierarchical relationship among the formats of the numbers contained in the paragraphs according to the hierarchical relationship among the paragraphs;
the alternative determining subunit is used for determining an alternative hierarchical structure according to the hierarchical relationship between the determined number formats; one or more levels are included in the determined alternative hierarchical structure, with different levels included corresponding to different numbering formats.
Optionally, the text structure determining unit includes:
the alternative traversing subunit is used for traversing the alternative hierarchical structures and determining the number of paragraphs corresponding to each alternative hierarchical structure; wherein the paragraphs corresponding to an alternative hierarchical structure are: the paragraphs in the text paragraph set whose contained number has the same format as any numbering format in the alternative hierarchical structure;
and the text determining subunit determines the text hierarchical structure of the document to be identified according to the number of paragraphs corresponding to the alternative hierarchical structure after traversal is finished.
Optionally, the text determining subunit is configured to:
determining the alternative hierarchical structure with the largest number of corresponding paragraphs as the text hierarchical structure of the document to be identified under the condition that the determined alternative hierarchical structure does not contain any preset hierarchical structure;
and under the condition that the determined alternative hierarchical structure comprises any preset hierarchical structure, determining the preset hierarchical structure with the maximum number of corresponding paragraphs as the text hierarchical structure of the document to be identified.
Optionally, the hierarchical title determining unit includes:
a text corresponding paragraph determining subunit, configured to determine the paragraphs corresponding to the text hierarchical structure; wherein the paragraphs corresponding to the text hierarchical structure are: the paragraphs in the text paragraph set whose contained number has the same format as any numbering format in the text hierarchical structure;
and the title determining subunit is used for traversing the determined corresponding paragraph, determining the corresponding level of the numbering format contained in the currently traversed corresponding paragraph in the text hierarchy, and determining the currently traversed corresponding paragraph as a level title according to the determined level.
Optionally, the title determining subunit is configured to:
determining the corresponding paragraph of the current traversal as a title or title body corresponding to the determined level.
Optionally, the text determining unit is configured to:
determining text paragraphs in the document to be identified according to the third paragraph characteristics; the text paragraphs are one or more;
traversing the text paragraphs;
under the condition that the paragraph sequence number of a currently traversed text paragraph is continuous with the paragraph sequence number of a next text paragraph, collecting the currently traversed text paragraph and the next text paragraph in the same text paragraph set;
and under the condition that the paragraph number of the currently traversed text paragraph is not continuous with the paragraph number of the next text paragraph, collecting the currently traversed text paragraph and the next text paragraph in different text paragraph sets.
Optionally, the text determining unit is configured to:
dividing the document to be identified into one or more attachment documents according to preset attachment paragraph features;
for the attachment document, collecting paragraphs with the same third paragraph characteristics in the attachment document as a prepared paragraph set; the prepared paragraph set is one or more;
for the paragraph with the smallest paragraph number and the paragraph with the largest paragraph number in the preparation paragraph set, determining all paragraphs between the paragraph with the smallest paragraph number and the paragraph with the largest paragraph number in the document to be identified, and adding the determined paragraphs into the preparation paragraph set;
performing merging processing on all the prepared paragraph sets in the attachment document, wherein the merging processing comprises merging the prepared paragraph sets with intersection; no intersection exists between different combined preparatory paragraph sets;
traversing the combined preparation paragraph set, and determining the currently traversed preparation paragraph set as a preparation title or a preparation text;
determining a sub-document according to the determined preparation title and the preparation text; one or more sub-documents are provided; and aiming at the subdocuments, collecting the paragraphs determined as the prepared text in the subdocuments to obtain a text paragraph set.
Optionally, the hierarchical relationship determining subunit is configured to:
adding the paragraphs in the text paragraph set whose numbers have the same format to the same text paragraph subset, thereby obtaining one or more text paragraph subsets;
for each text paragraph subset, traversing the paragraphs in the subset in ascending order of paragraph number;
in the case that a next paragraph exists, adding the paragraphs in the text paragraph set whose paragraph numbers are greater than or equal to the paragraph number of the currently traversed paragraph and less than the paragraph number of the next paragraph to a covering paragraph set of the currently traversed paragraph;
in the case that no next paragraph exists, adding the currently traversed paragraph to the covering paragraph set of the currently traversed paragraph;
and determining the hierarchical relationship among the paragraphs according to the covering paragraph sets of the numbered paragraphs in the text paragraph set.
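The construction of covering paragraph sets described above can be sketched as follows. The function name `covering_sets` and the representation of a number format as a plain string label are assumptions made for illustration; the original text does not specify how number formats are encoded.

```python
from collections import defaultdict


def covering_sets(all_numbers, number_formats):
    """Sketch of computing a covering paragraph set per numbered paragraph.

    all_numbers:    iterable of every paragraph number in the text paragraph set.
    number_formats: dict mapping paragraph number -> number format label,
                    for numbered paragraphs only (labels are hypothetical).
    Returns a dict mapping paragraph number -> set of covered paragraph numbers.
    """
    all_numbers = list(all_numbers)

    # Group numbered paragraphs by number format, in ascending paragraph order.
    subsets = defaultdict(list)
    for number in sorted(number_formats):
        subsets[number_formats[number]].append(number)

    cover = {}
    for members in subsets.values():
        for i, current in enumerate(members):
            if i + 1 < len(members):
                nxt = members[i + 1]
                # Covers everything from the current paragraph up to,
                # but not including, the next paragraph of the same format.
                cover[current] = {p for p in all_numbers if current <= p < nxt}
            else:
                # No next paragraph in the subset: covers only itself.
                cover[current] = {current}
    return cover
```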
Optionally, the hierarchical relationship determining subunit is configured to:
traversing the numbered paragraphs in the text paragraph set in ascending order of paragraph number;
in the case that the covering paragraph set of the currently traversed paragraph does not intersect the covering paragraph set of the next paragraph: if the number format of the currently traversed paragraph is the same as the number format of the next paragraph, determining that the currently traversed paragraph and the next paragraph are at the same level; if the two number formats are different, determining the hierarchical relationship according to a similar paragraph;
in the case that the covering paragraph set of the currently traversed paragraph intersects the covering paragraph set of the next paragraph, determining the hierarchical relationship according to a similar paragraph;
wherein a similar paragraph is a paragraph whose number has the same format as the number in the next paragraph.
Optionally, the hierarchical relationship determining subunit is configured to:
in the case that a similar paragraph exists among the traversed paragraphs, determining that the similar paragraph and the next paragraph are at the same level;
in the case that no similar paragraph exists among the traversed paragraphs, determining that the currently traversed paragraph is one level higher than the next paragraph.
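The decision rules above can be sketched as follows. The function name `hierarchy_relations`, the relation labels "same" and "parent", and the choice of the most recently traversed similar paragraph when several exist are assumptions for illustration; the covering paragraph sets are taken as a precomputed input.

```python
def hierarchy_relations(number_formats, cover):
    """Sketch of deriving hierarchical relationships between numbered paragraphs.

    number_formats: dict mapping paragraph number -> number format label.
    cover:          dict mapping paragraph number -> covering paragraph set.
    Returns a list of (a, b, relation) tuples, where relation is
    "same" (a and b are at the same level) or
    "parent" (a is one level higher than b).
    """
    numbered = sorted(number_formats)
    relations = []
    for i in range(len(numbered) - 1):
        current, nxt = numbered[i], numbered[i + 1]
        disjoint = not (cover[current] & cover[nxt])

        # Disjoint covering sets and identical formats: same level.
        if disjoint and number_formats[current] == number_formats[nxt]:
            relations.append((current, nxt, "same"))
            continue

        # Otherwise decide by similar paragraphs: already traversed
        # paragraphs whose number format equals that of the next paragraph.
        similar = [p for p in numbered[: i + 1]
                   if number_formats[p] == number_formats[nxt]]
        if similar:
            # Assumption: use the most recently traversed similar paragraph.
            relations.append((similar[-1], nxt, "same"))
        else:
            relations.append((current, nxt, "parent"))
    return relations
```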
Optionally, the hierarchical relationship determining subunit is configured to:
traversing the hierarchical relationship between the paragraphs;
for the currently traversed hierarchical relationship, determining the paragraphs that the hierarchical relationship involves;
determining the number format contained in each of the determined paragraphs;
and converting the currently traversed hierarchical relationship into a hierarchical relationship between the determined number formats.
Optionally, the alternative determining subunit is configured to:
determining temporary hierarchical structures according to the determined hierarchical relationships between number formats; there are one or more temporary hierarchical structures, each containing one or more levels, with different levels corresponding to different number formats;
and traversing the determined temporary hierarchical structures, and determining the currently traversed temporary hierarchical structure as an alternative hierarchical structure in the case that it exists in a preset hierarchical structure set.
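The filtering step above can be sketched as follows, under the assumption that a hierarchical structure is represented as an ordered sequence of number format labels from the top level down. The function name `select_alternative_structures` and this representation are hypothetical.

```python
def select_alternative_structures(temporary_structures, preset_structures):
    """Sketch of keeping only the temporary structures found in the preset set.

    Each structure is a sequence of number format labels, ordered from the
    highest level to the lowest (hypothetical representation).
    """
    preset = {tuple(structure) for structure in preset_structures}
    return [list(s) for s in temporary_structures if tuple(s) in preset]
```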
The embodiment of the invention also provides electronic equipment which comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete mutual communication through the communication bus;
a memory for storing a computer program;
and a processor for implementing any of the above method steps for identifying a title when executing the program stored in the memory.
An embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements any of the above method steps for identifying a title.
According to the method for identifying a title provided by the embodiment of the invention, the title paragraph set, and the main title within it, can be determined by using the features of the title. The method tolerates various editing errors made by the user and can identify titles under the varied conditions that arise in complex document editing. It is therefore not only fault-tolerant and convenient for users to edit with, but also makes it easy to automatically adjust an identified title to the correct typesetting format, which simplifies user operation and improves the user experience.
Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings described below show only some of the embodiments of the present invention, and a person skilled in the art can obtain other drawings based on them.
FIG. 1 is a schematic diagram of an example of title editing provided by an embodiment of the present invention;
FIG. 2 is a diagram illustrating an example of correct layout of titles according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating an example of a title mistypesetting according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a method for identifying a title according to an embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating a document interaction in a method for identifying a title according to an embodiment of the present invention;
FIG. 6 is a flow chart of another method for identifying a title according to an embodiment of the present invention;
FIG. 7 is a diagram of a document to be identified according to an embodiment of the present invention;
FIG. 8 is a diagram of another document to be identified according to an embodiment of the present invention;
FIG. 9 is a diagram of another document to be identified according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of an apparatus for identifying a title according to an embodiment of the present invention;
FIG. 11 is a diagram illustrating an example of a hierarchical title edit provided by an embodiment of the present invention;
FIG. 12 is a diagram illustrating an example of a hierarchical correct layout of titles according to an embodiment of the present invention;
FIG. 13 is a diagram illustrating an example of a hierarchical title misclassification according to an embodiment of the present invention;
FIG. 14 is a flowchart illustrating a method for identifying hierarchical titles according to an embodiment of the present invention;
FIG. 15 is a schematic diagram illustrating interaction of documents in a method for identifying hierarchical headings according to an embodiment of the present invention;
FIG. 16 is a flow chart of another method for identifying hierarchical titles provided by an embodiment of the present invention;
FIG. 17 is a diagram illustrating a document to be identified according to an embodiment of the present invention;
FIG. 18 is a diagram of another document to be identified according to an embodiment of the present invention;
FIG. 19 is a block diagram of an apparatus for identifying hierarchical titles according to an embodiment of the present invention;
FIG. 20 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the embodiments of the present invention, the technical solutions in the embodiments of the present invention will be described in detail below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments made by those of ordinary skill in the art based on the embodiments of the present invention should fall within the scope of the disclosure.
Currently, when editing a document with document editing software, a user may need to edit documents that have a fixed typesetting format, such as administrative official documents.
In an administrative official document, the title has a fixed typesetting format. For example, the title is set in size-2 Xiaobiaosong (small-title Song) typeface, placed two blank lines below the red separator line, and centered on one or more lines; when the title wraps, each line should be semantically complete, the lines should be arranged symmetrically with appropriate lengths and spacing, and the title should form a trapezoid or diamond shape.
Alternatively, the title is placed two blank lines below the red separator line, set in size-2 Song typeface, and centered on one or more lines; when the title wraps, each line should be semantically complete, and the lines should be arranged symmetrically with appropriate spacing.
When editing a document, a user typically edits the title content using the format of the body text. Specifically, after the title content is edited in the text format, the alignment mode of the title content is set to be centered alignment, and the font size of the title content is adjusted.
A specific example is shown in FIG. 1, which is a schematic diagram of an example of title editing provided by an embodiment of the present invention.
Users often need to manually adjust the typesetting format of the title content in a document, which is not only cumbersome but also error-prone. For example, when the title content is long and must wrap, arranging it in a trapezoid or diamond shape often requires tedious operations. In principle, software could recognize the title and typeset it automatically, but current document editing software typically does not support title identification.
For ease of understanding, FIG. 2 shows a schematic diagram of an example of correct title layout according to an embodiment of the present invention. In FIG. 2, the title edited in FIG. 1 has been manually adjusted by the user according to the typesetting format specification for titles.
If the intelligent identification function of the document editing software, or another intelligent identification plug-in, fails to identify the title, the title is usually identified as body text and adjusted according to the body-text format that the user used when editing the title content. Specifically, the title may be set to left alignment with an indent of 2 characters on the left.
FIG. 3 is a schematic diagram of an example of title mistypesetting according to an embodiment of the present invention. FIG. 3 shows the layout produced for the title edited in FIG. 1 after the intelligent identification plug-in identifies it as body text.
In order to solve the above problems, the present invention provides a method of recognizing a title. The method can be applied to document editing software and other software.
In the method for identifying the title provided by the invention, the title can be identified according to the characteristics of the title in the document. Specifically, the probability that a paragraph is a title may be further determined by using features of the paragraph in the document, such as word size, alignment, and paragraph content, so as to further determine the title in the document. After the title in the document is identified, the title is conveniently and automatically adjusted to be in a correct typesetting format, so that the operation of a user is facilitated, and errors are avoided.
Certainly, after the title is identified, the main title and the subtitle in the title can be further identified, so that the correct typesetting formats can be conveniently adjusted for the main title and the subtitle respectively, and the user experience is further improved.
After the main title is identified, the possible title types may also be determined based on the content in the main title. Such as meeting minutes, survey reports, interviews, and the like. Therefore, the corresponding format or content template can be recommended for the user according to the determined title type, the use by the user is facilitated, and the user experience is further improved.
For example, after the title type is determined to be the survey report, a plurality of content templates of the pre-stored survey report can be called out for the user to select and use.
A method for identifying a title provided by an embodiment of the present invention is described in detail below with reference to specific embodiments.
Before describing the method in detail, first, the concepts involved in the process flow of the method are explained.
Document: specifically, the object edited by the user in the document editing software. A document is composed of characters and includes at least one paragraph. Paragraphs can be delimited by carriage return symbols.
It should be noted that the paragraphs in a document are ordered, and each paragraph has a corresponding paragraph number. Specifically, numbering may start at 1 for the first paragraph at the head of the document and increase by 1 for each subsequent paragraph up to the last paragraph of the document.
In addition, a title and a body may be included in the document. Of course, the number of titles and bodies in a document is not limited, and a particular document may include multiple titles and multiple bodies. For example, the plurality of titles are the titles of the first chapter, the second chapter, the tenth chapter of the document, respectively, and there may be a corresponding body for each chapter.
Title: specifically, one or more paragraphs that indicate the content of the document and may serve as a summary of it. Paragraphs that are titles typically have a particular layout; for example, the alignment is typically centered and the font size is typically larger than that of the body.
A single title may, as a whole, contain one or more names.
For example, the title "A marketing method: taking car sales as an example" contains two names as a whole, namely "A marketing method" and "taking car sales as an example". For convenience of description, each such name is referred to as a title name. A single title may include one or more title names.
The different title names in a single title may be a main title and a subtitle, respectively. A subtitle may supplement or explain the main title. Subtitles typically have a particular layout; for example, the alignment is typically centered or right, and the font size is typically larger than that of the body but smaller than that of the main title.
It should be noted that a single title may contain a main title, or a main title and a subtitle.
Paragraph features: specifically, a paragraph may have features of its own, such as alignment, font size, maximum font size, minimum font size, font style, paragraph content, whether a number is included, the format of the included number, whether the text is bold, and the like. Paragraph features can help determine whether a paragraph belongs to a title or to the body.
FIG. 4 is a flowchart illustrating a method for identifying a title according to an embodiment of the present invention. The method includes the following steps.
S101: acquiring the document to be identified.
Optionally, the document to be identified may be any document; for convenience of description, the document in which titles are to be identified is referred to as the document to be identified.
In an alternative embodiment, the electronic device for editing the document to be recognized may execute the method flow locally and not interact with other devices. In a specific example, the method flow may be implemented by local logic included in a document editing software client.
In another alternative embodiment, other electronic devices interacting with the electronic device editing the document to be recognized may also perform the method flow. In a specific example, the method flow may be implemented by a server interacting with a document editing software client.
Therefore, in an alternative embodiment, the client may send any document to the server based on the operation of the user, that is, the server receives the document to be identified sent by the client, so that the server identifies the title in the document to be identified, and returns the identification result to the client. The recognition result may be a document containing a title label, or may be a paragraph identifier recognized as a title, and the paragraph identifier may specifically be a paragraph number.
S102: determining a title paragraph set in the document to be identified according to first paragraph features.
Optionally, a first paragraph feature may be a paragraph feature used to help identify titles, and may specifically include: whether a number is included, the font size, whether the text is bold, the alignment, and the like.
One or more title paragraph sets may be determined.
In an optional embodiment, the probability that each paragraph in the document to be identified belongs to the body is determined based on the first paragraph features. Specifically, a machine learning method may take the first paragraph features of a paragraph as input and output the probability that the paragraph belongs to the body.
Further, the paragraphs whose body probability is greater than a preset probability may be determined, and the maximum font size of the body may be determined from those paragraphs. All paragraphs in the document to be identified whose font size is larger than the maximum body font size are then identified as title paragraphs.
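The font-size comparison described above can be sketched as follows. The body probabilities are taken as a given input, since the original text does not specify the machine learning model; the function name `find_title_paragraphs`, the dictionary keys, and the default threshold of 0.5 are all assumptions, and each paragraph is simplified to a single font size.

```python
def find_title_paragraphs(paragraphs, body_probability, threshold=0.5):
    """Sketch of identifying title paragraphs by font size.

    paragraphs:       list of dicts with "number" and "font_size" keys
                      (hypothetical structure; one font size per paragraph).
    body_probability: dict mapping paragraph number -> probability that the
                      paragraph is body text (assumed to come from a model).
    threshold:        preset probability (the value 0.5 is an assumption).
    Returns the paragraph numbers identified as title paragraphs.
    """
    # Font sizes of the paragraphs that are confidently body text.
    body_sizes = [p["font_size"] for p in paragraphs
                  if body_probability[p["number"]] > threshold]
    if not body_sizes:
        return []  # No confident body paragraph, so no reference size.

    max_body_size = max(body_sizes)
    # Every paragraph larger than the largest body font size is a title paragraph.
    return [p["number"] for p in paragraphs if p["font_size"] > max_body_size]
```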
Optionally, the identified title paragraphs may be traversed; in the case that the paragraph number of the currently traversed title paragraph is consecutive with the paragraph number of the next title paragraph, the currently traversed title paragraph and the next title paragraph are collected in the same title paragraph set; in the case that the two paragraph numbers are not consecutive, the currently traversed title paragraph and the next title paragraph are collected in different title paragraph sets.
One or more title paragraph sets may be obtained through the above embodiment. A title paragraph set may include one title paragraph, or a plurality of title paragraphs with consecutive paragraph numbers.
Consecutive paragraph numbers are paragraph numbers that increase in order, by 1 each time. For example, 12, 13, 14 are consecutive paragraph numbers.
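Grouping title paragraphs by consecutive paragraph numbers can be sketched as follows. The function name `group_consecutive` is hypothetical, and the input is simplified to a plain collection of paragraph numbers.

```python
def group_consecutive(paragraph_numbers):
    """Sketch of splitting paragraph numbers into runs of consecutive numbers.

    Each returned run corresponds to one title paragraph set.
    """
    groups = []
    for number in sorted(paragraph_numbers):
        if groups and number == groups[-1][-1] + 1:
            # Consecutive with the previous paragraph: same set.
            groups[-1].append(number)
        else:
            # Not consecutive: start a new set.
            groups.append([number])
    return groups
```

For example, the paragraph numbers 12, 13, 14, 20, 21 and 30 yield three title paragraph sets.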
In this embodiment, the title in the document to be recognized can be determined by determining the title paragraph set, so as to facilitate the subsequent steps. And the first paragraph characteristic is used for identification, so that the identification efficiency can be improved, and the identification accuracy can be improved.
S103: merging the paragraphs belonging to the same title name in the title paragraph set to obtain title name paragraph sets.
Optionally, S103 may be performed for each set of title paragraphs determined in S102. Specifically, S103 may be executed by traversing each determined title paragraph set.
A single set of title paragraphs may be a single title identified from the document to be identified, and a single title may contain one or more title names. Thus, for a single set of title paragraphs, the resultant set of title name paragraphs may be one or more by merging paragraphs belonging to the same title name. A single title name paragraph set may be a single title name identified from a title. The title name may be determined as a main title, and may also correspond to a title type. Thus, a title name paragraph set may be determined as a main title, and may also correspond to a title type.
S104: determining the title name paragraph set that meets a preset main title condition as the main title.
Optionally, S103 and S104 may be executed for each title paragraph set determined in S102, so that a main title is determined from the title name paragraph sets obtained in S103. Specifically, S103 and S104 may be performed by traversing each determined title paragraph set.
The single set of title paragraphs may be a single title identified from the document to be identified and the set of title name paragraphs may be title names determined from the single title, so that the main title may be further determined from the determined title names.
For S103, a single title paragraph set may be a single title identified from the document to be identified, and a single title may contain one or more title names: either both a main title and a subtitle, or only a main title.
Usually, a single title name is edited within a single paragraph. In actual document editing, however, the same title name may be spread over several paragraphs.
For example, when the title name is long and occupies multiple lines of the document, a user may habitually break the line manually, so that the same title name is edited in several consecutive paragraphs.
A specific title name editing example is as follows.
About issuing "xx Notification of Business
Notification to employees for summary of learning
Here, the first line and the second line are two separate paragraphs. Both paragraphs belong to the same title name, which is the two lines read together as one notification title.
Therefore, to accommodate this editing situation, all paragraphs belonging to the same title name need to be determined, so that the complete title name can be identified accurately regardless of manual line breaks. This improves the accuracy of title identification and facilitates the subsequent identification of the main title.
Because a title name is a single complete name, even when it spans several paragraphs, those paragraphs may share the same paragraph features; for example, the same font size and alignment. In addition, the paragraph numbers of the paragraphs included in one title name are consecutive.
The paragraph features of the included paragraphs may be different between different title names.
Taking the main title and the subtitle as an example, if a title has a subtitle, the subtitle usually has a different layout format from the main title. For example, the main title is typically centered while the subtitle is typically right-aligned, and the font size of the main title is typically larger than that of the subtitle.
Accordingly, the paragraphs of the main title and the paragraphs of the subtitle typically differ in layout format, for example in alignment or in font size.
Therefore, whether different paragraphs belong to the same title name can be determined by judging whether paragraphs with consecutive paragraph numbers have the same paragraph features.
Optionally, in S103, merging the paragraphs belonging to the same title name in the title paragraph set to obtain title name paragraph sets may specifically include: collecting the title paragraphs in the title paragraph set that have the same second paragraph features and consecutive paragraph numbers as a title name paragraph set.
Optionally, for each set of title paragraphs determined in S102, the title paragraphs in each set of title paragraphs that have the same second paragraph characteristics and consecutive paragraph numbers are collected as a set of title name paragraphs, so that one or more sets of title name paragraphs can be obtained.
In this embodiment, the title name paragraph set can be determined conveniently and quickly by the second paragraph feature and the feature of the paragraph number being continuous, so that the recognition efficiency can be improved.
Optionally, the second paragraph feature may be a paragraph feature for distinguishing different title names, and may specifically include the font size and/or the alignment.
In this embodiment, different second paragraph characteristics may be defined to accommodate different editing situations of the title in the document to be recognized.
In an optional embodiment, the second paragraph feature may include the font size but not the alignment.
In actual document editing, different paragraphs of the same title name may have the same font size but different alignments.
For example, a long title name may span multiple paragraphs, one of which fills a complete line of the document. For a paragraph that fills a whole line, it is often hard for the user to tell its alignment by eye, so the user may set centered alignment only for the other paragraphs.
A specific title example is shown below.
About issuing "xx New pay System, Performance New affirmation mode and promotion New degree" to xx
Notification of affiliates
This title contains 2 paragraphs. Because the first paragraph fills a complete line, the user may mistakenly think that it is already centered, and therefore sets centered alignment only for the second paragraph.
Thus, the alignment may differ between the paragraphs contained in the same title name.
To accommodate this editing situation and make editing more convenient for the user, the second paragraph feature may be defined to include the font size but not the alignment.
In another optional embodiment, the second paragraph feature may include the alignment but not the font size.
In actual document editing, different paragraphs of the same title name may have the same alignment but different font sizes.
For example, one paragraph of a title name may have been typed in directly, while another paragraph was copied and pasted from another document and so carries a different font size. When the two font sizes are close, for example size 4 and small size 4, users often cannot tell them apart by eye, or may simply overlook the difference through carelessness.
A specific title example is shown below.
About issuing "xx New pay System, Performance New affirmation mode and promotion New degree" to xx
Notification of affiliates
This title contains 2 paragraphs: the first paragraph is set in size 5 and the second paragraph, "Notification of affiliates", in small size 4. It is often difficult for a user to notice that the two font sizes differ, or the user may overlook the difference through carelessness.
Thus, the font size may differ between the paragraphs contained in the same title name.
To accommodate this editing situation and make editing more convenient for the user, the second paragraph feature may be defined to include the alignment but not the font size.
In an optional embodiment, the second paragraph feature may also include other paragraph features, such as the typeface of the paragraph; for example, Song typeface versus regular script may also be used to distinguish title names.
Since in some cases both the main title and the subtitle are center-aligned, in an optional embodiment the alignment may be restricted to centered alignment.
Optionally, the title paragraphs in the title paragraph set that have the same second paragraph features, are center-aligned, and have consecutive paragraph numbers may be collected as a title name paragraph set.
In an optional embodiment, collecting the title paragraphs in the title paragraph set that have the same second paragraph features and consecutive paragraph numbers as a title name paragraph set may specifically include: traversing the title paragraphs in the title paragraph set according to paragraph number; in the case that the second paragraph features of the currently traversed title paragraph are the same as those of the next title paragraph, collecting the currently traversed title paragraph and the next title paragraph in the same title name paragraph set; in the case that the second paragraph features of the currently traversed title paragraph differ from those of the next title paragraph, collecting the currently traversed title paragraph and the next title paragraph in different title name paragraph sets.
Since a title paragraph set consists of title paragraphs with consecutive paragraph numbers, the title paragraphs can be traversed directly according to paragraph number, and any two title paragraphs traversed one after the other have consecutive paragraph numbers.
Optionally, the title paragraphs in the title paragraph set may be traversed in ascending order of paragraph number, or in descending order of paragraph number.
Further, if the second paragraph features of two adjacent title paragraphs are the same, it may be determined that the two title paragraphs belong to the same title name, and they may be collected in the same title name paragraph set.
In this embodiment, by means of the title paragraphs with consecutive paragraph numbers contained in the title paragraph set, the title paragraphs with the same second paragraph characteristics and consecutive paragraph numbers can be determined conveniently and quickly in a traversal manner, so that the title name paragraph set is determined, and the recognition efficiency is improved.
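The traversal-based merging described above can be sketched as follows. The function name `split_into_title_names`, the dictionary keys, and the `feature_keys` parameter used to select which features make up the second paragraph feature are assumptions for illustration; the input is assumed to be one title paragraph set already sorted by consecutive paragraph number.

```python
def split_into_title_names(title_paragraphs, feature_keys=("font_size", "alignment")):
    """Sketch of merging adjacent title paragraphs into title name paragraph sets.

    title_paragraphs: list of dicts with "number" plus feature keys, sorted by
                      paragraph number (hypothetical structure).
    feature_keys:     which features form the second paragraph feature,
                      e.g. ("font_size",) to ignore alignment.
    Returns a list of lists of paragraph numbers, one list per title name.
    """
    groups = []
    previous_signature = None
    for paragraph in title_paragraphs:
        signature = tuple(paragraph[key] for key in feature_keys)
        if groups and signature == previous_signature:
            # Same second paragraph feature as the previous paragraph.
            groups[-1].append(paragraph["number"])
        else:
            # Feature changed: a new title name starts here.
            groups.append([paragraph["number"]])
        previous_signature = signature
    return groups
```

Passing `feature_keys=("font_size",)` corresponds to the embodiment in which the second paragraph feature includes the font size but not the alignment, so a first line that the user forgot to center is still merged with the rest of its title name.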
In other alternative embodiments, collecting the title paragraphs in the title paragraph set that have the same second paragraph feature and consecutive paragraph numbers as a title name paragraph set may also be done by determining the paragraphs in the title paragraph set whose second paragraph features differ and segmenting the set with the determined paragraphs as boundaries.
With respect to S104, in an alternative embodiment, determining the title name paragraph set meeting the preset main title condition as the main title may include: traversing the title name paragraph set, and judging whether the currently traversed title name paragraph set meets the preset main title condition; and under the condition that the currently traversed title name paragraph set meets the preset main title condition, determining the currently traversed title name paragraph set as a main title.
Optionally, traversing the title name paragraph sets may be: for each title paragraph set determined in S102, traversing each title name paragraph set obtained from that title paragraph set.
In this embodiment, the main title may be accurately determined by traversing the title name paragraph set.
In other alternative embodiments, the title name paragraph set meeting the preset main title condition may also be determined directly according to the preset main title condition without traversing.
For example, where the title name paragraph sets are obtained by traversing and merging paragraphs, the preset main title condition may be that the title name paragraph set is the first one obtained by merging, so that the first title name paragraph set obtained by merging can be directly determined as the main title.
For the preset main title condition, in an alternative embodiment, the main title may be determined according to the characteristics of the main title.
Typically, the main title is at the beginning of the title, while subtitles are usually edited after the main title. Therefore, optionally, the main title contains the first paragraph of the title, i.e., the paragraph with the smallest paragraph number among the paragraphs contained in the title. Thus, the preset main title condition may include: the title paragraphs included in the title name paragraph set include the title paragraph with the smallest paragraph number in the corresponding title paragraph set.
Optionally, since a title name paragraph set is obtained by merging the paragraphs in a title paragraph set that belong to the same title name, the obtained title name paragraph set corresponds to that title paragraph set. In other words, a title paragraph set corresponds to the title name paragraph sets obtained by merging its paragraphs.
Of course, the preset main title condition may include other conditions. In other alternative embodiments, the preset main title condition may include: the title name paragraph set is the first one when the title name paragraph sets are ordered from front to back according to their positions in the document to be identified.
Correspondingly, in an optional embodiment, determining whether the title name paragraph set meets the preset main title condition may include: determining that the title name paragraph set meets the preset main title condition in the case where the title paragraphs included in the title name paragraph set include the title paragraph with the smallest paragraph number in the corresponding title paragraph set.
Optionally, in the case where the title paragraphs included in the title name paragraph set do not include the title paragraph with the smallest paragraph number in the corresponding title paragraph set, it is determined that the title name paragraph set does not meet the preset main title condition.
In this embodiment, whether the title name paragraph set meets the preset main title condition is simply and directly determined by the title paragraph with the smallest paragraph number in the title paragraph set, so as to improve the recognition efficiency.
It should be noted that, since there is only one main title in a title, traversing the set of title name paragraphs may be optionally stopped after determining the main title, thereby saving computing resources and improving recognition efficiency.
After the main title is determined by the above method, the subtitles may be further determined, since some scenarios require subtitles to be recognized. Therefore, optionally, after the main title is determined, all title name paragraph sets that are not determined as the main title may be determined as subtitles.
Alternatively, a title name paragraph set that does not meet the preset main title condition may be determined as a subtitle.
Specifically: in the case where the title name paragraph set does not meet the preset main title condition, the title name paragraph set is determined as a subtitle.
In this embodiment, the subtitle may be additionally determined, so that the typesetting format of the identified subtitle is conveniently adjusted, and the user operation is facilitated.
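The main title condition and the subtitle rule above can be illustrated with a short sketch. It assumes, for illustration only, that each title name paragraph set is represented as a list of paragraph numbers and that the title paragraph set is non-empty; the function name `find_main_title` is hypothetical.

```python
from typing import List, Optional, Tuple


def find_main_title(
    title_paragraph_numbers: List[int],
    title_name_sets: List[List[int]],
) -> Tuple[Optional[List[int]], List[List[int]]]:
    """Return (main title, subtitles) for one title paragraph set."""
    # The preset main title condition: the set contains the paragraph with
    # the smallest paragraph number in the corresponding title paragraph set.
    smallest = min(title_paragraph_numbers)
    main: Optional[List[int]] = None
    for name_set in title_name_sets:
        if smallest in name_set:
            main = name_set
            break  # a title has only one main title, so stop traversing
    # Every title name paragraph set not chosen as the main title is a subtitle.
    subtitles = [s for s in title_name_sets if s is not main]
    return main, subtitles


# Hypothetical title made of paragraphs 12, 13, 15 and 16.
main, subs = find_main_title([12, 13, 15, 16], [[12, 13], [15, 16]])
```

With this input, `main` is the set covering paragraphs 12 and 13, and the remaining set is treated as a subtitle.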
In addition to determining the main title and the sub-title, the method flow provides an alternative embodiment for determining the title type corresponding to the title. The determined title type can help recommend typesetting or content formats required for editing the document to the user, and the user can conveniently use the typesetting or content formats.
In an optional embodiment, the title types can be determined respectively for the main title and the sub-title, so that the content template recommended to the user can be determined comprehensively according to the title type of the main title and the title type of the sub-title, the use by the user is facilitated, and the user experience is improved.
The title name paragraph set determined in the above embodiment may be regarded as a title name, and the title type is determined for the title name paragraph set.
Therefore, optionally, in the process of traversing the title name paragraph set, before judging whether the title name paragraph set meets the preset main title condition, preprocessing may be performed on the currently traversed title name paragraph set to obtain a title character string; and determining the title type according to the key characters contained in the title character string under the condition that the obtained title character string contains any key character in the key character set.
The title type may correspond to a title name paragraph set one by one, and any key character in the key character set corresponds to a title type.
Optionally, in a case where the obtained title character string contains a single key character in the key character set, the uniquely corresponding title type may be determined directly according to the contained single key character.
Optionally, when the obtained title character string includes a plurality of key characters in the key character set, according to a preset priority of a title type, a title type with a highest priority is determined as a unique corresponding title type from a plurality of title types respectively corresponding to the plurality of key characters included.
Optionally, in a case where the obtained title character string includes a plurality of key characters in the key character set, the uniquely corresponding title type may be determined according to positions of the included plurality of key characters in the title character string. Generally, the accuracy of the title type determined according to the key character at the end of the title character string is high, and therefore, the title type corresponding to the key character at the position closest to the end of the title character string in a plurality of contained key characters can be determined as the only corresponding title type according to the position of the key character in the title character string.
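The two ways of resolving several matched key characters (by preset priority, or by position closest to the end of the string) can be sketched as follows. The key character table, the priority order, and the function names are hypothetical values chosen for the example; the source leaves the actual tables to the implementer.

```python
from typing import Dict, List, Optional

# Hypothetical key character table and priority order (highest first).
KEY_TO_TYPE: Dict[str, str] = {
    "notification": "notification",
    "work plan": "plan",
    "work flow": "flow",
}
TYPE_PRIORITY: List[str] = ["notification", "plan", "flow"]


def matched_keys(title: str) -> List[str]:
    """Key characters from the table that occur in the title string."""
    return [key for key in KEY_TO_TYPE if key in title]


def type_by_priority(title: str) -> Optional[str]:
    """Pick the matched title type with the highest preset priority."""
    types = {KEY_TO_TYPE[key] for key in matched_keys(title)}
    for title_type in TYPE_PRIORITY:
        if title_type in types:
            return title_type
    return None


def type_by_position(title: str) -> Optional[str]:
    """Pick the type of the key character that ends closest to the end of the string."""
    keys = matched_keys(title)
    if not keys:
        return None
    best = max(keys, key=lambda key: title.rfind(key) + len(key))
    return KEY_TO_TYPE[best]


title = "notification on the work plan"
by_priority = type_by_priority(title)
by_position = type_by_position(title)
```

For this title the two strategies disagree: the priority rule yields "notification", while the position rule yields "plan" because "work plan" ends the string. Which rule suits a given document corpus is a design choice.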
In other alternative embodiments, the set of title name paragraphs may also correspond to multiple title types. In subsequent steps, content templates may be recommended to the user according to each corresponding title type.
In normal document editing, one title name should be edited within a single paragraph; even if the line wraps, no new paragraph needs to be added. However, when a user edits a document, the same title name may be split into multiple paragraphs by carriage returns, out of habit or for other reasons.
To accommodate such editing and improve the user experience, in an alternative embodiment, a title name is allowed to span a plurality of paragraphs; for example, an overly long title name may cause a line break or an added paragraph. The main title and the subtitle may likewise each span a plurality of paragraphs. Therefore, when a title name paragraph set is processed, the paragraphs in the set need to be spliced back together to form a complete character string.
For ease of understanding, a specific example is provided below. One title is schematically shown below.
About issuing "xx teaching course courseware and experience summary" to
Notification of learning summary by each teacher
Since "Notification of learning summary by each teacher" has been split into a separate paragraph, to be compatible with this editing situation, these 2 paragraphs need to be concatenated when the title type is determined for this title name, giving "notification about issuing "xx teaching course courseware and experience summary" to each teacher for learning summary", which facilitates subsequent operations.
Further, as can be seen from the above title example, titles or references of other files may exist in the title content, which may result in an erroneous determination of the title type of the title name.
Therefore, the titles or references of other files in the splicing result can be deleted, thereby improving the accuracy of the title types.
In summary, in an optional embodiment, preprocessing the title name paragraph set to obtain the title character string may include: splicing the characters contained in the title paragraphs in the title name paragraph set, and deleting preset characters from the splicing result to obtain the title character string.
Optionally, the preset characters may include: blank characters; and/or characters between preset symbols.
The preset symbols may include at least one of the following: brackets, book title marks, and quotation marks. The brackets may include square brackets, large brackets (braces), middle brackets, small brackets (parentheses), and the like. Optionally, the preset symbols themselves may also be deleted.
In this embodiment, the accuracy of identifying the title type can be improved by preprocessing. And the preset characters are deleted, so that the accuracy rate of identifying the title types can be further improved.
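The preprocessing step can be sketched with regular expressions. This is an illustrative sketch under assumptions: the list of symbol pairs in `PAIRS` and the function name `preprocess` are not from the source, and the sketch deletes the symbols together with the enclosed characters.

```python
import re
from typing import List

# Assumed preset symbol pairs: brackets, book title marks, and quotation marks.
PAIRS = [
    ("(", ")"), ("[", "]"), ("{", "}"),
    ("（", "）"), ("【", "】"),
    ("《", "》"), ("“", "”"), ('"', '"'),
]


def preprocess(paragraph_texts: List[str]) -> str:
    """Splice the title paragraphs, then delete the preset characters."""
    # Splice first, so a quoted name split across paragraphs is still matched.
    joined = "".join(paragraph_texts)
    for open_sym, close_sym in PAIRS:
        pattern = re.escape(open_sym) + r".*?" + re.escape(close_sym)
        joined = re.sub(pattern, "", joined, flags=re.DOTALL)
    # Delete blank characters.
    return re.sub(r"\s+", "", joined)


# Hypothetical title whose quoted document name is split across two paragraphs.
result = preprocess(["Notice on issuing 《xx course", "ware summary》 to all teachers"])
```

Deleting all blank characters suits text in which spaces carry no meaning, as in Chinese titles; for space-delimited languages an implementation would probably collapse whitespace instead. Unmatched opening or closing symbols are left in place by this sketch.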
After the preprocessing, it is further possible to determine whether any key character in the key character set is contained in the title character string.
Alternatively, a set of key characters and a title type corresponding to each key character in the set of key characters may be preset and stored.
For example, the set of key characters may include: { "report of job", "meeting record", "survey report", "work plan", "assessment standard", "index system", "selection activity", "policy interpretation", "public guide", "work arrangement", "notice", "work dynamics", "bidding procedure", "work flow", "important approval", "work order", "work information", "project design", "project conclusion", "evaluation criteria", "data compilation", "technical parameters", "propaganda slogan", "listing list title", "intention contract", "cost research", "propaganda slogan", "transaction procedure", "instruction content", "implementation scheme", "reference data", "teaching objective" }.
And the title types may include: { notify, plan, decide, list, criteria, flow, specification, truth }.
The key character set and the title types above are merely exemplary and do not limit the scope of the method flow.
Alternatively, each key character may correspond to one title type, different key characters may correspond to different title types, or there may be a plurality of key characters corresponding to the same title type.
For example, the key characters "work notification", "vacation notification", "notice book", "price reduction notification", etc. may all correspond to the same title type "notification".
For ease of understanding, a specific example is provided below. One title is schematically shown below.
About issuing "xx teaching course courseware and experience summary" to
Notification of learning summary by each teacher
After preprocessing, the title character string "notification about issuing to each teacher for learning summary" can be obtained. Then, according to the key character "notification" contained in the title character string, the title type corresponding to the title can be determined to be "notification".
Furthermore, in an alternative embodiment, the title type may also be determined only for the title paragraphs that are determined to be main titles, since in general, subtitles are an interpretation and complement of main titles and generally do not have a title type.
For example, a subtitle may be "in the example of xx teacher demonstration experience", which contains no key characters and therefore has no title type.
Therefore, in order to save computational resources and improve efficiency, the set of title name paragraphs determined to be the main title may optionally be preprocessed. For the explanation of the preprocessing, reference may be made to the above embodiments, so that the title type corresponding to the main title may be further determined based on the title character string obtained through the preprocessing.
In an alternative embodiment, determining the title name paragraph set meeting the preset main title condition as the main title may specifically include determining each title paragraph in that title name paragraph set as the main title, which facilitates typesetting the title paragraphs determined as the main title.
Correspondingly, optionally, in a case that the set of title name paragraphs does not meet the preset main title condition, the set of title name paragraphs may be determined as a subtitle, and specifically, the determining may include determining each title paragraph in the set of title name paragraphs that does not meet the preset main title condition as a subtitle.
After determining the title type corresponding to the main title, in an alternative embodiment, the recommended template needs to be determined according to the title type, and the typesetting format needs to be determined according to the main title. Therefore, in order to facilitate the execution of the subsequent steps, a set of title name paragraphs determined to be a main title may be marked.
Optionally, determining the title name paragraph set as the main title may include: marking the title paragraphs in the title name paragraph set determined as the main title according to the main title and the title type corresponding to that title name paragraph set.
In this embodiment, marking the title paragraphs can facilitate determining whether the title paragraphs belong to the main title, and can further facilitate determining the title types corresponding to the title paragraphs, thereby facilitating typesetting for the title paragraphs and improving the recognition efficiency.
Correspondingly, after determining the title type corresponding to each title name paragraph set, in a case that the title name paragraph set does not meet the preset main title condition, the title name paragraph set may be determined as a sub-title, and particularly, the title paragraphs in the title name paragraph set may be marked according to the sub-title and the title type corresponding to the title name paragraph set.
In this embodiment, marking the title paragraphs can facilitate determining whether the title paragraphs belong to subtitles, and further facilitate determining the title types corresponding to the title paragraphs, thereby facilitating typesetting for the title paragraphs and improving recognition efficiency.
In the case where the electronic device for editing the document to be recognized locally executes the process of the method, the typesetting format can be automatically adjusted directly based on the recognition result.
In a case where the process of the method is executed by another electronic device interacting with the electronic device editing the document to be recognized, in an optional embodiment, after all titles are determined for the document to be recognized, the recognition result may be further returned to the electronic device editing the document to be recognized, and of course, the recognition result may be returned after the layout format is adjusted.
Optionally, the paragraph identifier corresponding to each identified title may be returned to adjust the layout for the paragraph corresponding to the received paragraph identifier. A particular paragraph identification may be a paragraph number.
Optionally, the layout of each identified title in the document to be identified may be adjusted, and the adjusted document to be identified may be returned.
Fig. 5 is a schematic diagram illustrating document interaction in a method for identifying a title according to an embodiment of the present invention.
The user can trigger the operation for identifying the title through the client aiming at the document to be identified, so that the client sends the document to be identified to the server. The client may specifically be a client of document editing software.
After the server identifies the title in the document to be identified, the server may return the paragraph identifier corresponding to the paragraph contained in the title to the client.
In this embodiment, by returning the recognition result to the client, the client can conveniently and automatically perform typesetting adjustment, so that manual adjustment of a user is avoided, user operation is facilitated, and user experience is improved.
Through the above method flow, the title paragraph set and the main title in it can be determined using the paragraph features of the title. The method tolerates the various editing mistakes a user may make and can identify titles under complex document editing conditions. It is not only fault-tolerant and convenient for users to edit with, but also makes it easy to automatically adjust the identified title to the correct typesetting format, thereby facilitating user operation and improving the user experience.
In addition, identifying the subtitles in the title helps adjust the typesetting format more accurately. The title type can also be determined from the title, so that the document typesetting or document content format corresponding to the title type can be recommended to the user, improving the user experience.
The title identified by the method flow can be used for the proofreading of the format of the subsequent administrative official document and the conversion of the administrative official document.
Specifically, when the administrative official document proofreading function provided by the software is used, the title in the document is identified through the above method flow, and whether the typesetting format of the identified title meets the specification is further checked. If the result meets the regulation, returning the result passing the proofreading; if the title layout format does not meet the regulation, the result of the failed proofreading can be returned, or the layout format of the identified title is further automatically adjusted, so that the adjusted layout format meets the regulation.
Alternatively, when the administrative official document conversion function provided by the software is used, the title in the document is identified through the above method flow, and the typesetting format of the identified title is further automatically adjusted to meet the specified typesetting format.
In addition, one or more titles may be contained in the document to be identified, and different titles may correspond to different body parts. For example, when a novel or magazine is edited in a document, it may typically contain a plurality of articles, and different articles may contain different titles and corresponding different texts.
For convenience of description, a title and corresponding text in the document to be recognized are collectively referred to as a sub-document. Thus, one or more subdocuments may be included in the document to be identified.
For example, a plurality of chapters may be edited in one document, each chapter content including a chapter title and a chapter body, and each chapter content may be considered a subdocument. The attachment content may also be edited in the document, which may be considered a sub-document.
In the above method flow, S102 may identify one or more headings, i.e., a set of heading paragraphs, in the document to be identified.
Since each sub-document includes a title, in an alternative embodiment, the sub-documents in the document to be identified may be identified first, and then further identification may be performed for the title of each sub-document.
Alternatively, S103 and S104 may be specifically executed for the identified header portion, i.e., the header paragraph set, of each sub-document, and the main header of each sub-document is determined. Correspondingly, subtitles may also be determined. For details, reference may be made to the above explanations of the process flows.
For example, when it is identified that the document to be identified only includes one sub-document, the main title of the single sub-document may be determined by using the above method flow for the title of the identified single sub-document, or the sub-title or the title type of the single sub-document may be further determined, so as to recommend the content template for the single sub-document.
When the document to be identified comprises a plurality of sub-documents, the main title of each sub-document can be determined by the method flow aiming at the title of each identified sub-document, and the sub-title or the title type of each sub-document can be further determined so as to recommend a content template aiming at each sub-document.
For convenience of understanding, the embodiment of the invention also discloses an application embodiment.
As shown in fig. 6, fig. 6 is a flowchart of another method for identifying a title according to an embodiment of the present invention.
S201: identify title areas. Specifically: identify the sub-documents in the user document, and acquire a list of the paragraph areas of attachments, titles, and texts.
S202: collect a list of title name paragraph areas. Specifically: traverse the paragraphs in the title paragraph areas, merge paragraphs that have the same font size, the same alignment, and consecutive paragraph numbers into one title name paragraph area, and finally obtain the list of title name paragraph areas under each sub-document.
S203: divide the main title and the subtitles. Specifically: traverse the identified sub-documents and identify the title type corresponding to each title name in each sub-document. Each paragraph in the paragraph area of the first title name of each sub-document is labeled "main title_title type", and the others are labeled "subtitle_title type".
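The labeling in S203 can be sketched as follows, assuming each title name paragraph area is a (first paragraph, last paragraph, title type) tuple with `None` standing for "no title type". The function name, the tuple layout, and the sample paragraph numbers are illustrative assumptions.

```python
from typing import Dict, List, Optional, Tuple

# (first paragraph number, last paragraph number, title type or None)
TitleName = Tuple[int, int, Optional[str]]


def label_paragraphs(sub_documents: List[List[TitleName]]) -> Dict[int, str]:
    """Label every title paragraph as 'main title_<type>' or 'subtitle_<type>'."""
    labels: Dict[int, str] = {}
    for title_names in sub_documents:
        for index, (start, end, title_type) in enumerate(title_names):
            # The first title name of each sub-document is the main title.
            role = "main title" if index == 0 else "subtitle"
            for number in range(start, end + 1):
                labels[number] = f"{role}_{title_type or 'none'}"
    return labels


# Hypothetical input: two sub-documents, the second with two title names.
labels = label_paragraphs([
    [(5, 6, "notification")],
    [(12, 13, None), (15, 16, None)],
])
```

Returning a mapping keyed by paragraph number keeps the result easy to send back to a client that adjusts the layout paragraph by paragraph.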
To facilitate further understanding, the embodiment of the invention also provides the following specific application embodiments.
Application embodiment one.
As shown in fig. 7, fig. 7 is a schematic diagram of a document to be identified according to an embodiment of the present invention. The document contains 30 paragraphs, some of which are labeled with paragraph numbers in the figure. Some text paragraphs are omitted.
For the document to be recognized, the acquired list of paragraph areas of attachments, titles, and texts is [[2,2, title], [4,4, text], [7,8, title], [9,15, text], [17,18, title], [20,30, text]].
The title areas of the sub-documents are [[[2,2, title]], [[7,8, title]], [[17,18, title]]]. Paragraph areas that have the same font size and alignment (centered) and consecutive paragraph numbers are merged into one title name, giving [[[2,2, XX company new edition promotion system and new edition performance calculation mode]], [[7,8, notification about printing 20YY year version 5 new edition promotion system]], [[17,18, 20YY year version 10 new edition performance calculation mode new rule and detailed rule]]].
Traversing the sub-documents and preprocessing the paragraph areas of the title names in each sub-document yields the title type corresponding to each title name: [[[2,2, XX company new edition promotion system and new edition performance calculation mode, none]], [[7,8, notification about printing, notification]], [[17,18, 20YY year version 10 new edition performance calculation mode new rule and detailed rule, none]]].
Each paragraph in the paragraph area of the first title name of each sub-document is marked as "main title_title type", and the others are marked as "subtitle_title type". Therefore, [[2, XX company new edition promotion system and new edition performance calculation mode, main title_none], [7, notification about printing "20YY year version 5 new edition, main title_notification], [8, promotion system", main title_notification], [17, 20YY year version 10 new edition performance calculation mode, main title_none], [18, new rule and detailed rule, main title_none]] can be obtained.
Application embodiment two.
As shown in fig. 8, fig. 8 is a schematic diagram of another document to be identified according to an embodiment of the present invention. The document contains 26 paragraphs, some of which are labeled with paragraph numbers in the figure. Some text paragraphs are omitted.
For the document to be recognized, the acquired list of paragraph areas of attachments, titles, and texts is [[1,2, title], [4,5, text], [8,8, title], [10,14, text], [16,16, attachments], [17,17, title], [19,26, text]].
The title areas of the sub-documents are [[[1,2, title]], [[8,8, title]], [[17,17, title]]]. Paragraph areas that have the same font size and alignment (centered) and consecutive paragraph numbers are merged into one title name, giving [[[1,2, notification about forwarding "XX middle and primary school "reading classic" activity implementation scheme"]], [[8,8, XX middle and primary school "reading classic" activity implementation scheme]], [[17,17, XX middle and primary school "reading classic" activity extracurricular reading recommendation directory]]].
Traversing the sub-documents and preprocessing the paragraph areas of the title names in each sub-document yields the title type corresponding to each title name: [[[1,2, notification about forwarding, notification]], [[8,8, XX middle and primary school activity implementation scheme, implementation scheme]], [[17,17, XX middle and primary school activity extracurricular reading recommendation directory, directory]]].
Each paragraph in the paragraph area of the first title name of each sub-document is marked as "main title_title type", and the others are marked as "subtitle_title type", giving [[1, notification about forwarding "XX middle and primary school "reading classic" activity, main title_notification], [2, implementation scheme", main title_notification], [8, XX middle and primary school "reading classic" activity implementation scheme, main title_implementation scheme], [17, XX middle and primary school "reading classic" activity extracurricular reading recommendation directory, main title_directory]].
Application embodiment three.
As shown in fig. 9, fig. 9 is a schematic diagram of another document to be identified according to an embodiment of the present invention. The document contains 20 paragraphs, some of which are labeled with paragraph numbers in the figure. Some text paragraphs are omitted.
For the document to be recognized, the acquired list of paragraph areas of attachments, titles, and texts is [[5,6, title], [8,10, text], [11,11, attachment segmentation], [12,13, title], [15,16, title], [18,20, text]].
The title areas of the 2 sub-documents are [[[5,6, title]], [[12,13, title], [15,16, title]]]. Paragraph areas that have the same font size and alignment (centered) and consecutive paragraph numbers are merged into one title name, giving [[[5,6, notification about XX scenic spot historical site character article painting collection activity]], [[12,13, XX scenic spot historical site character list], [15,16, listed according to partial historical materials, for reference only]]].
Traversing the sub-documents and preprocessing the paragraph areas of the title names in each sub-document yields the title type corresponding to each title name: [[[5,6, notification about XX scenic spot historical site character article painting collection activity, notification]], [[12,13, XX scenic spot historical site character list, none], [15,16, listed according to partial historical materials, for reference only, none]]].
Each paragraph in the paragraph area of the first title name of each sub-document is marked as "main title_title type", and the others are marked as "subtitle_title type".
This gives [[[5,6, notification about XX scenic spot historical site character article painting collection activity, main title_notification]], [[12,13, XX scenic spot historical site character list, main title_none], [15,16, listed according to partial historical materials, for reference only, subtitle_none]]].
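The application embodiments do not state how title areas are assigned to sub-documents. One reading that reproduces all three examples is that a title area following any non-title area (text or attachment) opens a new sub-document, while consecutive title areas stay together. The sketch below encodes only that inferred rule; the function name and the tuple layout are hypothetical, and the rule itself is an assumption rather than the patent's stated method.

```python
from typing import List, Optional, Tuple

# (first paragraph number, last paragraph number, kind of area)
Region = Tuple[int, int, str]


def title_areas_per_sub_document(regions: List[Region]) -> List[List[Region]]:
    """Group title areas by sub-document using the inferred rule."""
    result: List[List[Region]] = []
    previous_kind: Optional[str] = None
    for region in regions:
        kind = region[2]
        if kind == "title":
            # Assumed rule: a title after non-title content starts a new sub-document.
            if previous_kind != "title":
                result.append([])
            result[-1].append(region)
        previous_kind = kind
    return result


# Paragraph area list from application embodiment three.
regions = [
    (5, 6, "title"), (8, 10, "text"), (11, 11, "attachment segmentation"),
    (12, 13, "title"), (15, 16, "title"), (18, 20, "text"),
]
areas = title_areas_per_sub_document(regions)
```

For this input the sketch returns two sub-documents, the first with the title area 5 to 6 and the second with the title areas 12 to 13 and 15 to 16, matching the grouping listed in the embodiment.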
Corresponding to the method for identifying the title, the embodiment of the invention also provides an embodiment of a device for identifying the title.
As shown in fig. 10, fig. 10 is a schematic structural diagram of an apparatus for identifying a title according to an embodiment of the present invention. The apparatus may include the following modules.
The obtaining module 301 is configured to obtain a document to be identified.
The paragraph set determining module 302 is configured to determine, according to the first paragraph feature, title paragraph sets in the document to be identified; there are one or more title paragraph sets.
The title name determining module 303 is configured to merge paragraphs belonging to the same title name in a title paragraph set to obtain title name paragraph sets; there are one or more title name paragraph sets.
A main title determining module 304, configured to determine a title name paragraph set meeting a preset main title condition as a main title.
Optionally, the title name determining module 303 is configured to: collect the title paragraphs in the title paragraph set that have the same second paragraph feature and consecutive paragraph numbers as a title name paragraph set.
Optionally, the title name determining module 303 includes: the paragraph traversing sub-module 303a is configured to traverse the title paragraphs in the title paragraph set according to the paragraph numbers.
The collecting submodule 303b is configured to, in a case that a second paragraph feature of a currently traversed title paragraph is the same as that of a next title paragraph, collect the currently traversed title paragraph and the next title paragraph in a same title name paragraph set; in the event that the second paragraph feature of the currently traversed title paragraph is different from the next title paragraph, the currently traversed title paragraph is collected in a different set of title name paragraphs than the next title paragraph.
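A minimal sketch of how the collecting submodule could behave is given below. It assumes that the second paragraph feature is the pair (font size, alignment) and that each title paragraph is a dictionary with the hypothetical keys `number`, `font_size`, and `alignment`; none of these names come from the embodiment.

```python
from itertools import groupby


def split_by_feature(title_paragraphs):
    """Split one title paragraph set into title name paragraph sets.

    Neighbouring title paragraphs stay in the same set only while their
    second paragraph feature (assumed here: font size and alignment) is equal.
    """
    ordered = sorted(title_paragraphs, key=lambda p: p["number"])
    return [
        [p["number"] for p in group]
        for _, group in groupby(ordered, key=lambda p: (p["font_size"], p["alignment"]))
    ]


# Hypothetical title paragraph set with consecutive paragraph numbers.
paragraphs = [
    {"number": 12, "font_size": 22, "alignment": "center"},
    {"number": 13, "font_size": 22, "alignment": "center"},
    {"number": 14, "font_size": 16, "alignment": "center"},
]
name_sets = split_by_feature(paragraphs)
```

Here paragraphs 12 and 13 share the same feature and form one title name paragraph set, while paragraph 14 starts a new one.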
Optionally, the main title determining module 304 includes: the traversal submodule 304a is configured to traverse the title name paragraph set and determine whether the title name paragraph set meets a preset main title condition.
The condition submodule 304b is configured to determine the title name paragraph set as the main heading if the title name paragraph set meets a preset main heading condition.
Optionally, the traversal submodule 304a includes: a condition judgment submodule 304a1, configured to determine that the title name paragraph set meets the preset main title condition when the title paragraph with the smallest paragraph number is among the title paragraphs in the title name paragraph set, and to determine that the title name paragraph set does not meet the preset main title condition when the title paragraph with the smallest paragraph number is not among them.
Optionally, the apparatus for identifying a title further comprises: a subtitle determining module 305, configured to determine the title name paragraph set as a subtitle if the title name paragraph set does not meet a preset main title condition.
Optionally, the main title determining module 304 further includes: a preprocessing submodule 304c, configured to preprocess the title name paragraph set to obtain a title character string.
The title type determining submodule 304d is used for determining the title type according to the key characters contained in the title character string under the condition that the title character string contains any key character in the key character set; the title types correspond to the title name paragraph sets one by one, and any key character in the key character set corresponds to one title type.
Optionally, the preprocessing submodule 304c is configured to splice characters included in title paragraphs in the title name paragraph set, and delete preset characters according to a splicing result, so as to obtain a title character string.
Optionally, the preset characters include: blank characters; and/or characters between preset symbols.
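The preprocessing and title type lookup described above might look like the following sketch. The key character table `KEY_CHARACTERS`, the choice of parentheses as the preset symbols, and the decision to delete the symbols together with the characters between them are all assumptions made for illustration; the fallback type "none" mirrors the earlier example.

```python
import re

# Hypothetical key characters, each mapped to one title type.
KEY_CHARACTERS = {"Notice": "notification", "Report": "report"}


def preprocess(title_paragraphs, open_symbol="(", close_symbol=")"):
    """Splice the title paragraphs and delete the preset characters."""
    joined = "".join(title_paragraphs)
    # Delete characters between the preset symbols (symbols included, by assumption).
    enclosed = re.escape(open_symbol) + r".*?" + re.escape(close_symbol)
    without_enclosed = re.sub(enclosed, "", joined)
    # Delete blank characters.
    return re.sub(r"\s+", "", without_enclosed)


def title_type(title_string):
    """Return the title type of the first key character found, else 'none'."""
    for key, kind in KEY_CHARACTERS.items():
        if key in title_string:
            return kind
    return "none"


title_string = preprocess(["Notice on the ", "XX activity (draft)"])
kind = title_type(title_string)
```

Deleting blank characters before matching means a key character that was split across two title paragraphs or interrupted by spaces can still be found in the spliced string.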
Optionally, the condition submodule 304b includes: a main title marking submodule 304b1, configured to mark the title paragraphs in the title name paragraph set according to the main title and the title type corresponding to the title name paragraph set.
Optionally, the main title determining module 304 further includes: the subtitle marking sub-module 304e is configured to mark title paragraphs in the title name paragraph set according to the subtitle and the title type corresponding to the title name paragraph set when the title name paragraph set does not meet the preset main title condition.
Optionally, the second paragraph feature comprises: font size and/or alignment.
Optionally, the paragraph set determining module 302 includes: the paragraph determining sub-module 302a is configured to determine one or more title paragraphs in the document to be identified according to the first paragraph feature.
A paragraph collection submodule 302b for traversing the title paragraphs; under the condition that the paragraph number of the currently traversed title paragraph is continuous with the paragraph number of the next title paragraph, collecting the currently traversed title paragraph and the next title paragraph in the same title paragraph set; in the case where the paragraph number of the currently traversed title paragraph is not consecutive to the paragraph number of the next title paragraph, the currently traversed title paragraph is collected in a different set of title paragraphs from the next title paragraph.
For an explanation of the above-described apparatus embodiments reference is made to the above-described method embodiments.
For the document to be recognized, in addition to the title part of the document, there may also be titles within the body text. Specifically, these may include subtitles in the body text.
Therefore, on the basis of the above embodiments for identifying the title, the present specification embodiment also provides an embodiment for identifying the hierarchical title in the text.
Currently, when a user edits a document with document editing software, the user may need to edit documents that have a fixed typesetting format, such as administrative official documents.
In the body of an administrative official document, the hierarchical titles have a fixed typesetting format. For example, the sequence numbers of the structural levels in the body may be marked with "one,", "(one)", "1.", and "(1)" in turn; generally, the first level is set in boldface (HeiTi), the second level in regular script (KaiTi), and the third and fourth levels in imitation Song (FangSong).
When editing a document, a user typically edits hierarchical title content in a body using the format of the body. Specifically, after the hierarchical title content is edited in the text format, the font of the hierarchical title content is adjusted.
A specific example is shown in fig. 11. Fig. 11 is a schematic diagram of an example of hierarchical title editing provided in an embodiment of the present invention.
Users often need to manually adjust the typesetting format of hierarchical title content in a document, which is cumbersome and error-prone. For example, a higher-level title may not use the prescribed "one,", "(one)", "1.", or "(1)" marker; the font of a hierarchical title may be adjusted incorrectly; or numbers at the same level may use different formats. Ideally, hierarchical titles would be identified by software and typeset automatically, but current document editing software typically does not support identifying hierarchical titles.
For ease of understanding, fig. 12 is a schematic diagram of an example of correctly typeset hierarchical titles according to an embodiment of the present invention. In fig. 12, the hierarchical titles edited in fig. 11 have been manually modified by the user according to the typesetting format specification for hierarchical titles described above. It can be seen that the font of the primary title has been changed to boldface, while the font of the secondary title remains regular script.
If the user relies on the intelligent identification function of the document editing software, or on another intelligent identification plug-in, the hierarchical titles in the body text cannot be identified. Because the user edited the hierarchical title content in the body format, the hierarchical titles are identified as body text and adjusted as body text, so the typesetting format of the hierarchical title content cannot be corrected.
Fig. 13 is a schematic diagram of an example of incorrectly typeset hierarchical titles according to an embodiment of the present invention. Fig. 13 shows the layout that results when the user applies an intelligent identification plug-in to the hierarchical titles edited in fig. 11 and the plug-in identifies them as body text. It can be seen that the font of the hierarchical titles has not been modified and is still regular script, which does not conform to the typesetting format specification for administrative official documents.
In order to solve the above problems, the present invention provides a method of identifying a hierarchical title. The method can be applied to document editing software and other software.
In the method for identifying hierarchical titles provided by the invention, hierarchical titles can be identified according to their characteristics in the document. Specifically, the text paragraphs in the document are determined, the hierarchical structure within the text paragraphs is then determined, and the hierarchical titles are determined according to that hierarchical structure. The hierarchical titles can then be conveniently and automatically adjusted to the correct typesetting format, which simplifies operation for the user and avoids errors.
A method for identifying a hierarchical title according to an embodiment of the present invention is described in detail below with reference to specific embodiments.
Before describing the method in detail, first, the concepts involved in the process flow of the method are explained.
Document: an object edited by the user in document editing software. It may be composed of characters and may include at least one paragraph. Paragraphs can be delimited by carriage returns.
It should be noted that there is an order between paragraphs in a document, and that there is a corresponding paragraph number for each paragraph. Specifically, the document may start from the first paragraph of the document head, and the paragraph number is sequentially incremented from 1 to the last paragraph of the document.
In addition, a title and a body may be included in the document. Of course, the number of titles and bodies in a document is not limited, and a particular document may include multiple titles and multiple bodies. For example, the plurality of titles are the titles of the first chapter, the second chapter, the tenth chapter of the document, respectively, and there may be a corresponding body for each chapter.
The document may further include a hierarchical title in the body. For example, when editing a text, the hierarchical titles in the text may be marked with numbers of one, two, three, 1,2, 3, etc., to distinguish different hierarchical levels.
Subdocuments: since a document may contain a plurality of titles and a plurality of texts, for convenience of description, a title and a corresponding text in the document to be identified are collectively referred to as a sub-document. Thus, one or more subdocuments may be included in the document to be identified.
For example, a plurality of chapters may be edited in one document, each chapter content including a chapter title and a chapter body, and each chapter content may be considered a subdocument. The attachment content may also be edited in the document, which may be considered a sub-document.
Hierarchical title: a title in the body that is marked with a number. Hierarchical titles may include primary titles, secondary titles, tertiary titles, and so on.
The titles of different levels have a hierarchical relationship, the level of the first-level title is higher than that of the second-level title, the level of the second-level title is higher than that of the third-level title, and the like. In an alternative embodiment, the hierarchical headers may include only primary, secondary, tertiary, and quaternary headers.
In addition, the hierarchical title in the body may correspond to a part of the content in the body. And the main body part content corresponding to one or more secondary titles can be contained in the main body part content corresponding to a single primary title.
For example, paragraphs in the body may be marked with numbers in the format one, two, three as primary titles, dividing the body into different parts. Within the body part corresponding to a single primary title, numbers in another format can be used for further division; specifically, paragraphs may be marked with numbers in the format (one), (two), (three) as secondary titles, dividing the body part corresponding to the primary title into different parts. By analogy, tertiary and quaternary titles can be obtained.
When different parts in the body are divided by the hierarchical titles, the divided body parts may be determined as body parts corresponding to the hierarchical titles.
In a specific example, the body may contain a first chapter, a second chapter, and a third chapter, which are primary titles. The body content corresponding to the first chapter may include a first section, a second section, and a third section, and so may the body content corresponding to the second chapter. The first, second, and third sections may all be secondary titles.
It should be noted that, since a hierarchical title contained in the body is not a document title in the strict sense, the user may in practice edit the hierarchical title and the corresponding body content in the same paragraph. For convenience of description, a paragraph containing both a hierarchical title and body content is referred to as a title body. Therefore, the specific content form of a hierarchical title may be a title or a title body.
Numbering format: the form in which a number is written when it is edited.
The numbering format may include: the language form of the number, whether the number carries parentheses, any fixed pattern of the number (chapter i, 1.1), and so on. Specifically, the language form of a number may be Chinese characters, digits, letters, and so on. Examples include one, (1), 1, 1.1, and chapter i.
For example, the first, second, and third have the same numbering format: all are in Chinese character form and carry no parentheses. (1), (2), and (3) may be numbers of the same format; 1.2, 2.1, and 3.4 may be numbers of the same format; one, two, and three may be numbers of the same format; the first chapter, the second chapter, and the third chapter may be numbers of the same format; and 1, 2, and 3 may be numbers of the same format.
In this document, hierarchical titles of the same hierarchy level may contain numbers in the same format, and hierarchical titles of different hierarchy levels may contain numbers in different formats for distinction.
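One possible way to test whether two numbers share a numbering format is to reduce each number to a canonical representative. The sketch below is an assumption made for illustration, not the method of the embodiment: it handles only Arabic digits, and the function name `numbering_format` is hypothetical. Chinese-character or letter numbering would need additional rules.

```python
import re


def numbering_format(number):
    """Replace every run of digits with '1' so that numbers of the same format compare equal."""
    return re.sub(r"\d+", "1", number.strip())


same_dotted = numbering_format("1.2") == numbering_format("3.4")
same_bracketed = numbering_format("(2)") == numbering_format("(3)")
different = numbering_format("(1)") == numbering_format("1")
```

Under this rule, 1.2 and 3.4 both reduce to 1.1, and (2) and (3) both reduce to (1), whereas (1) and 1 remain distinct formats.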
Hierarchical structure: specifically, the structure may be used to characterize the hierarchical relationship between hierarchical titles in the text. The format of the numbers contained in the hierarchical titles and the hierarchical relationship among the hierarchical titles can be contained.
Wherein the numbering format can be used to characterize the corresponding hierarchical title in the body.
In order to characterize the hierarchical relationship between the hierarchical titles, the hierarchical structure may be sorted according to the hierarchical relationship, or the hierarchy corresponding to the hierarchical title itself (e.g., the first hierarchical level corresponding to the first hierarchical title, and the second hierarchical level corresponding to the second hierarchical title) may be directly corresponding to the numbering format in the hierarchical structure.
Several specific examples of hierarchies are given below, wherein the hierarchies may be arranged from high to low according to the hierarchical relationship, the hierarchies may be indicated using brackets, and the numbering formats in the hierarchies may be separated using commas, and the hierarchies may be gradually lowered from left to right. For example, [ one, (one), 1, (1) ], [1, 1.1, 1.1.1], [ chapter i, section i, bar i ] or [ chapter i, 1.1, 1.1.1 ].
Of course, the present description does not limit the specific form of the hierarchical structure. The hierarchical structure may also state the levels explicitly, for example [level one: one, level two: (one), level three: 1, level four: (1)].
Paragraph features: features of a paragraph itself, such as alignment, font size, maximum font size, minimum font size, font style, paragraph content, whether a number is included, the format of the included number, whether the text is bold, and the like. Paragraph features can help determine whether a paragraph belongs to a title or to the body.
Fig. 14 is a flowchart illustrating a method for identifying a hierarchical title according to an embodiment of the present invention. The method includes the following steps.
S401: Acquire the document to be identified.
Alternatively, the document to be identified may be any document, and for convenience of description, the document used for identifying the hierarchical heading is referred to as the document to be identified.
In an alternative embodiment, the electronic device for editing the document to be recognized may execute the method flow locally and not interact with other devices. In a specific example, the method flow may be implemented by local logic included in a document editing software client.
In another alternative embodiment, other electronic devices interacting with the electronic device editing the document to be recognized may also perform the method flow. In a specific example, the method flow may be implemented by a server interacting with a document editing software client.
Therefore, in an alternative embodiment, the client may send any document to the server based on the operation of the user, that is, the server receives the document to be identified sent by the client, so that the server identifies the hierarchical title in the document to be identified and returns the identification result to the client. The recognition result may be a document containing a title label, or may be a paragraph identifier recognized as a hierarchical title, and the paragraph identifier may specifically be a paragraph number.
In addition, in an alternative embodiment, sub-documents may first be identified in a document, and any identified sub-document may then be taken as the document to be identified; the following steps then determine the text paragraph set of that sub-document.
Since a sub-document may include only one body, there is only one set of body paragraphs.
S402: Determine a text paragraph set in the document to be recognized according to the third paragraph feature.
Optionally, the third paragraph feature may be a paragraph feature that helps determine the body text. Specifically, it may include: whether a number is included, the font size, whether the text is bold, the alignment, and the like.
The determined set of text paragraphs may be one or more. Of course, in the case where the document to be identified is a sub-document, the determined set of text segments is one.
In an alternative embodiment, specifically, the probability that each paragraph in the document to be recognized belongs to the body text may be determined based on the third paragraph feature. Specifically, the probability that the paragraph belongs to the text may be output by using a machine learning method for the third paragraph feature of the input paragraph.
Further, the paragraphs whose probability of belonging to the body text is greater than a preset probability can be determined, and the maximum font size of the body text can be determined from those paragraphs. All paragraphs in the document to be identified whose font size is smaller than the maximum font size of the body text are then identified as text paragraphs.
Optionally, a text paragraph in the document to be recognized may be determined according to the third paragraph feature; the determined text passage is one or more. Traversing the determined text paragraphs; under the condition that the paragraph sequence number of the currently traversed text paragraph is continuous with the paragraph sequence number of the next text paragraph, collecting the currently traversed text paragraph and the next text paragraph in the same text paragraph set; in the case that the paragraph number of the currently traversed text paragraph is not consecutive to the paragraph number of the next text paragraph, the currently traversed text paragraph and the next text paragraph are collected in different text paragraph sets.
One or more text paragraph sets may be collected by the above embodiments. A text paragraph set may include one text paragraph or a plurality of text paragraphs that are consecutive in paragraph number.
Consecutive paragraph numbers means that the paragraph numbers increase in order, by 1 each time. For example, 12, 13, 14 are consecutive paragraph numbers.
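The collection of text paragraphs into text paragraph sets by consecutive paragraph number can be sketched as follows. The function name `group_consecutive` and the representation of a paragraph by its number alone are assumptions for illustration.

```python
def group_consecutive(paragraph_numbers):
    """Collect paragraph numbers into sets of consecutive numbers."""
    groups = []
    for number in sorted(paragraph_numbers):
        if groups and number == groups[-1][-1] + 1:
            # Consecutive with the previous paragraph: same set.
            groups[-1].append(number)
        else:
            # Not consecutive: start a different set.
            groups.append([number])
    return groups


text_sets = group_consecutive([12, 13, 14, 20, 21, 30])
```

The same grouping applies to the title paragraph sets collected by the paragraph collection submodule 302b, since both rely only on whether neighbouring paragraph numbers differ by 1.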
In this embodiment, subsequent steps can be conveniently performed by determining a text paragraph set. And the third paragraph characteristic is used for identification, so that the identification efficiency can be improved, and the identification accuracy can be improved.
In an alternative embodiment, the obtained document to be identified may be an already identified sub-document, whose text paragraphs were identified during the sub-document identification process. Therefore, when the document to be identified is a sub-document, the already identified text paragraphs can be used directly to obtain the text paragraph set.
In an alternative embodiment, the obtained document to be identified may also include one or more sub-documents that need to be divided. Therefore, the sub-documents in the document to be identified can be divided according to the paragraph characteristics, the number of the divided sub-documents can be one or more, and then the text paragraph set in each sub-document is determined.
This embodiment does not limit the method for dividing sub-documents. Optionally, the document to be identified may be divided into attachment documents according to preset attachment paragraph features; there may be one or more attachment documents.
For each attachment document, paragraphs in the attachment document that have the same third paragraph feature may be collected as a preliminary paragraph set; one or more preliminary paragraph sets may be collected.
For each preliminary paragraph set, take the paragraph with the smallest paragraph number and the paragraph with the largest paragraph number in the set, determine all paragraphs in the document to be identified that lie between these two paragraphs, and add the determined paragraphs to the preliminary paragraph set.
Merge all the preliminary paragraph sets in each attachment document, where the merging may include merging preliminary paragraph sets that intersect; after merging, no two different preliminary paragraph sets intersect.
Traverse the merged preliminary paragraph sets, and determine each currently traversed preliminary paragraph set as a preliminary title or a preliminary body.
Determine sub-documents according to the determined preliminary titles and preliminary bodies; one or more sub-documents may be determined. For each sub-document, collect the paragraphs determined as preliminary body in that sub-document to obtain a text paragraph set.
Optionally, the preset attachment paragraph feature may be used to characterize an attachment paragraph, that is, a paragraph that denotes an attachment; specifically, it may be a paragraph containing an attachment keyword. The attachment keywords may include "attachment", "attached drawing", and the like.
Optionally, the document to be identified is divided into the attachment documents according to the preset attachment paragraph characteristics, and the attachment paragraphs in the document to be identified may be determined according to the preset attachment paragraph characteristics, and then the document to be identified is divided into the attachment documents according to the attachment paragraphs.
Through the attachment paragraphs, the document to be identified can be initially divided into a plurality of document parts, so that the sub-documents can be obtained through further subsequent division.
Of course, if the document to be recognized does not contain an attachment paragraph, the document to be recognized may be regarded as an attachment document. If the document to be recognized contains an attachment paragraph, each divided document part can be determined as an attachment document according to the attachment paragraph.
Optionally, during the merging process, for each attachment document, each collected preliminary paragraph set may be traversed, and when the currently traversed preliminary paragraph set intersects any other preliminary paragraph set, the two intersecting sets may be merged. Because the new preliminary paragraph set obtained by merging may in turn intersect other preliminary paragraph sets, the new set also needs to be traversed, and it is determined again whether any other preliminary paragraph set intersects it; if so, they are merged.
For ease of understanding, in one specific example, [20,40] may represent a preliminary paragraph set containing paragraphs numbered 20 through 40, and [30,40] may represent a preliminary paragraph set containing paragraphs numbered 30 through 40. Since there is an intersection between the two sets of preliminary paragraphs, after the merging process, the new set of preliminary paragraphs [20,40] can be obtained.
Of course, the present embodiment does not limit the specific merging method, as long as there is no intersection between different preliminary segment sets after merging.
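Since the merging method is left open, one option is the standard sort-and-sweep interval merge shown below. It is a substitute for the traverse-and-recheck procedure described above, not a transcription of it, and it assumes each preliminary paragraph set is represented as a [smallest paragraph number, largest paragraph number] pair, as in the example. The function name `merge_intersecting` is hypothetical.

```python
def merge_intersecting(ranges):
    """Merge paragraph ranges that intersect, so that no two results overlap."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Intersects the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


merged = merge_intersecting([[20, 40], [30, 40], [50, 55]])
```

Sorting by the smallest paragraph number first means each range only has to be compared with the most recently merged one, which avoids re-traversing the sets after every merge.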
When determining whether a preliminary paragraph set is a preliminary title or a preliminary body, the maximum font size of the body text, obtained by the method described above, can be used.
Specifically, a preliminary paragraph set whose font size is larger than the maximum font size of the body text may be determined as a preliminary title, and a preliminary paragraph set whose font size is smaller than or equal to the maximum font size of the body text may be determined as a preliminary body.
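The font size comparison can be sketched as below. The sketch assumes that font sizes are numeric point sizes (a larger value means larger text) and that each preliminary paragraph set carries a single font size under the hypothetical key `font_size`; how a set with mixed font sizes is handled is not specified here.

```python
def classify_preliminary_sets(preliminary_sets, max_body_font_size):
    """Label each preliminary paragraph set as a preliminary title or body."""
    result = []
    for paragraph_set in preliminary_sets:
        # Larger than the maximum body font size: title; otherwise body.
        kind = "title" if paragraph_set["font_size"] > max_body_font_size else "body"
        result.append((paragraph_set["range"], kind))
    return result


# Hypothetical preliminary paragraph sets with point sizes.
sets = [
    {"range": (1, 2), "font_size": 22},
    {"range": (3, 10), "font_size": 16},
]
classified = classify_preliminary_sets(sets, max_body_font_size=16)
```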
Optionally, a sub-document may then be determined based on the determined preliminary body and preliminary header.
The body and the title first need to be determined from the determined preliminary bodies and preliminary titles. Specifically, for each attachment document, all the preliminary paragraph sets it contains are sorted by paragraph number; preliminary paragraph sets that are consecutive in the sorted order and are all determined as preliminary bodies are merged to obtain a body part, and preliminary paragraph sets that are consecutive in the sorted order and are all determined as preliminary titles may likewise be merged to obtain a title part.
Since the sub-document generally contains a body and a header, and the header generally precedes the body, a header portion and a body portion which are continuous may be optionally determined as a sub-document in the order from front to back based on the paragraph number of each attached document.
Of course, a body portion may also be directly determined as a sub-document if there is no corresponding header portion.
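The assembly of sub-documents from title parts and body parts might be sketched as follows. The input layout ((start, end), kind), the function name `build_sub_documents`, and the decision to discard a trailing title part that has no following body part are assumptions made for this example.

```python
from itertools import groupby


def build_sub_documents(classified_sets):
    """Merge neighbouring sets of the same kind, then pair each title part with the next body part."""
    ordered = sorted(classified_sets, key=lambda item: item[0])
    parts = []
    for kind, group in groupby(ordered, key=lambda item: item[1]):
        ranges = [span for span, _ in group]
        parts.append((kind, (ranges[0][0], ranges[-1][1])))

    sub_documents = []
    pending_title = None
    for kind, span in parts:
        if kind == "title":
            pending_title = span
        else:
            # A body part without a preceding title part forms a sub-document alone.
            sub_documents.append({"title": pending_title, "body": span})
            pending_title = None
    return sub_documents


classified = [
    ((1, 2), "title"), ((3, 10), "body"), ((11, 15), "body"),
    ((16, 16), "title"), ((17, 30), "body"),
]
subs = build_sub_documents(classified)
```

In this example the two neighbouring body sets are merged into one body part covering paragraphs 3 to 15, and two sub-documents result.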
Alternatively, after determining the sub-document, the body part therein may be directly determined as the body paragraph set.
S403: an alternative hierarchy is determined from the set of text paragraphs.
The determined alternative hierarchies may be one or more.
Since one or more text paragraph sets in the document to be recognized may be determined in S402, optionally, S403-S405 may be performed for each text paragraph set to determine a text hierarchy corresponding to each text paragraph set.
Specifically, during explanation, the flow of the method only explains the embodiment of executing S403-S405 on a single body paragraph set, and other body paragraph sets can obtain the body hierarchy structure through the same embodiment.
According to the number contained in the text paragraph set and the relationship between the text paragraphs, one or more hierarchical relationships can be determined, and further one or more hierarchical structures can be determined. Specific methods are explained later.
Alternatively, the determined hierarchy may be determined directly as an alternative hierarchy. In particular, the alternative hierarchies may be determined from the text paragraph set, and the determined alternative hierarchies may be one or more.
Optionally, a eligible hierarchy in the determined hierarchy may also be determined as an alternative hierarchy according to a preset set of hierarchies. The determined alternative hierarchies may be one or more.
In particular, if a preset hierarchy set contains any hierarchy determined, the hierarchy may be determined as an alternative hierarchy.
Wherein the preset hierarchy set may contain one or more preset hierarchies.
Because the hierarchical structure and the numbering format which do not accord with the specification may appear in the document editing process, for example, chapter a, chapter 1, and the like, the hierarchical structure which accords with the specification can be preset to be used as a preset hierarchical structure set, so that the text hierarchical structure which accords with the specification can be identified and determined conveniently, and the user experience is improved.
Of course, alternatively, the preset hierarchy set may include a hierarchy with a numbering format that is in compliance with the specification, or may include a hierarchy predetermined by the service person.
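Filtering the determined hierarchies against a preset hierarchy set can be sketched as a simple membership test. The contents of `PRESET_HIERARCHIES` and the use of strings such as "one" and "1.1" as numbering format labels are hypothetical and chosen only to mirror the examples above.

```python
# Hypothetical preset hierarchy set; each hierarchy lists numbering formats from high to low level.
PRESET_HIERARCHIES = {
    ("one", "(one)", "1", "(1)"),
    ("1", "1.1", "1.1.1"),
}


def select_alternatives(determined, preset=PRESET_HIERARCHIES):
    """Keep only the determined hierarchies that appear in the preset hierarchy set."""
    return [hierarchy for hierarchy in determined if tuple(hierarchy) in preset]


determined = [["one", "(one)", "1", "(1)"], ["chapter a", "1"]]
alternatives = select_alternatives(determined)
```

The non-conforming hierarchy beginning with "chapter a" is filtered out, while the conforming one is kept as an alternative hierarchy.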
S404: in the determined alternative hierarchical structure, a body hierarchical structure of the document to be identified is determined.
Optionally, the alternative hierarchies may be determined from the numbers and hierarchical relationships of the text paragraphs, and may therefore contain misidentified hierarchical relationships, incomplete hierarchies, or hierarchies that conflict with one another. In particular, the same numbering format may correspond to different levels in different alternative hierarchies.
For example, a text paragraph set may include the following numbers in order: chapter one, 1, 2, 3, chapter two, one, two, three. Two hierarchies can then be obtained, [chapter i, 1] and [chapter i, one], in which the second level corresponds to different numbering formats.
Optionally, in order to determine unique hierarchical titles in the text paragraph set, the correct hierarchy needs to be determined from the alternative hierarchies. For convenience of description, this correct hierarchy is referred to as the text hierarchical structure of the document to be identified; it may also be the text hierarchical structure corresponding to the text paragraph set.
S405: Determine hierarchical titles according to the text hierarchical structure of the document to be identified.
Optionally, since the text hierarchy includes the numbering formats and their corresponding levels, the numbering formats in the text hierarchy can be read off directly, and a paragraph whose number has the same format as any numbering format in the text hierarchy can be determined as a hierarchical title.
In particular, the level of a hierarchical title may be the same as the level that its numbering format corresponds to in the body hierarchy.
For example, if the "first chapter" numbering format corresponds to level one in the body hierarchy, a paragraph containing that numbering format may be determined to be a level-one title or a level-one title body.
Through the text hierarchical structure, the paragraphs that are hierarchical titles and the levels of those titles can be determined conveniently, which improves the user's operating experience. The typesetting format to apply can then be chosen according to the level of each hierarchical title, for example by adjusting fonts.
In the above method flow, hierarchical titles are identified according to their characteristics in the document. Specifically, the text paragraphs in the document are determined, the hierarchical structure within those paragraphs is determined, and the hierarchical titles are determined according to that structure. The hierarchical titles can then be adjusted automatically to the correct typesetting format, which is convenient for the user and avoids errors.
In an optional embodiment, the above method for identifying hierarchical titles may be applied within the title identification method; specifically, the title identification method further includes steps S402 to S405.
The following is a detailed explanation with respect to S403.
After the text paragraph set in the document to be recognized is determined, the text paragraph set can be regarded as a whole text, which may include a hierarchical heading.
Before the hierarchical title is determined, the hierarchical relationship among different numbering formats can be determined according to the paragraphs containing numbers in the text paragraph set to obtain one or more alternative hierarchical structures, and after the text hierarchical structure is determined from the alternative hierarchical structures, the hierarchical title is determined according to the hierarchical relationship in the text hierarchical structure.
For example, the numbering format at the highest level of the hierarchical structure is determined as the numbering format that a first-level title must contain, and the numbering format at the next highest level is determined as the numbering format that a second-level title must contain. In this way the hierarchical structure can be used to determine the hierarchical titles.
In an alternative embodiment, the hierarchy may contain a single numbering format, or multiple numbering formats. Different numbering formats may correspond to different hierarchies.
Wherein a single numbering format can generally be determined directly as the highest-level numbering format. When a plurality of numbering formats exist, the hierarchical relationship among the plurality of numbering formats or the hierarchy corresponding to each of the plurality of numbering formats needs to be determined.
For example, in the hierarchy [one], "one" can be directly determined as the highest-level numbering format; the hierarchy [one, 1, (1)] contains 3 numbering formats, which from the highest level to the lowest are "one", "1" and "(1)".
Of course, in other alternative embodiments, the hierarchical structure may also include one or more numbered text paragraphs, where there is a hierarchical relationship between the numbered text paragraphs.
When the hierarchical structure is determined, the number formats contained in the text paragraph set can be obtained from the text paragraphs containing numbers in the text paragraph set, and then the hierarchical relationship among the number formats is determined, so that the hierarchical structure can be determined.
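The text does not say how a numbering format is recognized in a paragraph. As a minimal sketch, assuming the four formats correspond to the common Chinese official-document numbering styles and that a format can be detected from the paragraph prefix, the regular expressions and the names `NUMBER_FORMATS`, `detect_format` and `collect_formats` below are hypothetical and are not taken from the source:

```python
import re

# Hypothetical prefix patterns (an assumption, not given in the source text).
# Labels reuse the translated format names used in this description.
NUMBER_FORMATS = [
    ("one", re.compile(r"^[一二三四五六七八九十]+、")),
    ("(one)", re.compile(r"^[（(][一二三四五六七八九十]+[）)]")),
    ("1", re.compile(r"^\d+[.．](?!\d)")),  # "1." but not a decimal such as "1.5"
    ("(1)", re.compile(r"^[（(]\d+[）)]")),
]


def detect_format(paragraph):
    """Return the numbering format label of a paragraph, or None."""
    text = paragraph.strip()
    for label, pattern in NUMBER_FORMATS:
        if pattern.match(text):
            return label
    return None


def collect_formats(paragraphs):
    """Map 1-based paragraph sequence number -> numbering format label."""
    found = {}
    for index, text in enumerate(paragraphs, start=1):
        label = detect_format(text)
        if label is not None:
            found[index] = label
    return found


sample = [
    "一、总体要求",
    "（一）指导思想",
    "正文内容。",
    "1.加强管理",
    "(2) 落实责任",
    "1.5 倍的增长",
]
formats = collect_formats(sample)
```

Only paragraphs that start with a recognized number end up in `formats`; unnumbered body paragraphs and the decimal "1.5" are skipped.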
However, more than one hierarchical relationship may exist between the determined numbering formats. For example, the numbering format "one" may be one level higher than the numbering format "1", and the numbering format "one" may also be one level higher than the numbering format "(one)". In other words, the numbering format "1" and the numbering format "(one)" then occupy the same level, so it is difficult to determine a unique hierarchical structure.
Therefore, the determined hierarchies may be integrated as alternatives, and then the most likely alternative is selected from the alternatives as the finally determined hierarchy.
In an alternative embodiment, the predetermined method of determining the hierarchy may comprise: determining an alternative hierarchical structure based on the text paragraphs containing the numbers in the text paragraph set; a hierarchy is determined from the determined alternative hierarchies.
Of course, in alternative embodiments, a hierarchy may be randomly selected directly from a plurality of determined hierarchies, and need not be determined from an alternative hierarchy.
In an alternative embodiment, the alternative hierarchical structure may be determined as follows: determining the paragraphs containing numbers in the text paragraph set; determining the hierarchical relationships between the determined paragraphs; determining, according to the hierarchical relationships between the paragraphs, the hierarchical relationships between the formats of the numbers contained in the paragraphs; and determining the alternative hierarchical structure according to the hierarchical relationships between the determined numbering formats. The determined alternative hierarchical structure includes one or more levels, and different levels correspond to different numbering formats.
Since each hierarchical title may correspond to a part of the body, the body part corresponding to one hierarchical title can generally be further divided by hierarchical titles of the next level down. In other words, the body part corresponding to a higher-level title generally contains the body parts corresponding to the titles one level below it.
Therefore, in order to facilitate the definition of the hierarchical relationship between the numbering formats, the body part corresponding to the numbered paragraph may be determined first.
Optionally, for convenience of description, the body part corresponding to a paragraph containing a number is referred to as the cover paragraph set (also rendered as overlay paragraph set) of that paragraph.
The step of determining an alternative hierarchical structure may therefore include the following step of determining a cover paragraph set for each body paragraph containing a number.
Specifically: text paragraphs in the text paragraph set whose numbers have the same format are added to the same text paragraph subset, which yields one or more subsets. For each subset, the paragraphs in the subset are traversed in ascending order of paragraph sequence number. When a next paragraph to be traversed exists, every paragraph in the corresponding text paragraph set whose sequence number is greater than or equal to that of the currently traversed paragraph and smaller than that of the next paragraph to be traversed is added to the cover paragraph set of the currently traversed paragraph.
When no next paragraph to be traversed exists, only the currently traversed paragraph is added to its own cover paragraph set.
In other words, for the last paragraph traversed in a subset, the cover paragraph set may include only that paragraph itself.
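The construction of cover paragraph sets described above can be sketched as follows. This is an illustrative sketch only: the function name `cover_sets`, the use of integer paragraph sequence numbers, and the format labels in the sample data are assumptions made for the example.

```python
from collections import defaultdict


def cover_sets(body_paragraphs, numbered):
    """Compute the cover paragraph set of every numbered paragraph.

    body_paragraphs: paragraph sequence numbers in the text paragraph set.
    numbered: dict mapping paragraph sequence number -> numbering format.
    Returns a dict mapping paragraph sequence number -> set of covered
    paragraph sequence numbers.
    """
    # Group numbered paragraphs that share a format into one subset,
    # each subset in ascending order of paragraph sequence number.
    subsets = defaultdict(list)
    for index in sorted(numbered):
        subsets[numbered[index]].append(index)

    covers = {}
    for members in subsets.values():
        # A paragraph covers everything from itself up to, but excluding,
        # the next paragraph of the same format.
        for current, following in zip(members, members[1:]):
            covers[current] = {p for p in body_paragraphs if current <= p < following}
        # The last paragraph of a subset covers only itself.
        covers[members[-1]] = {members[-1]}
    return covers


# Sample data: paragraphs 1..10, six of which carry a number.
body = list(range(1, 11))
numbered = {1: "one", 2: "1", 4: "1", 6: "one", 7: "(one)", 9: "(one)"}
covers = cover_sets(body, numbered)
```

In this sample, paragraph 1 (format "one") covers paragraphs 1 to 5, while paragraph 6, the last paragraph of format "one", covers only itself.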
After the cover paragraph set of each numbered text paragraph in the text paragraph set is determined, an alternative hierarchical structure may be determined from the pre-stored hierarchical structure set according to the cover paragraph sets of the numbered paragraphs in the corresponding text paragraph set.
Optionally, the alternative hierarchical structure may also be determined directly from the cover paragraph sets of the numbered paragraphs in the corresponding body paragraph set.
Optionally, when the cover paragraph sets are used to determine the alternative hierarchical structure, the hierarchical relationship between different numbered paragraphs, that is, between the formats of the numbers those paragraphs contain, may be determined according to whether their cover paragraph sets intersect.
Optionally, the paragraphs containing numbers in the text paragraph set may be traversed in ascending order of paragraph number.
When the cover paragraph set of the currently traversed text paragraph has no intersection with the cover paragraph set of the next paragraph to be traversed, it is further judged whether the format of the number contained in the currently traversed text paragraph is the same as that of the number contained in the next paragraph.
If the two formats are the same, the hierarchical relationship between the currently traversed text paragraph and the next paragraph is determined as one representing the same level. The hierarchical relationship between the currently traversed paragraph and the next paragraph describes the relative order of the level corresponding to the numbering format of the currently traversed paragraph and the level corresponding to the numbering format of the next paragraph.
If the number contained in the currently traversed body paragraph has a different format from the number contained in the next paragraph, the hierarchical relationship may be determined according to similar paragraphs.
When the cover paragraph set of the currently traversed body paragraph intersects the cover paragraph set of the next paragraph, the hierarchical relationship may likewise be determined according to similar paragraphs.
Optionally, a similar paragraph is a body paragraph whose number has the same format as the number of the next paragraph.
If a similar paragraph exists, it can be determined that the similar paragraph has the same level as the next paragraph. It is therefore checked whether a similar paragraph exists among the body paragraphs that have already been traversed.
Optionally, determining the hierarchical relationship according to similar paragraphs may include: when a similar paragraph exists among the traversed text paragraphs, determining the hierarchical relationship between the similar paragraph and the next paragraph as one representing the same level; and when no similar paragraph exists among the traversed paragraphs, determining the hierarchical relationship between the currently traversed text paragraph and the next paragraph as one representing that the level of the currently traversed text paragraph is one level higher than the level of the next paragraph.
The hierarchical relationship between the similar paragraph and the next paragraph describes the relative order of the level corresponding to the numbering format of the similar paragraph and the level corresponding to the numbering format of the next paragraph.
Optionally, when checking whether a similar paragraph exists among the already traversed text paragraphs, those paragraphs may be examined in descending order of paragraph number, or in ascending order.
Optionally, after traversing to the text paragraph with the largest paragraph number in the corresponding text paragraph set, there is no next text paragraph to be traversed, and the step for determining the hierarchical relationship may not be performed.
In the text paragraph set, the text paragraph with the largest paragraph number can determine the hierarchical relationship when being used as the next paragraph, so the text paragraph with the largest paragraph number also has the hierarchical relationship with other text paragraphs.
Obviously, after the traversal is finished, the hierarchical relationship between different numbered text paragraphs in the corresponding text paragraph set can be determined, and the temporary hierarchical structure can be further determined according to the determined hierarchical relationship between the text paragraphs. The determined temporary hierarchy may be one or more.
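The traversal rules above can be sketched as follows. This is a sketch under stated assumptions, not the patented implementation: the numbered paragraphs are visited in ascending order of paragraph number, the currently traversed paragraph counts among the traversed paragraphs, and only the "one level higher" relationships between numbering formats are collected. The function name `format_relations` and the sample data are hypothetical.

```python
def format_relations(order, fmt, covers):
    """Derive "higher than" relationships between numbering formats.

    order: numbered paragraph sequence numbers in ascending order.
    fmt: dict mapping paragraph sequence number -> numbering format.
    covers: dict mapping paragraph sequence number -> cover paragraph set.
    Returns a set of (higher_format, lower_format) pairs.
    """
    higher_than = set()
    traversed = []
    for current, following in zip(order, order[1:]):
        traversed.append(current)
        disjoint = covers[current].isdisjoint(covers[following])
        if disjoint and fmt[current] == fmt[following]:
            continue  # same format, no overlap: same level
        # Otherwise decide by similar paragraphs: already traversed
        # paragraphs whose format equals that of the next paragraph.
        has_similar = any(fmt[p] == fmt[following] for p in traversed)
        if not has_similar:
            # No similar paragraph: current is one level above the next.
            higher_than.add((fmt[current], fmt[following]))
        # If a similar paragraph exists, it shares a level with the next
        # paragraph, so no new "higher than" relationship is recorded.
    return higher_than


# Sample data: cover sets written out by hand for six numbered paragraphs.
fmt = {1: "one", 2: "1", 4: "1", 6: "one", 7: "(one)", 9: "(one)"}
covers = {1: {1, 2, 3, 4, 5}, 2: {2, 3}, 4: {4}, 6: {6}, 7: {7, 8}, 9: {9}}
relations = format_relations(sorted(fmt), fmt, covers)
```

For this sample the result contains two relationships: "one" is one level higher than "1", and "one" is one level higher than "(one)".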
Alternatively, the temporary hierarchy may be a hierarchy determined directly from the determined hierarchical relationship, and the temporary hierarchy may be determined directly as an alternative hierarchy.
Optionally, a preset hierarchical structure set may also be used to perform screening and judgment on the temporary hierarchical structure, and determine the alternative hierarchical structure from the preset hierarchical structure set.
Specifically, when any temporary hierarchy is included in the preset hierarchy set, the temporary hierarchy may be determined as an alternative hierarchy.
Specifically, the hierarchical relationship between the numbering formats included in the text paragraph set may be further determined, and the temporary hierarchical structure may be determined by using the hierarchical relationship between the numbering formats. Since there may be more than one hierarchical relationship, at least one temporary hierarchy may be correspondingly determined.
Optionally, the hierarchical relationships between paragraphs may be traversed as follows: determining the paragraphs involved in the currently traversed hierarchical relationship; determining the formats of the numbers contained in those paragraphs; and taking the currently traversed hierarchical relationship as the hierarchical relationship between the determined numbering formats.
Correspondingly, the alternative hierarchical structure may be determined as follows: determining temporary hierarchical structures according to the hierarchical relationships between the determined numbering formats, where one or more temporary hierarchical structures are determined, each includes one or more levels, and different levels correspond to different numbering formats; and traversing the determined temporary hierarchical structures, and determining the currently traversed temporary hierarchical structure as an alternative hierarchical structure when it exists in the preset hierarchical structure set.
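Screening temporary hierarchical structures against the preset hierarchical structure set amounts to a membership test. The sketch below assumes a hierarchy is an ordered list of format labels; the contents of `PRESET_HIERARCHIES` and the function name `filter_by_preset` are illustrative assumptions, and duplicates are dropped in the same pass.

```python
# Assumed preset set, using the translated format labels of this description.
PRESET_HIERARCHIES = {
    ("one", "(one)"),
    ("one", "1"),
    ("one", "1", "(1)"),
}


def filter_by_preset(temporary, preset=PRESET_HIERARCHIES):
    """Keep temporary hierarchies found in the preset set, without duplicates."""
    seen = set()
    kept = []
    for hierarchy in temporary:
        key = tuple(hierarchy)
        if key in preset and key not in seen:
            seen.add(key)
            kept.append(list(hierarchy))
    return kept


temporary = [["one"], ["one", "1"], ["one", "(one)"], ["one", "1"], ["1", "one"]]
alternatives = filter_by_preset(temporary)
```

Here `["one"]` and the inverted `["1", "one"]` are rejected because they are not in the assumed preset set, and the repeated `["one", "1"]` is kept once.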
Of course, in other alternative embodiments, the temporary hierarchical structure may also be determined without determining cover paragraph sets: each text paragraph in the text paragraph set is traversed directly, the numbers contained in the text paragraphs are obtained, and the hierarchical relationship between the numbers is determined directly from the order in which they are obtained, so as to determine the temporary hierarchical structure.
In an alternative embodiment, the alternative hierarchical structure may be determined incrementally: each time a numbered text paragraph in the text paragraph set is traversed and the hierarchical relationship between the currently traversed text paragraph and the next paragraph has been determined, a temporary hierarchical structure is determined according to the currently determined hierarchical relationship, the format of the number contained in the currently traversed text paragraph, and the format of the number contained in the next paragraph.
Optionally, the determined temporary hierarchy includes at least the currently determined highest-level numbering format, the format of the number contained in the currently traversed text paragraph, and the format of the number contained in the next paragraph.
Optionally, when determining the temporary hierarchical structure, the hierarchical relationships that involve the highest-level numbering format may be selected from all hierarchical relationships related to the format of the number in the currently traversed text paragraph, all hierarchical relationships related to the format of the number in the next paragraph, and the currently determined hierarchical relationship, and the temporary hierarchical structure may be determined based on the selected hierarchical relationships.
In this embodiment, the determined temporary hierarchy may be added to the set of temporary hierarchies so that duplicate temporary hierarchies may be integrated. Duplicate temporary hierarchies may also be deleted where they are determined.
In one specific example, a text paragraph set includes the following text paragraphs: paragraphs containing the numbers "one" and "two"; within the body part corresponding to the number "one", paragraphs containing the numbers "1" and "2"; and within the body part corresponding to the number "two", paragraphs containing the numbers "(one)" and "(two)".
Based on the above embodiment, it can be determined that the paragraph containing the number "one" is one level higher than the paragraphs containing the numbers "1" and "2", and that the paragraph containing the number "two" is one level higher than the paragraphs containing the numbers "(one)" and "(two)".
Thus 2 hierarchical relationships are obtained, and based on these 2 hierarchical relationships, 2 alternative hierarchical structures can be determined, namely [one, 1] and [one, (one)].
Of course, when the hierarchical relationships are determined by traversal and a corresponding alternative hierarchical structure is determined each time a hierarchical relationship is determined, 3 non-repeating alternative hierarchical structures can be determined, namely [one], [one, 1] and [one, (one)].
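The step from hierarchical relationships to hierarchical structures in this example can be sketched as follows. The text does not prescribe how the relationships are assembled, so the approach below is an assumption: each (higher, lower) pair is treated as an edge, and every chain from a top-level format down to a format with nothing below it is taken as one hierarchy. The function name `build_hierarchies` is hypothetical.

```python
def build_hierarchies(relations):
    """Turn (higher_format, lower_format) pairs into ordered hierarchies."""
    children = {}
    lower = set()
    for high, low in relations:
        children.setdefault(high, set()).add(low)
        lower.add(low)
    # Top-level formats are those that never appear as the lower side.
    roots = sorted(f for f in children if f not in lower)
    result = []

    def walk(path):
        # Skip formats already on the path to guard against cycles.
        nexts = sorted(c for c in children.get(path[-1], ()) if c not in path)
        if not nexts:
            result.append(path)
            return
        for child in nexts:
            walk(path + [child])

    for root in roots:
        walk([root])
    return result


# The two relationships of the example: "one" above "1", "one" above "(one)".
relations = {("one", "1"), ("one", "(one)")}
hierarchies = build_hierarchies(relations)
```

With the two relationships of the example, the sketch yields the two structures [one, 1] and [one, (one)]. The single-level structure [one] mentioned for the incremental variant is not produced by this sketch.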
The following is a detailed explanation with respect to S404.
After the alternative hierarchy is determined, a body hierarchy of the document to be identified may be further determined.
In an alternative embodiment, in the case that only one alternative hierarchical structure is determined, that alternative hierarchical structure may be directly determined as the body hierarchical structure of the document to be identified.
In another alternative embodiment, in the case that a plurality of alternative hierarchies are determined, an alternative hierarchy meeting preset requirements may be determined as the text hierarchy of the document to be identified.
Optionally, a body hierarchy may correspond uniquely to one body whole, and the document to be identified may contain a plurality of body wholes, so the document to be identified may contain one or more body hierarchies. It should be noted that the document to be identified may be an identified sub-document that contains only one body whole and therefore only one body hierarchy.
Optionally, the preset requirements may include: the alternative hierarchy has the largest number of corresponding paragraphs. The paragraphs corresponding to an alternative hierarchy may be the paragraphs in the body paragraph set whose number has the same format as any numbering format in that alternative hierarchy.
Optionally, in the determined alternative hierarchical structure, determining a text hierarchical structure of the document to be recognized may include: and traversing the alternative hierarchical structure and determining the number of paragraphs corresponding to the currently traversed alternative hierarchical structure.
And after traversing, determining the text hierarchical structure of the document to be identified according to the number of paragraphs corresponding to the alternative hierarchical structure.
Specifically, the alternative hierarchical structure with the largest number of corresponding paragraphs may be determined as the text hierarchical structure of the document to be identified.
Optionally, the paragraphs corresponding to a hierarchical structure may include: the text paragraphs in the corresponding text paragraph set whose numbering format is the same as any numbering format in the hierarchical structure.
In addition, since the document format specification prescribes the numbering formats that correct hierarchical titles should use, namely "one", "1." and "(1)", an alternative hierarchical structure whose numbering formats are the correct numbering formats should be selected where possible.
For convenience of description, the hierarchy containing the correct numbering format is referred to as a preset hierarchy.
Hierarchical structures outside the preset hierarchical structures are treated as erroneous. To accommodate the different ways in which users edit documents, hierarchical titles may still be determined using such an erroneous hierarchical structure, which improves the user experience.
Therefore, optionally, in the case where the determined alternative hierarchies include any preset hierarchy, the preset requirements may include: the alternative hierarchical structure is a preset hierarchical structure and has the largest number of corresponding paragraphs among them.
In the case where none of the determined alternative hierarchies is a preset hierarchy, the preset requirements may include: the alternative hierarchy has the largest number of corresponding paragraphs.
Therefore, under the condition that the determined alternative hierarchical structure does not contain any preset hierarchical structure, determining the alternative hierarchical structure with the largest number of corresponding paragraphs as the text hierarchical structure of the document to be identified; and under the condition that the determined alternative hierarchical structure contains any preset hierarchical structure, determining the preset hierarchical structure with the maximum number of corresponding paragraphs as the text hierarchical structure of the document to be identified.
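The selection rule can be sketched as follows: count the paragraphs corresponding to each alternative hierarchy, prefer preset hierarchies when any are present, and take the one with the most corresponding paragraphs. The function names, the sample data and the contents of the preset set are assumptions for illustration; ties are broken in favor of the earlier candidate, which the text does not specify.

```python
def count_corresponding(hierarchy, para_formats):
    """Number of paragraphs whose numbering format appears in the hierarchy."""
    return sum(1 for f in para_formats.values() if f in hierarchy)


def select_body_hierarchy(candidates, para_formats, preset):
    """Pick the body hierarchy from the alternative hierarchies."""
    if not candidates:
        return None
    preset_hits = [c for c in candidates if tuple(c) in preset]
    # Restrict to preset hierarchies when at least one is present.
    pool = preset_hits or candidates
    return max(pool, key=lambda c: count_corresponding(c, para_formats))


# Sample data: paragraph sequence number -> numbering format.
para_formats = {4: "one", 5: "(one)", 7: "(one)", 9: "one", 10: "(one)", 12: "1"}
preset = {("one", "(one)"), ("one", "1"), ("one", "1", "(1)")}

chosen = select_body_hierarchy(
    [["one"], ["one", "(one)"], ["one", "1"]], para_formats, preset
)
fallback = select_body_hierarchy([["one"], ["a", "b"]], para_formats, preset)
```

In the first call two candidates are preset hierarchies, and [one, (one)] wins with 5 corresponding paragraphs against 3. In the second call no candidate is a preset hierarchy, so the plain paragraph count decides.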
Optionally, the preset hierarchical structures may include: [one, (one)], [one, 1] and [one, 1, (1)].
The following is a detailed explanation with respect to S405.
After the body hierarchy is determined, a hierarchical heading may be determined from the body hierarchy.
Specifically, the paragraphs corresponding to the body hierarchy may be determined as hierarchical titles.
Optionally, determining the hierarchical title according to the body hierarchy may include: determining a corresponding paragraph of a text hierarchy; traversing the determined corresponding paragraph, determining the corresponding level of the numbering format contained in the currently traversed corresponding paragraph in the text hierarchical structure, and determining the currently traversed corresponding paragraph as a hierarchical title according to the determined level.
The paragraphs corresponding to the text hierarchy may be: the paragraphs in the text paragraph set whose number has the same format as any numbering format in the text hierarchy.
Determining the corresponding paragraph currently traversed as a hierarchy header according to the determined hierarchy, which may include: the corresponding paragraph of the current traversal is determined as the title or title body corresponding to the determined level.
In particular, the corresponding paragraph of the current traversal may be marked as a title or a title body corresponding to the determined hierarchy.
Whether a paragraph is a title or a title body can be decided according to the number of characters in the paragraph or other characteristics of the paragraph. Optionally, if the number of characters contained in the paragraph is greater than a preset number, the paragraph is determined as a title body; otherwise, the paragraph is determined as a title.
In a specific example, the determined text hierarchy may be [one, (one), 1, (1)], whose numbering formats correspond to different levels, specifically: level one: "one"; level two: "(one)"; level three: "1"; level four: "(1)".
Thus, a paragraph containing a number in the "one" format may be identified as a first-level title, a paragraph containing a number in the "(one)" format may be identified as a second-level title, and so on.
Obviously, according to the determined text hierarchy, in the case that the number format contained in the text paragraph in the corresponding text paragraph set is determined to be the same as any number format in the determined text hierarchy, the hierarchy corresponding to the text paragraph can be determined.
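Labeling paragraphs from the body hierarchy can be sketched as follows. The level of a paragraph is the position of its numbering format in the hierarchy, and the split between title and title body uses a character-count threshold. The threshold value `max_title_chars=30`, the function name `label_titles` and the sample paragraphs are assumptions; the text only speaks of "a preset number".

```python
def label_titles(paragraphs, formats, hierarchy, max_title_chars=30):
    """Label numbered paragraphs as (level, "title" or "title body").

    paragraphs: dict mapping paragraph sequence number -> paragraph text.
    formats: dict mapping paragraph sequence number -> numbering format.
    hierarchy: ordered list of numbering formats, highest level first.
    """
    level_of = {f: i + 1 for i, f in enumerate(hierarchy)}
    labels = {}
    for index, f in formats.items():
        if f not in level_of:
            continue  # format not in the body hierarchy: not a hierarchical title
        long_paragraph = len(paragraphs[index]) > max_title_chars
        labels[index] = (level_of[f], "title body" if long_paragraph else "title")
    return labels


paragraphs = {
    1: "I. General requirements",
    2: "(I) Guiding principles. This paragraph continues with body text after the title.",
    3: "Plain body text without a number.",
    4: "1. Scope",
    5: "A. Item in a format outside the hierarchy",
}
formats = {1: "one", 2: "(one)", 4: "1", 5: "A"}
labels = label_titles(paragraphs, formats, ["one", "(one)", "1", "(1)"])
```

Paragraph 2 exceeds the assumed threshold and is labeled a second-level title body, while paragraph 5 is ignored because its format is not part of the body hierarchy.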
In the case where the electronic device for editing the document to be recognized locally executes the process of the method, the typesetting format can be automatically adjusted directly based on the recognition result.
In a case where the process of the method is executed by another electronic device interacting with the electronic device editing the document to be recognized, in an optional embodiment, after all hierarchical titles are determined for the document to be recognized, the recognition result may be further returned to the electronic device editing the document to be recognized, and of course, the recognition result may also be returned after the layout format is adjusted.
Optionally, the paragraph identifier corresponding to each identified hierarchical heading may be returned to adjust the layout for the paragraph corresponding to the received paragraph identifier. A particular paragraph identification may be a paragraph number.
Alternatively, the layout of each identified hierarchical title in the document to be identified may be adjusted, and the adjusted document to be identified may be returned.
Fig. 15 is a schematic diagram of document interaction in a method for identifying hierarchical titles according to an embodiment of the present invention.
The user can trigger the operation for identifying the hierarchy title through the client aiming at the document to be identified, so that the client sends the document to be identified to the server. The client may specifically be a client of document editing software.
After identifying the hierarchical title in the document to be identified, the server side can return the paragraph identifier of the paragraph corresponding to the hierarchical title to the client side.
In this embodiment, by returning the recognition result to the client, the client can conveniently and automatically perform typesetting adjustment, so that operations such as manual adjustment of a user are avoided, user operation is facilitated, and user experience is improved.
Through the above method flow, the hierarchical titles in the text can be determined by using the paragraph characteristics of hierarchical titles, and the various editing errors a user may make are tolerated, so that hierarchical titles can be identified even in complex document editing situations. The method is fault-tolerant and convenient for users to edit with, and the identified hierarchical titles can be adjusted automatically to the correct typesetting format, which simplifies user operation and improves the user experience.
The hierarchical titles identified by the method flow can be used for the proofreading of the subsequent administrative official document format and the conversion of the administrative official document.
Specifically, when the administrative official document proofreading function provided by the software is used, the hierarchical titles in the document are identified through the above method flow, and it is then checked whether the typesetting format of the identified hierarchical titles meets the specification. If it does, a result indicating that proofreading passed is returned; if it does not, a result indicating that proofreading failed can be returned, or the typesetting format of the identified hierarchical titles can be adjusted automatically so that the adjusted typesetting format meets the specification.
Alternatively, when the administrative official document conversion function provided by the software is used, the hierarchical titles in the document are identified through the above method flow, and the typesetting format of the identified hierarchical titles is further automatically adjusted to meet the specified typesetting format.
In addition, one or more titles may be contained in the document to be identified, and different titles may correspond to different body parts. For example, when a novel or magazine is edited in a document, it may typically contain a plurality of articles, and different articles may contain different titles and corresponding different texts.
For convenience of description, a title and corresponding text in the document to be recognized are collectively referred to as a sub-document. Thus, one or more subdocuments may be included in the document to be identified.
For example, a plurality of chapters may be edited in one document, each chapter content including a chapter title and a chapter body, and each chapter content may be considered a subdocument. The attachment content may also be edited in the document, which may be considered a sub-document.
The document to be identified in the above method flow may be any identified sub-document.
Therefore, in an alternative embodiment, the sub-documents may be identified, and then a further identification may be performed for each sub-document as the document to be identified, and the hierarchical titles therein may be identified.
For convenience of understanding, the embodiment of the invention also discloses an application embodiment.
Fig. 16 is a flowchart of another method for identifying hierarchical titles according to the embodiment of the present invention. For convenience of description, the text paragraph set is referred to below as a text region.
S501: identifying text regions. Specifically: identifying the sub-documents in the user document, and acquiring a list of the paragraph regions of the attachments, titles and text.
S502: acquiring the text region list, traversing the text regions, and constructing the document hierarchical structures of the currently traversed text region.
S503: acquiring, from the document hierarchical structures of the text region, the hierarchical structure that is most likely to be that of the official document.
Specifically, the hierarchical structure most likely to be that of the current text region is obtained according to the priority of the hierarchical structures and the paragraphs each hierarchical structure includes.
S504: acquiring the paragraph list corresponding to the hierarchical structure, and marking the paragraphs in the paragraph list as level-[one, two, three, four] titles or level-[one, two, three, four] title bodies.
That is, the corresponding paragraphs are found according to the document hierarchical structure of the text region, and their identification results are marked as level-[one, two, three, four] titles or level-[one, two, three, four] title bodies (title together with body text).
To facilitate further understanding, the embodiment of the invention also provides two specific application examples.
Application example three.
Fig. 17 is a schematic diagram of a document to be identified according to an embodiment of the present invention. It contains 30 paragraphs, some of which are labeled with paragraph numbers in the figure, and some body paragraphs are omitted.
The document to be identified comprises 17 paragraphs. The sub-document in the user document is identified, and the list of paragraph regions obtained for attachments, titles and body text is [[1, 2, title], [4, 17, body]].
The region of the sub-document is thus [[[1, 2, title], [4, 17, body]]]. The body paragraph region is traversed, and the document hierarchical structures of the body region, built from its numbered paragraphs, are identified as [[one], [one, (one)]].
According to the priority of the hierarchical structures, it is determined that the preset hierarchical structure [one, (one)] is among them. The number of paragraphs contained in the preset hierarchical structure [one, (one)] is determined to be 10, and [one, (one)] is determined as the document hierarchical structure corresponding to the body region.
The paragraphs corresponding to the hierarchical structure are then found: "one" corresponds to paragraphs [6, 8, 10, 14] and "(one)" corresponds to paragraphs [11, 12, 13, 15, 16, 17].
Thus, paragraphs 6, 8, 10 and 14 are first-level headings, and paragraphs 11, 12, 13, 15, 16 and 17 are second-level headings.
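The marking step of application example three can be sketched as follows. This is only an illustrative sketch: the labels "one" and "(one)" stand in for the two numbering formats, and the names NUMBERED, HIERARCHY and mark_levels are hypothetical, not part of the disclosed method.

```python
# Hypothetical input mirroring application example three:
# paragraph number -> numbering-format label of that paragraph.
NUMBERED = {
    6: "one", 8: "one", 10: "one", 14: "one",
    11: "(one)", 12: "(one)", 13: "(one)",
    15: "(one)", 16: "(one)", 17: "(one)",
}

# Chosen document hierarchy: position 0 is level 1, position 1 is level 2.
HIERARCHY = ["one", "(one)"]


def mark_levels(numbered, hierarchy):
    """Return {paragraph number: heading level} for every paragraph whose
    numbering format appears in the chosen hierarchy."""
    level_of = {fmt: depth for depth, fmt in enumerate(hierarchy, start=1)}
    return {
        para: level_of[fmt]
        for para, fmt in sorted(numbered.items())
        if fmt in level_of
    }


levels = mark_levels(NUMBERED, HIERARCHY)
first_level = [p for p, lvl in levels.items() if lvl == 1]
second_level = [p for p, lvl in levels.items() if lvl == 2]
print(first_level)   # [6, 8, 10, 14]
print(second_level)  # [11, 12, 13, 15, 16, 17]
```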
Application example four.
Fig. 18 is a schematic diagram of another document to be identified according to an embodiment of the present invention. It contains 20 paragraphs, some of which are labeled with paragraph numbers in the figure, and some body paragraphs are omitted.
The document to be identified comprises 15 paragraphs. The sub-document in the user document is identified, and the list of paragraph regions obtained for attachments, titles and body text is [[2, 2, title], [5, 15, body]].
The region of the sub-document is thus [[[2, 2, title], [5, 15, body]]]. The body paragraph region is traversed, and the document hierarchical structures of the body region, built from its numbered paragraphs, are identified as [[one], [one, 1]]. It is determined that neither of them is a preset hierarchical structure, so the number of paragraphs contained in each of the two hierarchical structures is determined, giving [[one], 3] and [[one, 1], 10].
According to the number of contained paragraphs, the document hierarchical structure corresponding to the body region is determined to be [one, 1].
The paragraphs corresponding to the hierarchical structure are then found: "one" corresponds to paragraphs [6, 9, 12] and "1" corresponds to paragraphs [7, 8, 10, 11, 13, 14, 15].
Thus, paragraphs 6, 9 and 12 are first-level headings, and paragraphs 7, 8, 10, 11, 13, 14 and 15 are second-level headings.
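The paragraph counts used in application example four can be reproduced with a short sketch. It assumes each numbered paragraph has already been assigned a format label ("one" or "1" here); the function and variable names are hypothetical.

```python
def count_paragraphs(numbered, hierarchy):
    """Count the numbered paragraphs whose format matches any level
    of the given candidate hierarchy."""
    return sum(1 for fmt in numbered.values() if fmt in hierarchy)


# Hypothetical input mirroring application example four:
# paragraph number -> numbering-format label.
NUMBERED = {
    6: "one", 9: "one", 12: "one",
    7: "1", 8: "1", 10: "1", 11: "1", 13: "1", 14: "1", 15: "1",
}
CANDIDATES = [("one",), ("one", "1")]

counts = {h: count_paragraphs(NUMBERED, h) for h in CANDIDATES}
# No candidate is a preset hierarchy in this example, so the candidate
# covering the most paragraphs is taken.
best = max(counts, key=counts.get)
print(counts)  # {('one',): 3, ('one', '1'): 10}
print(best)    # ('one', '1')
```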
Corresponding to the method for identifying the hierarchical titles, the embodiment of the invention also provides an embodiment of a device for identifying the hierarchical titles.
Fig. 19 is a schematic structural diagram of an apparatus for identifying hierarchical titles according to an embodiment of the present invention. The apparatus may include the following units.
An obtaining unit 601, configured to obtain a document to be identified.
A text determining unit 602, configured to determine a text paragraph set in the document to be recognized according to the third paragraph feature.
An alternative structure determining unit 603, configured to determine an alternative hierarchical structure according to the text paragraph set; the determined alternative hierarchies are one or more.
A body structure determining unit 604, configured to determine, in the determined alternative hierarchical structure, a body hierarchical structure of the document to be identified.
A hierarchical heading determination unit 605 configured to determine a hierarchical heading from the text hierarchical structure of the document to be identified.
Optionally, the text determining unit 602 includes:
a hierarchical relationship determining subunit 602a, configured to determine, in the text paragraph set, a paragraph including a number; determining a hierarchical relationship between paragraphs for the determined paragraphs; and determining the hierarchical relationship among the formats of the numbers contained in the paragraphs according to the hierarchical relationship among the paragraphs.
An alternative determining subunit 602b, configured to determine an alternative hierarchical structure according to the determined hierarchical relationship between the numbering formats; one or more levels are included in the determined alternative hierarchical structure, with different levels included corresponding to different numbering formats.
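One way to decide whether a paragraph "contains a number" and which numbering format it uses is a small set of regular expressions. The sketch below is an assumption, not the disclosed implementation: the four patterns and the labels "one", "(one)", "1" and "(1)" are illustrative choices for common Chinese and Arabic numbering styles.

```python
import re

# Assumed numbering formats, checked in order. Labels are illustrative.
FORMAT_PATTERNS = [
    ("one", re.compile(r"^[一二三四五六七八九十]+、")),            # e.g. 一、
    ("(one)", re.compile(r"^[（(][一二三四五六七八九十]+[）)]")),  # e.g. （一）
    ("1", re.compile(r"^\d+[.．]")),                               # e.g. 1.
    ("(1)", re.compile(r"^[（(]\d+[）)]")),                        # e.g. (1)
]


def numbering_format(text):
    """Return the format label of the leading number, or None if the
    paragraph does not start with a recognised number."""
    stripped = text.lstrip()
    for label, pattern in FORMAT_PATTERNS:
        if pattern.match(stripped):
            return label
    return None


print(numbering_format("一、总则"))      # one
print(numbering_format("（二）范围"))    # (one)
print(numbering_format("3. Scope"))      # 1
print(numbering_format("Plain text"))    # None
```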
Optionally, the text structure determining unit 604 includes:
an alternative traversing subunit 604a, configured to traverse the alternative hierarchical structures and determine the number of paragraphs corresponding to each alternative hierarchical structure; wherein the paragraphs corresponding to an alternative hierarchical structure are the paragraphs in the text paragraph set whose number has the same format as any numbering format in that alternative hierarchical structure.
A text determining subunit 604b, configured to determine, after the traversal is finished, the text hierarchical structure of the document to be identified according to the number of paragraphs corresponding to each alternative hierarchical structure.
Optionally, the text determining subunit 604b is configured to:
determining the alternative hierarchical structure with the largest number of corresponding paragraphs as a text hierarchical structure of the document to be identified under the condition that the determined alternative hierarchical structure does not contain any preset hierarchical structure;
and under the condition that the determined alternative hierarchical structure contains any preset hierarchical structure, determining the preset hierarchical structure with the maximum number of corresponding paragraphs as the text hierarchical structure of the document to be identified.
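The two cases above can be sketched as a single selection function. The contents of PRESETS are hypothetical, and the count of 4 for [one] in the first call is derived from the paragraph list of application example three. The source does not say how ties are broken; this sketch simply keeps the first candidate with the largest count.

```python
def choose_hierarchy(counts, presets):
    """Pick the text hierarchy from {candidate hierarchy: paragraph count}.

    If any candidate is a preset hierarchy, the preset candidate with the
    most paragraphs wins; otherwise the candidate with the most paragraphs
    wins.
    """
    preset_hits = {h: n for h, n in counts.items() if h in presets}
    pool = preset_hits or counts
    return max(pool, key=pool.get)


# Hypothetical preset hierarchy set.
PRESETS = {("one", "(one)"), ("one", "(one)", "1")}

# Application example three: a preset hierarchy is among the candidates.
print(choose_hierarchy({("one",): 4, ("one", "(one)"): 10}, PRESETS))
# ('one', '(one)')

# Application example four: no candidate is a preset hierarchy.
print(choose_hierarchy({("one",): 3, ("one", "1"): 10}, PRESETS))
# ('one', '1')
```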
Optionally, the hierarchical title determining unit 605 includes:
a text corresponding paragraph determining subunit 605a, configured to determine the paragraphs corresponding to the text hierarchical structure; wherein the paragraphs corresponding to the text hierarchical structure are the paragraphs in the text paragraph set whose number has the same format as any numbering format in the text hierarchical structure.
A title determining subunit 605b, configured to traverse the determined corresponding paragraphs, determine the level in the text hierarchical structure that corresponds to the numbering format contained in the currently traversed corresponding paragraph, and determine the currently traversed corresponding paragraph as a hierarchical title according to the determined level.
Optionally, the title determining subunit 605b is configured to:
determine the currently traversed corresponding paragraph as a title of the determined level, or as a title with body text of the determined level.
Optionally, the text determining unit 602 is configured to:
determining a text paragraph in the document to be identified according to the third paragraph characteristic; one or more text paragraphs; traversing the text paragraphs; under the condition that the paragraph sequence number of the currently traversed text paragraph is continuous with the paragraph sequence number of the next text paragraph, collecting the currently traversed text paragraph and the next text paragraph in the same text paragraph set; in the case that the paragraph number of the currently traversed text paragraph is not consecutive to the paragraph number of the next text paragraph, the currently traversed text paragraph and the next text paragraph are collected in different text paragraph sets.
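The grouping rule above, where consecutive paragraph numbers share a set and a gap starts a new set, can be sketched as follows. The function name is hypothetical and paragraphs are represented only by their paragraph numbers.

```python
def group_consecutive(paragraph_numbers):
    """Split paragraph numbers into runs of consecutive numbers; each run
    plays the role of one text paragraph set."""
    groups = []
    for n in sorted(paragraph_numbers):
        if groups and n == groups[-1][-1] + 1:
            groups[-1].append(n)   # continues the current run
        else:
            groups.append([n])     # gap found: start a new set
    return groups


print(group_consecutive([4, 5, 6, 9, 10, 12]))
# [[4, 5, 6], [9, 10], [12]]
```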
Optionally, the text determining unit 602 is configured to: divide the document to be identified into one or more attachment documents according to preset attachment paragraph features; for each attachment document, collect the paragraphs having the same third paragraph feature as a preparatory paragraph set, there being one or more preparatory paragraph sets; determine all paragraphs in the document to be identified that lie between the paragraph with the smallest paragraph number and the paragraph with the largest paragraph number in the preparatory paragraph set, and add the determined paragraphs to that preparatory paragraph set; merge all the preparatory paragraph sets in the attachment document, where merging includes combining preparatory paragraph sets that intersect, so that no two merged preparatory paragraph sets intersect; traverse the merged preparatory paragraph sets and determine the currently traversed preparatory paragraph set as a preparatory title or preparatory body text; determine one or more sub-documents according to the determined preparatory titles and preparatory body text; and, for each sub-document, collect the paragraphs determined as preparatory body text to obtain a text paragraph set.
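The fill-and-merge part of this procedure can be sketched as below. It assumes that "all paragraphs between the minimum and maximum paragraph number" refers to the span of each preparatory paragraph set, and it represents paragraphs only by their numbers; the helper names are hypothetical.

```python
def fill_span(numbers):
    """Add every paragraph between the smallest and largest paragraph
    number of one preparatory paragraph set."""
    return set(range(min(numbers), max(numbers) + 1))


def merge_intersecting(spans):
    """Merge preparatory sets that intersect. Because each input is a
    filled, contiguous span, sorting by the smallest number and comparing
    with the last merged set is sufficient."""
    merged = []
    for span in sorted(spans, key=min):
        if merged and merged[-1] & span:
            merged[-1] |= span
        else:
            merged.append(set(span))
    return [sorted(s) for s in merged]


def prepare_sets(feature_groups):
    """feature_groups: lists of paragraph numbers sharing one feature."""
    return merge_intersecting([fill_span(g) for g in feature_groups if g])


print(prepare_sets([[1, 3], [2, 5], [8, 9], [12]]))
# [[1, 2, 3, 4, 5], [8, 9], [12]]
```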
Optionally, the hierarchical relationship determining subunit 602a is configured to: add the paragraphs in the text paragraph set whose numbers have the same format to the same text paragraph subset, obtaining one or more text paragraph subsets; for each text paragraph subset, traverse its paragraphs in ascending order of paragraph number; when a next paragraph exists, add the paragraphs in the text paragraph set whose paragraph number is greater than or equal to that of the currently traversed paragraph and less than that of the next paragraph to the covering paragraph set of the currently traversed paragraph; when no next paragraph exists, add the currently traversed paragraph to its own covering paragraph set; and determine the hierarchical relationship between paragraphs according to the covering paragraph sets of the numbered paragraphs in the text paragraph set.
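A minimal sketch of the covering paragraph set computation follows, using the numbered paragraphs of application example four as hypothetical input (body paragraphs 5 to 15). The function name and the dictionary layout are assumptions made for illustration.

```python
def covering_sets(body_paragraphs, formats):
    """Compute the covering paragraph set of every numbered paragraph.

    body_paragraphs: paragraph numbers in the text paragraph set.
    formats: paragraph number -> numbering-format label (numbered only).
    """
    by_format = {}
    for para in sorted(formats):
        by_format.setdefault(formats[para], []).append(para)

    cover = {}
    for same_format in by_format.values():
        for cur, nxt in zip(same_format, same_format[1:] + [None]):
            if nxt is None:
                cover[cur] = {cur}  # last paragraph of this format
            else:
                cover[cur] = {q for q in body_paragraphs if cur <= q < nxt}
    return cover


BODY = range(5, 16)
FORMATS = {
    6: "one", 9: "one", 12: "one",
    7: "1", 8: "1", 10: "1", 11: "1", 13: "1", 14: "1", 15: "1",
}
cover = covering_sets(BODY, FORMATS)
print(sorted(cover[6]))   # [6, 7, 8]
print(sorted(cover[8]))   # [8, 9]
print(sorted(cover[12]))  # [12]
```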
Optionally, the hierarchical relationship determining subunit 602a is configured to:
traversing the numbered paragraphs in the text paragraph set in ascending order of paragraph number;
in the case that the covering paragraph set of the currently traversed paragraph and the covering paragraph set of the next paragraph do not intersect: if the format of the number contained in the currently traversed paragraph is the same as that of the next paragraph, determining that the currently traversed paragraph and the next paragraph are at the same level; if the formats differ, determining the hierarchical relationship according to similar paragraphs;
in the case that the covering paragraph set of the currently traversed paragraph and the covering paragraph set of the next paragraph intersect, determining the hierarchical relationship according to similar paragraphs;
where a similar paragraph is a paragraph containing a number in the same format as the next paragraph.
Optionally, the hierarchical relationship determining subunit 602a is configured to:
in the case that a similar paragraph exists among the paragraphs already traversed, determining that the similar paragraph and the next paragraph are at the same level;
and in the case that no similar paragraph exists among the paragraphs already traversed, determining that the level of the currently traversed paragraph is one level higher than the level of the next paragraph.
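The pairwise rules above can be sketched as follows. The covering paragraph sets are supplied as hypothetical input for a five-paragraph example. When several similar paragraphs exist, the source does not say which one is paired with the next paragraph; this sketch assumes the most recent one.

```python
def paragraph_relations(formats, cover):
    """Return a list of ("same", a, b) or ("parent", a, b) relations.

    formats: paragraph number -> numbering-format label.
    cover:   paragraph number -> covering paragraph set.
    """
    order = sorted(formats)
    relations = []
    for i in range(len(order) - 1):
        cur, nxt = order[i], order[i + 1]
        disjoint = not (cover[cur] & cover[nxt])
        if disjoint and formats[cur] == formats[nxt]:
            relations.append(("same", cur, nxt))
            continue
        # Otherwise decide by similar paragraphs: already-traversed
        # paragraphs whose format equals that of the next paragraph.
        similar = [p for p in order[: i + 1] if formats[p] == formats[nxt]]
        if similar:
            relations.append(("same", similar[-1], nxt))
        else:
            relations.append(("parent", cur, nxt))  # cur is one level higher
    return relations


# Hypothetical example: paragraphs 1 and 4 use format "one", the rest "1".
FORMATS = {1: "one", 2: "1", 3: "1", 4: "one", 5: "1"}
COVER = {1: {1, 2, 3}, 4: {4}, 2: {2}, 3: {3, 4}, 5: {5}}

result = paragraph_relations(FORMATS, COVER)
print(result)
# [('parent', 1, 2), ('same', 2, 3), ('same', 1, 4), ('same', 3, 5)]
```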
Optionally, the hierarchical relationship determining subunit 602a is configured to:
traversing the hierarchical relationship between the paragraphs;
determining paragraphs contained in the currently traversed hierarchical relationship according to the currently traversed hierarchical relationship;
determining a format containing numbers in the paragraphs according to the determined paragraphs;
and determining the hierarchy relation of the current traversal as the hierarchy relation between the determined numbering formats.
Optionally, the alternative determining subunit 602b is configured to:
determining temporary hierarchical structures according to the determined hierarchical relationship between the numbering formats; there are one or more temporary hierarchical structures, each including one or more levels, with different levels corresponding to different numbering formats;
and traversing the determined temporary hierarchical structure, and determining the currently traversed temporary hierarchical structure as an alternative hierarchical structure under the condition that the currently traversed temporary hierarchical structure exists in the preset hierarchical structure set.
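The lifting of paragraph relations to numbering formats, followed by the preset filter, can be sketched as below. The way temporary hierarchies are built here (following parent-to-child format links and emitting every prefix of the chain) is an assumption made for illustration, as are all names and the preset set.

```python
def format_relations(paragraph_rels, formats):
    """Lift ("parent", a, b) paragraph relations to (format, format) pairs."""
    return {
        (formats[a], formats[b])
        for kind, a, b in paragraph_rels
        if kind == "parent"
    }


def temporary_hierarchies(parent_child):
    """Assumed construction: start at each top-level format, follow the
    parent -> child links, and emit every prefix of the resulting chain.
    Assumes each format has at most one child format."""
    children = dict(parent_child)
    roots = {p for p, _ in parent_child} - {c for _, c in parent_child}
    result = []
    for root in sorted(roots):
        chain = [root]
        while chain[-1] in children and children[chain[-1]] not in chain:
            chain.append(children[chain[-1]])
        result.extend(tuple(chain[:k]) for k in range(1, len(chain) + 1))
    return result


def filter_by_presets(temporary, presets):
    """Keep only the temporary hierarchies found in the preset set."""
    return [h for h in temporary if h in presets]


FORMATS = {6: "one", 11: "(one)", 12: "(one)", 14: "one"}
RELS = [("parent", 6, 11), ("same", 11, 12), ("same", 6, 14)]
PRESETS = {("one", "(one)")}  # hypothetical preset hierarchy set

temporary = temporary_hierarchies(format_relations(RELS, FORMATS))
print(temporary)                              # [('one',), ('one', '(one)')]
print(filter_by_presets(temporary, PRESETS))  # [('one', '(one)')]
```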
For a detailed explanation of the embodiments of the device reference is made to the explanations of the embodiments of the method described above.
The embodiment of the present invention further provides an electronic device, as shown in fig. 20, including a processor 1001, a communication interface 1002, a memory 1003 and a communication bus 1004, where the processor 1001, the communication interface 1002 and the memory 1003 complete mutual communication through the communication bus 1004,
a memory 1003 for storing a computer program;
the processor 1001 is configured to implement the method for identifying a title described in any of the above embodiments when executing the program stored in the memory 1003.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component.
In yet another embodiment of the present invention, there is also provided a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to execute the method of identifying a title described in any one of the above embodiments, or the method of identifying a hierarchical title described above.
In yet another embodiment of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the method of identifying titles described in any one of the above embodiments, or the method of identifying hierarchical titles described above.
In another embodiment provided by the present invention, an electronic device is further provided, which includes a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the communication bus; the memory is configured to store a computer program; and the processor is configured to implement, when executing the program stored in the memory, the method for identifying a title according to any of the above embodiments, or the method for identifying a hierarchical title according to any of the above embodiments.
In another embodiment of the present invention, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program, when executed by a processor, implements the method for identifying a title or the method for identifying a hierarchical title described in any one of the above embodiments.
The above embodiments may be implemented wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present invention are described in a related manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (21)

1. A method of identifying a title, comprising:
acquiring a document to be identified;
determining a title paragraph set in the document to be identified according to the first paragraph features, wherein the title paragraph set is one or more;
merging the paragraphs belonging to the same title name in the title paragraph set to obtain a title name paragraph set; the title name paragraph set is one or more;
and determining the title name paragraph set meeting the preset main title condition as a main title.
2. The method of claim 1, wherein merging paragraphs belonging to the same title name in the title paragraph set to obtain a title name paragraph set comprises:
and collecting the title paragraphs with the same characteristics and continuous paragraph numbers in the title paragraph set as a title name paragraph set.
3. The method of claim 2, wherein collecting the title paragraphs of the set of title paragraphs that have the same characteristics and consecutive paragraph numbers as a set of title name paragraphs comprises:
traversing the title paragraphs in the set of title paragraphs according to paragraph numbers;
collecting a currently traversed title paragraph and a next title paragraph in a same title name paragraph set in case that a second paragraph feature of the currently traversed title paragraph is the same as the next title paragraph;
in the event that the second paragraph feature of the currently traversed title paragraph is different from the next title paragraph, collecting the currently traversed title paragraph and the next title paragraph in different title name paragraph sets.
4. The method according to claim 1, wherein the determining the title name paragraph set meeting the preset main title condition as a main title comprises:
traversing the title name paragraph set and judging whether the title name paragraph set meets a preset main title condition;
and under the condition that the title name paragraph set meets the preset main title condition, determining the title name paragraph set as a main title.
5. The method of claim 4, wherein the determining whether the set of title name paragraphs meets the preset main title condition comprises:
determining that the title name paragraph set meets the preset main title condition when the title paragraph with the smallest paragraph number exists among the title paragraphs contained in the title name paragraph set;
and in the case that the title paragraph with the smallest paragraph number does not exist among the title paragraphs contained in the title name paragraph set, determining that the title name paragraph set does not meet the preset main title condition.
6. The method of claim 4, further comprising:
and under the condition that the title name paragraph set does not meet the preset main title condition, determining the title name paragraph set as a subtitle.
7. The method of claim 4, wherein after traversing the set of title name paragraphs and before determining whether the set of title name paragraphs meets the preset main title condition, the method further comprises:
preprocessing the title name paragraph set to obtain a title character string;
determining the title type according to the key characters contained in the title character string under the condition that the title character string contains any key character in a key character set; the title types correspond to the title name paragraph sets one by one, and any key character in the key character set corresponds to one title type.
8. The method of claim 7, wherein the pre-processing the set of title name paragraphs to obtain a title string comprises:
and splicing the characters contained in the title paragraphs in the title name paragraph set, and deleting preset characters according to a splicing result to obtain a title character string.
9. The method of claim 8, wherein the predetermined characters comprise:
blank characters; and/or
characters between preset symbols.
10. The method of claim 7, wherein determining the set of title name paragraphs as a main title comprises:
and marking the title paragraphs in the title name paragraph set according to the main title and the title types corresponding to the title name paragraph set.
11. The method of claim 7, further comprising:
and under the condition that the title name paragraph set does not meet the preset main title condition, marking the title paragraphs in the title name paragraph set according to the subtitle and the title type corresponding to the title name paragraph set.
12. The method of any of claims 2-11, wherein the second paragraph feature comprises: size and/or alignment.
13. The method of claim 1, wherein determining the set of title paragraphs in the document to be identified according to the first paragraph features comprises:
determining one or more title paragraphs in the document to be identified according to the first paragraph characteristics;
traversing the title paragraph;
collecting a currently traversed title paragraph and a next title paragraph in the same title paragraph set if the paragraph number of the currently traversed title paragraph is consecutive to the paragraph number of the next title paragraph;
in the case where the paragraph number of the currently traversed title paragraph is not consecutive with the paragraph number of the next title paragraph, collecting the currently traversed title paragraph and the next title paragraph in different sets of title paragraphs.
14. The method of claim 1, further comprising:
determining a text paragraph set in the document to be identified according to the third paragraph characteristics;
determining an alternative hierarchical structure according to the text paragraph set; the determined alternative hierarchies are one or more;
determining a text hierarchy of the document to be identified in the determined alternative hierarchies;
and determining a hierarchical title according to the text hierarchical structure of the document to be identified.
15. The method of claim 14, wherein determining an alternative hierarchy from the set of text paragraphs comprises:
determining paragraphs containing numbers in the text paragraph set;
determining a hierarchical relationship between paragraphs for the determined paragraphs;
determining the hierarchical relationship among the formats of the numbers contained in the paragraphs according to the hierarchical relationship among the paragraphs;
determining an alternative hierarchical structure according to the hierarchical relationship between the determined number formats; one or more levels are included in the determined alternative hierarchical structure, with different levels included corresponding to different numbering formats.
16. The method of claim 15, wherein determining the text hierarchy of the document to be identified in the determined alternative hierarchies comprises:
traversing the alternative hierarchical structure and determining the number of paragraphs corresponding to the alternative hierarchical structure; wherein the alternative hierarchy corresponding paragraphs are: in the text paragraph set, the numbered paragraphs are contained in the same format as any numbered paragraph in the alternative hierarchical structure;
and after traversing, determining the text hierarchical structure of the document to be identified according to the number of paragraphs corresponding to the alternative hierarchical structure.
17. The method of claim 16, wherein determining the text hierarchy of the document to be identified according to the number of paragraphs corresponding to the alternative hierarchy comprises:
determining the alternative hierarchical structure with the largest number of corresponding paragraphs as the text hierarchical structure of the document to be identified under the condition that the determined alternative hierarchical structure does not contain any preset hierarchical structure;
and under the condition that the determined alternative hierarchical structure comprises any preset hierarchical structure, determining the preset hierarchical structure with the maximum number of corresponding paragraphs as the text hierarchical structure of the document to be identified.
18. An apparatus for recognizing a title, comprising:
the acquisition module is used for acquiring a document to be identified;
the paragraph set determining module is used for determining a title paragraph set in the document to be identified according to the first paragraph characteristics, wherein the title paragraph set is one or more;
a title name determining module, configured to merge paragraphs belonging to the same title name in the title paragraph set to obtain a title name paragraph set; the title name paragraph set is one or more;
and the main title determining module is used for determining the title name paragraph set meeting the preset main title condition as a main title.
19. The apparatus of claim 18, wherein the main title determination module comprises:
the traversal submodule is used for traversing the title name paragraph set and judging whether the title name paragraph set meets the preset main title condition or not;
and the condition submodule is used for determining the title name paragraph set as a main title under the condition that the title name paragraph set meets the preset main title condition.
20. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any one of claims 1 to 17 when executing a program stored in the memory.
21. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1-17.
CN202111340991.4A 2021-11-12 2021-11-12 Title identification method and device, electronic equipment and storage medium Pending CN114118074A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111340991.4A CN114118074A (en) 2021-11-12 2021-11-12 Title identification method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111340991.4A CN114118074A (en) 2021-11-12 2021-11-12 Title identification method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114118074A true CN114118074A (en) 2022-03-01

Family

ID=80379399

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111340991.4A Pending CN114118074A (en) 2021-11-12 2021-11-12 Title identification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114118074A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination