CN110852079A - Document directory automatic generation method and device and computer readable storage medium - Google Patents

Publication number
CN110852079A
Authority
CN
China
Prior art keywords
document
target document
title
rule
regular expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910965809.0A
Other languages
Chinese (zh)
Inventor
侯丽
佘昊天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910965809.0A priority Critical patent/CN110852079A/en
Publication of CN110852079A publication Critical patent/CN110852079A/en
Priority to PCT/CN2020/112346 priority patent/WO2021068684A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering

Abstract

The invention relates to artificial intelligence technology and discloses a method for automatically generating a document directory, comprising: extracting an initial title from a target document, and determining an initial title rule of the target document based on the initial title; inputting the initial title rule into a pre-constructed generative adversarial network (GAN) model for training to obtain a trained title rule; generating a regular expression based on the trained title rule; and traversing the entire content of the target document, comparing the content with the regular expression, extracting all titles of the target document, and arranging the titles in traversal order to generate a document directory. The invention also provides a device for automatically generating a document directory and a computer-readable storage medium. The invention enables a document directory to be generated automatically, accurately, and efficiently.

Description

Document directory automatic generation method and device and computer readable storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method and a device for generating a document directory by deep learning of a document structure and a computer-readable storage medium.
Background
The existing method for extracting a document directory mainly reads a Word document through Apache POI (rendered elsewhere in this text as "point of interest" or "interest point"). This approach can only read the document paragraph by paragraph and cannot identify its specific structure. In addition, when a document has multi-level titles, the existing method cannot extract the directory structure completely and accurately.
Disclosure of Invention
The invention provides a method and a device for automatically generating a document directory and a computer-readable storage medium, whose main purpose is to obtain the document directory by applying deep learning to a target document.
In order to achieve the above object, the present invention provides an automatic document directory generation method, including:
extracting an initial title in a target document, and determining an initial title rule of the target document based on the initial title;
inputting the initial title rule into a pre-constructed generative adversarial network (GAN) model for training to obtain a trained title rule;
generating a regular expression based on the trained title rule;
traversing all the contents of the target document, comparing and analyzing the contents of the target document with the regular expression, extracting all the titles of the target document, arranging all the titles according to the traversal sequence, and generating a document directory.
Optionally, the method for automatically generating a document directory further includes: constructing the generative adversarial network model, comprising:
establishing a generative model and a discriminative model; and obtaining an optimized solution through mutual game learning between the generative model and the discriminative model, wherein the optimized solution comprises the trained title rule.
Optionally, before generating the regular expression, the method for automatically generating the document directory further includes:
generating a state machine based on the trained title rule; wherein the generating a state machine comprises:
carrying out grammar analysis on the trained title rule, and rewriting the trained title rule into a state machine rule required by the state machine construction; constructing a state machine according to the state machine rule;
and converting the constructed state machine into a format required by generating a regular expression and storing the format.
Optionally, traversing all the contents of the target document, comparing and analyzing the contents of the target document with the regular expression, and extracting all the titles of the target document, including:
traversing the whole content of the target document, and extracting one or more interest points from the target document;
extracting the content of the target document through the interest points, and identifying the outline structure of the target document;
and comparing the outline structure of the target document with the regular expression, performing matching analysis, if the content in the target document is matched with the regular expression, confirming that the content in the target document is the title, extracting the title, and if the content in the target document is not matched with the regular expression, confirming that the content in the target document is the text.
Optionally, the document directory is in Extensible Markup Language (XML) format; the file format of the target document is Microsoft Office Word.
In addition, in order to achieve the above object, the present invention further provides an automatic document directory generation apparatus, including a memory and a processor, where the memory stores an automatic document directory generation program operable on the processor, and the automatic document directory generation program implements the following steps when executed by the processor:
extracting an initial title in a target document, and determining an initial title rule of the target document based on the initial title;
inputting the initial title rule into a pre-constructed generative adversarial network (GAN) model for training to obtain a trained title rule;
generating a regular expression based on the trained title rule;
traversing all the contents of the target document, comparing and analyzing the contents of the target document with the regular expression, extracting all the titles of the target document, arranging all the titles according to the traversal sequence, and generating a document directory.
Optionally, the method for automatically generating a document directory further includes: constructing the generative adversarial network model, comprising: obtaining an optimized solution through mutual game learning between the generative model and the discriminative model, wherein the optimized solution comprises the trained title rule.
Optionally, before generating the regular expression, the method for automatically generating the document directory further includes:
generating a state machine based on the trained title rules, wherein the generating the state machine comprises:
carrying out grammar analysis on the trained title rule, and rewriting the trained title rule into a state machine rule required by the state machine construction;
constructing a state machine according to the state machine rule; and converting the constructed state machine into a format required by generating a regular expression and storing the format.
Optionally, traversing all the contents of the target document, comparing and analyzing the contents of the target document with the regular expression, and extracting all the titles of the target document, including:
traversing the whole content of the target document, and extracting one or more interest points from the target document;
extracting the content of the target document through the interest points, and identifying the outline structure of the target document;
and comparing the outline structure of the target document with the regular expression, performing matching analysis, if the content in the target document is matched with the regular expression, confirming that the content in the target document is the title, extracting the title, and if the content in the target document is not matched with the regular expression, confirming that the content in the target document is the text.
Optionally, the document directory is in Extensible Markup Language (XML) format; the file format of the target document is Microsoft Office Word.
In addition, to achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a document directory automatic generation program, which is executable by one or more processors to implement the steps of the document directory automatic generation method as described above.
The method extracts the initial title rule from the target document and inputs it into the pre-constructed generative adversarial network model for training to obtain the trained title rule, so that the analysis can be carried out efficiently by a computer without loss of accuracy. A regular expression is then configured according to the trained title rule, and finally the content of the target document is compared with the regular expression to extract the titles. Therefore, the document directory automatic generation method, the document directory automatic generation device and the computer-readable storage medium can generate a document directory accurately, efficiently and coherently.
Drawings
FIG. 1 is a flowchart illustrating a method for automatically generating a document directory according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an internal structure of an automatic document directory generation apparatus according to an embodiment of the present invention;
FIG. 3 is a block diagram illustrating an automatic document directory generation program in the automatic document directory generation apparatus according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a method for automatically generating a document directory. Referring to fig. 1, a flowchart of a method for automatically generating a document directory according to an embodiment of the present invention is shown. The document directory automatic generation method may be performed by an apparatus, which may be implemented by software and/or hardware.
In this embodiment, the method for automatically generating a document directory includes:
S1. Extracting an initial title in the target document, and determining an initial title rule of the target document based on the initial title.
The target document is the document for which a document directory is to be generated; it is in Word format. For example, the target document may be a Word file of a novel such as "Bing city", a Word file of "How to Read a Book", or another type of text document. The invention aims to identify the text content of the target document, extract the content that has chapter characteristics, and sort that content according to a preset rule to form the document directory of the target document.
The invention first extracts the initial title in the target document. The initial title refers to a brief sentence in the target document indicating the contents of articles, works and the like, and is generally divided into a main title, a subheading and the like. Headings may allow the reader to understand the main content and subject matter of the article. Natural segmentation of chapters, paragraphs, etc. can be made using headings.
Further, the present invention determines an initial title rule of the target document based on the initial title, including: after extracting the initial title in the target document, abstracting the general rules in the characteristics of the grammar, the semantic logic, the hierarchical relationship of each title and the like of the initial title based on the concrete form of the initial title (namely, the grammar, the semantic logic and the hierarchical relationship of each title actually contained by the initial title) to determine the initial title rule of the target document. The initial title rule refers to the category, structure, semantic logic and hierarchical relationship of each title of the initial title.
Specifically, the grammar refers to the part of speech of each word in the initial title and to the composition and morphology of the word. For example, the text document "animal universe" includes the initial titles "mammals", "birds", "reptiles", etc., all of which are grammatically nouns. The semantic logic uses the methods of modern logic to reveal the relationship between language expressions and their meanings. For example, the text document "animal universe" includes the initial titles "mammals", "felines", "leopard" and "cat"; according to semantic logic, the titles "cat" and "leopard" both belong to "mammals", which is a semantic inclusion relationship. The hierarchical relationship runs from large to small: Chapter X, Section 1, subsection 1.1.1, (1), and so on. The corresponding title rule can be determined from the hierarchical relationship of all the initial titles in the text and from the semantic-logical association between the title content and the other text content. Take the first chapter of the text "Research on non-material incentive strategies for knowledge type employees of M corporation" as an example, with the preset hierarchical logic chapter > section > subsection. The highest-level title is: "Chapter 1 Introduction". The second-level titles are: "Section 1 Background of the topic", "Section 2 Research significance", "Section 3 Research content", "Section 4 Research method". The third-level titles are: "1.2.1 Theoretical significance", "1.2.2 Practical significance".
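The hierarchical relationship described above (chapter, section, numbered subsection, parenthesized item) can be sketched as a small lookup from numbering style to level. This is a minimal illustration only: the class name HeadingLevelSketch, the English numbering patterns and the level numbers are assumptions made for this example and are not taken from the patent, which derives its rules from the target document itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical helper: maps a paragraph to a heading level, or 0 for body text. */
final class HeadingLevelSketch {
    // Ordered from the highest level to the lowest; all patterns are illustrative assumptions.
    private static final Map<Integer, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put(1, Pattern.compile("^Chapter\\s+\\d+\\b.*"));      // "Chapter 1 Introduction"
        RULES.put(2, Pattern.compile("^Section\\s+\\d+\\b.*"));      // "Section 2 Research significance"
        RULES.put(3, Pattern.compile("^\\d+(\\.\\d+){2}\\s+\\S.*")); // "1.2.1 Theoretical significance"
        RULES.put(4, Pattern.compile("^\\(\\d+\\)\\s*\\S.*"));       // "(1) First item"
    }

    /** Returns the level of the first rule that matches, or 0 when no rule matches. */
    static int levelOf(String paragraph) {
        String text = paragraph.trim();
        for (Map.Entry<Integer, Pattern> rule : RULES.entrySet()) {
            if (rule.getValue().matcher(text).matches()) {
                return rule.getKey();
            }
        }
        return 0; // no rule matched: treat the paragraph as body text
    }

    public static void main(String[] args) {
        String[] samples = {
            "Chapter 1 Introduction",
            "Section 2 Research significance",
            "1.2.1 Theoretical significance",
            "The prior art can only read according to paragraphs."
        };
        for (String s : samples) {
            System.out.println(levelOf(s) + " <- " + s);
        }
    }
}
```

Keeping the rules in an ordered map means a new numbering style can be added without touching the matching loop.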
S2. Inputting the initial title rule into a pre-constructed generative adversarial network (GAN) model for training to obtain a trained title rule.
Preferably, the generative adversarial network model comprises a generative model and a discriminative model. The generative model and the discriminative model obtain an optimized solution through mutual game learning, wherein the optimized solution comprises the trained title rule.
The initial title rule is input into the generative model for training to obtain virtual title rule data (a sample G(z)); the discriminative model D then judges the generated virtual title rule data, so that the data is trained to conform to the characteristics of the initial title rule and an optimized solution is obtained after training, wherein the optimized solution comprises the trained title rule.
For example, training the title rule in the constructed generative adversarial network model includes:
first, inputting the initial title rule as a variable z (hereinafter referred to as variable z) into the pre-constructed generative adversarial network model;
after obtaining the input initial title rule, the generative model G generates a sample G(z) that follows the real data distribution;
then taking the text content of the target document and the sample G(z) as an input data set, wherein the input data set may contain either or both of the text content of the target document and the sample G(z).
The input data set is input into the discriminative model D, which is used to determine whether the input data comes from the generative model G or from the real data, namely the text content of the target document (the real data is the actual text content of the target document, not the virtual data sample G(z) generated after learning). If the data in the current input data set comes from the sample G(z), it is labeled 0 and judged false; otherwise, the data comes from the real data, and it is labeled 1 and judged true. The goal of the generative model G is to make the generated virtual data, the sample G(z), behave on the discriminative model D in the same way as the real data (the text content of the target document).
The mutual game learning includes: through mutual game learning and iterative optimization, the performance of the generative model G and the discriminative model D is continuously improved; when the discrimination capability of the discriminative model D has improved and D still cannot determine the source of the data input to it, the generative model G is considered to have learned the real data distribution.
In summary, the initial title rule is input into the generative model for training to obtain a sample G(z), and the discriminative model D judges the generated sample G(z), so that the sample G(z) is trained to conform to the characteristics of the initial title rule. With the initial title rule as input, the generative model G and the discriminative model D learn through the mutual game to obtain an optimized solution, wherein the optimized solution comprises the trained title rule.
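The text does not state the training objective. As an assumption, the game between G and D described above can be written with the standard minimax objective from the generative adversarial network literature, where x is real data (the text content of the target document), z is the input variable (here the initial title rule), and D outputs a value near 1 for real data and near 0 for generated samples:

```latex
\min_{G}\,\max_{D}\; V(D, G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

Under this formulation, the stopping condition described above, in which D can no longer tell the source of its input, corresponds to D(x) = 1/2 for every input.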
S3. Generating a regular expression based on the trained title rule.
Generating the regular expression includes:
acquiring the trained title rule;
performing syntactic analysis on the trained title rule, and extracting the sentence pattern body of the trained title rule;
obtaining the semantic slots of the words in the sentence pattern body;
and generating a regular expression from the sentence pattern body, the semantic slots, and the remaining non-body part of the trained title rule.
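One way to picture this step is a rule written as literal text (the sentence pattern body) with named placeholders (the semantic slots), which is then expanded into a regular expression. The sketch below is only an illustration under that reading: the brace syntax, the slot names NUM and TITLE, and the class RuleToRegexSketch are hypothetical and do not come from the patent.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: expands a title rule with {SLOT} placeholders into a regex. */
final class RuleToRegexSketch {
    // Assumed slot vocabulary: slot name -> regex fragment.
    private static final Map<String, String> SLOTS = Map.of(
            "NUM", "\\d+",
            "TITLE", "\\S.*");

    private static final Pattern SLOT = Pattern.compile("\\{(\\w+)\\}");

    /** Literal parts are quoted; each slot becomes a named group. Each slot may be used once per rule. */
    static Pattern compile(String rule) {
        StringBuilder regex = new StringBuilder("^");
        Matcher m = SLOT.matcher(rule);
        int last = 0;
        while (m.find()) {
            regex.append(Pattern.quote(rule.substring(last, m.start())));
            String fragment = SLOTS.get(m.group(1));
            if (fragment == null) {
                throw new IllegalArgumentException("Unknown slot: " + m.group(1));
            }
            regex.append("(?<").append(m.group(1)).append(">").append(fragment).append(")");
            last = m.end();
        }
        regex.append(Pattern.quote(rule.substring(last))).append("$");
        return Pattern.compile(regex.toString());
    }

    public static void main(String[] args) {
        Pattern chapter = compile("Chapter {NUM} {TITLE}");
        Matcher m = chapter.matcher("Chapter 3 Research methods");
        if (m.matches()) {
            System.out.println(m.group("NUM") + " | " + m.group("TITLE"));
        }
    }
}
```

Quoting the literal parts with Pattern.quote keeps characters such as "." or "(" in a rule from being read as regex operators.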
In some embodiments of the present invention, before the regular expression is configured, the method further includes:
generating a state machine based on the trained title rule.
Preferably, the state machine is generated to provide a stable means of configuration and storage in the process of generating the regular expression. The state machine in this embodiment configures and stores regular expressions according to the corresponding title rules, and matches the regular expressions it holds according to the received characters and the position information in the regular expressions.
Wherein the generating the state machine comprises:
S301. Performing syntactic analysis on the trained title rule to obtain a configuration file, wherein the configuration file describes the identifier of each state of the trained title rule, the response information and state transition information for each event, and the hierarchical relationship among the states.
S302. Constructing the state machine according to the configuration file.
S303. Converting the constructed state machine into the format required for generating the regular expression and storing it.
In some embodiments of the present invention, the state machine is composed of a state register and a combinational logic circuit; it performs state transitions among preset states according to a control signal, and serves as the control center that coordinates the actions of related signals and completes specific operations.
Preferably, the state machine may be represented by data table entries, linked lists, instruction table entries, state diagrams, and the like according to different actual application carriers, which is not limited in this embodiment.
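As an illustration of the table-based representation mentioned above, the following sketch encodes a small state machine that recognizes section numbering such as "1", "1.2" or "1.2.1", together with a regular expression that accepts the same strings. The states, symbols and class name are assumptions made for this example; the patent does not specify the contents of its configuration file or the conversion procedure.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical sketch: a table-driven state machine for numbering like "1.2.1". */
final class NumberingStateMachineSketch {
    enum State { START, DIGITS, DOT }
    enum Symbol { DIGIT, DOT, OTHER }

    // Transition table: a missing entry means the input is rejected.
    private static final Map<State, Map<Symbol, State>> TABLE = new EnumMap<>(State.class);
    static {
        TABLE.put(State.START, new EnumMap<Symbol, State>(Map.of(Symbol.DIGIT, State.DIGITS)));
        TABLE.put(State.DIGITS, new EnumMap<Symbol, State>(
                Map.of(Symbol.DIGIT, State.DIGITS, Symbol.DOT, State.DOT)));
        TABLE.put(State.DOT, new EnumMap<Symbol, State>(Map.of(Symbol.DIGIT, State.DIGITS)));
    }

    /** Regular expression that accepts the same strings as the table above. */
    static final String EQUIVALENT_REGEX = "\\d+(\\.\\d+)*";

    private static Symbol classify(char c) {
        if (c >= '0' && c <= '9') return Symbol.DIGIT;
        if (c == '.') return Symbol.DOT;
        return Symbol.OTHER;
    }

    /** Returns true only if the whole input ends in the accepting state DIGITS. */
    static boolean accepts(String input) {
        State state = State.START;
        for (char c : input.toCharArray()) {
            state = TABLE.get(state).get(classify(c));
            if (state == null) {
                return false; // no transition defined for this symbol
            }
        }
        return state == State.DIGITS;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"1", "1.2.1", "1.", "a.1"}) {
            System.out.println(s + " -> machine=" + accepts(s)
                    + ", regex=" + s.matches(EQUIVALENT_REGEX));
        }
    }
}
```

Running both forms on the same inputs is a simple check that a stored regular expression still agrees with the machine it was derived from.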
S4. Traversing the entire content of the target document, comparing the content of the target document with the regular expression, extracting all the titles of the target document, arranging the titles in traversal order, and generating a document directory. According to an embodiment of the present invention, reading the target document, comparing its content with the regular expression, and extracting the titles of the target document includes the following steps:
S401. Extracting one or more points of interest (POI) from the target document.
This step includes: obtaining data information of the POI object based on the regular expression, wherein the data information of the POI object at least comprises a sentence pattern body; and comparing the obtained data information of the POI object with the content of the target document item by item, and extracting the part of the target document that follows the same rule as the data information of the POI object as the POI. For example, suppose the obtained data information of the POI object is X, Y; it is compared with the content of the target document, and if a part of the content is found to follow the same rule (such as a logic rule or a grammar rule), that part is extracted as the POI.
Here POI refers to Apache POI, an open-source library of the Apache Software Foundation that provides APIs for Java programs to read and write files in Microsoft Office formats.
In the preferred embodiment of the present invention, POI is used to read the paragraph content of the target Word document. A Word document comprises a plurality of paragraphs, a paragraph comprises a plurality of Runs, each Runs comprises a plurality of Run elements, and the Run is the minimum unit of the target document. For example, a paragraph contains several complete sentences, i.e., the Runs; each Runs in turn contains several phrases, i.e., the Run. Specifically, the steps of reading Word text content through POI are as follows:
(1) first, operating on XWPFParagraph in XWPFDocument through POI to obtain all paragraphs of the target document;
(2) obtaining all Runs in a paragraph through the getRuns() command;
(3) obtaining each Run in the Runs through XWPFRun;
(4) traversing the whole document, and obtaining the titles in the Word document through the getPr().getOutlineLvl() command.
The content of the Word document is traversed based on the above process, and all the titles in the Word document are extracted.
S402. Extracting the content of the target document through POI, and identifying the outline structure of the target document. Identifying the outline structure means arranging all the extracted titles hierarchically and sequentially, according to semantic logic and their order of appearance, to form the outline structure.
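Arranging titles hierarchically in their order of appearance can be sketched with a stack that always holds the chain of currently open headings. The sketch assumes each title already carries a numeric level (1 for the highest); the Node type and the class name OutlineSketch are hypothetical and used only for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch: builds an outline tree from headings given in document order. */
final class OutlineSketch {
    static final class Node {
        final int level;      // 1 = highest heading level; 0 is reserved for the root
        final String title;
        final List<Node> children = new ArrayList<>();

        Node(int level, String title) {
            this.level = level;
            this.title = title;
        }
    }

    /** Attaches each heading to the nearest preceding heading with a smaller level. */
    static Node build(List<Node> headings) {
        Node root = new Node(0, "ROOT");
        Deque<Node> open = new ArrayDeque<>();
        open.push(root);
        for (Node h : headings) {
            if (h.level < 1) {
                throw new IllegalArgumentException("heading level must be >= 1");
            }
            while (open.peek().level >= h.level) {
                open.pop(); // close headings at the same or a deeper level
            }
            open.peek().children.add(h);
            open.push(h);
        }
        return root;
    }

    /** Prints the tree with two spaces of indentation per level. */
    static void print(Node node, String indent) {
        for (Node child : node.children) {
            System.out.println(indent + child.title);
            print(child, indent + "  ");
        }
    }

    public static void main(String[] args) {
        Node root = build(List.of(
                new Node(1, "Chapter 1 Introduction"),
                new Node(2, "Section 1 Background of the topic"),
                new Node(2, "Section 2 Research significance"),
                new Node(3, "1.2.1 Theoretical significance"),
                new Node(3, "1.2.2 Practical significance")));
        print(root, "");
    }
}
```

Because the root has level 0 and every heading has level 1 or higher, the stack never becomes empty while headings are being attached.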
S403. Comparing the outline structure of the target document with the regular expression and performing matching analysis: if content in the target document matches the regular expression, that content is confirmed to be a title; if it does not match, that content is confirmed to be body text.
S5. Traversing the entire content of the target document, extracting all the titles, arranging the titles in traversal order, and generating a document directory.
In a preferred embodiment of the invention, all paragraphs of the Word document are traversed, the titles in the document content are identified by comparison with the regular expression, the titles are extracted one after another in the order of the document content, and they are integrated into a new document, namely the extracted complete chapter directory of the Word document.
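The traversal and matching can be sketched over plain paragraph strings, leaving out the Word-reading layer. In the sketch below the heading pattern, the XML element names directory and title, and the class DirectorySketch are assumptions made for illustration; in the method described above the pattern would be derived from the trained title rule. A fixed pattern like this one can misclassify body sentences that happen to start with a number such as "1.5".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch: extracts titles in traversal order and renders them as XML. */
final class DirectorySketch {
    // Illustrative pattern only; a real one would be generated from the trained title rule.
    static final Pattern HEADING = Pattern.compile(
            "^(Chapter\\s+\\d+|Section\\s+\\d+|\\d+(\\.\\d+)+)\\s+\\S.*$");

    /** Keeps paragraphs that match the pattern (titles) and skips the rest (body text). */
    static List<String> extractTitles(List<String> paragraphs) {
        List<String> titles = new ArrayList<>();
        for (String paragraph : paragraphs) {
            String text = paragraph.trim();
            if (HEADING.matcher(text).matches()) {
                titles.add(text);
            }
        }
        return titles;
    }

    /** Renders the titles as a flat XML list, preserving traversal order. */
    static String toXml(List<String> titles) {
        StringBuilder xml = new StringBuilder("<directory>\n");
        for (String title : titles) {
            xml.append("  <title>").append(escape(title)).append("</title>\n");
        }
        return xml.append("</directory>\n").toString();
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        List<String> paragraphs = List.of(
                "Chapter 1 Introduction",
                "This chapter introduces the topic.",
                "Section 1 Background of the topic",
                "1.2.1 Theoretical significance");
        System.out.print(toXml(extractTitles(paragraphs)));
    }
}
```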
The invention also provides an automatic generation device of the document directory. Referring to fig. 2, there is shown a schematic diagram of an internal structure of an automatic document directory generation apparatus according to an embodiment of the present invention.
In the present embodiment, the document directory automatic generation apparatus 1 may be a PC (Personal Computer), a terminal device such as a smartphone, a tablet computer, or a portable computer, or may be a server. The document directory automatic generation apparatus 1 includes at least a memory 11, a processor 12, a communication bus 13, and a network interface 14.
The memory 11 includes at least one type of readable storage medium, which includes a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like. The memory 11 may be an internal storage unit of the document directory automatic generation apparatus 1 in some embodiments, for example, a hard disk of the document directory automatic generation apparatus 1. The memory 11 may also be an external storage device of the automatic document directory generation apparatus 1 in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the automatic document directory generation apparatus 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the document directory automatic generation apparatus 1. The memory 11 may be used not only to store application software installed in the document directory automatic generation apparatus 1 and various types of data, such as the code of the document directory automatic generation program 01, but also to temporarily store data that has been output or is to be output.
The processor 12, which in some embodiments may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor or other data processing chip, is configured to execute program code or process data stored in the memory 11, for example to execute the document directory automatic generation program 01.
The communication bus 13 is used to realize connection communication between these components.
The network interface 14 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface), and is typically used to establish a communication link between the apparatus 1 and other electronic devices.
Optionally, the apparatus 1 may further comprise a user interface, which may comprise a display and an input unit such as a keyboard, and optionally a standard wired interface and a wireless interface. In some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is used for displaying information processed in the document directory automatic generation apparatus 1 and for displaying a visualized user interface.
FIG. 2 shows only the document directory automatic generation apparatus 1 having the components 11 to 14 and the document directory automatic generation program 01. It will be understood by those skilled in the art that the structure shown in FIG. 2 does not constitute a limitation of the document directory automatic generation apparatus 1, which may include fewer or more components than those shown, combine some components, or have a different arrangement of components.
In the embodiment of the apparatus 1 shown in fig. 2, a document directory automatic generation program 01 is stored in the memory 11; the following steps are implemented when the processor 12 executes the document directory automatic generation program 01 stored in the memory 11:
Step one: extracting an initial title in a target document, and determining an initial title rule of the target document based on the initial title.
The target document is the document for which a document directory is to be generated; it is in Word format. For example, the target document may be a Word file of a novel such as "Bing city", a Word file of "How to Read a Book", or another type of text document. The invention aims to identify the text content of the target document, extract the content that has chapter characteristics, and sort that content according to a preset rule to form the document directory of the target document.
The invention first extracts the initial title in the target document. The initial title refers to a brief sentence in the target document indicating the contents of articles, works and the like, and is generally divided into a main title, a subheading and the like. Headings may allow the reader to understand the main content and subject matter of the article. Natural segmentation of chapters, paragraphs, etc. can be made using headings.
Further, the present invention determines an initial title rule of the target document based on the initial title, including: after extracting the initial title in the target document, abstracting the general rules in the characteristics of the grammar, the semantic logic, the hierarchical relationship of each title and the like of the initial title based on the concrete form of the initial title (namely, the grammar, the semantic logic and the hierarchical relationship of each title actually contained by the initial title) to determine the initial title rule of the target document. The initial title rule refers to the category, structure, semantic logic and hierarchical relationship of each title of the initial title. The grammar refers to the part of speech to which the specific word belongs in the initial title, and the composition and morphological change (morphology) of the word, for example, the text document "animal universe" includes the initial title: mammalian bird reptiles, etc., these headings being grammatical as nouns; the semantic logic adopts a modern logic method to disclose the relationship between language expressions and meanings thereof, for example, a text document 'animal universe' comprises an initial title: a mammalian feline leopard cat, the title cat and leopard both belonging to a mammal according to semantic logic, being a semantically inclusive relationship; the hierarchical relationships are as follows from big to small: chapter x, first section, 1.1.1, (1), etc. According to the hierarchical relation of all the initial titles in the text; semantic logical association of the title content with other text content can determine the corresponding title rule. 
Taking Chapter 1 of the text "Research on non-material incentive strategies for knowledge-type employees of M Corporation" as an example, the highest-level title is "Chapter 1 Introduction"; the second-level titles are "Section 1 Background of the topic", "Section 2 Research significance", "Section 3 Research content", and "Section 4 Research methods"; the third-level titles are "1.2.1 Theoretical significance" and "1.2.2 Practical significance".
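The abstraction of a rule from concrete titles can be illustrated with a small sketch. The source does not specify how the abstraction is carried out, so the approach below is an assumption: it reduces each heading to a signature string by replacing digit runs in the numbering prefix with `{n}` and the remaining words with `{title}`. The function names `abstract_rule` and `initial_title_rules` are hypothetical.

```python
import re
from collections import Counter


def abstract_rule(title):
    """Reduce a heading to a rule signature, or None if it carries no numbering."""
    tokens = title.split()
    for index, token in enumerate(tokens):
        if any(ch.isdigit() for ch in token):
            # Everything up to and including the first numbered token is the
            # numbering prefix; the rest is treated as the free title text.
            prefix = " ".join(tokens[: index + 1])
            return re.sub(r"\d+", "{n}", prefix) + " {title}"
    return None


def initial_title_rules(titles):
    """Count how often each rule signature occurs among the initial titles."""
    return Counter(rule for rule in map(abstract_rule, titles) if rule)
```

Applied to the example headings, "Chapter 1 Introduction" and "1.2.1 Theoretical significance" yield different signatures, which is one simple way to separate hierarchy levels. A sentence of body text that happens to contain a number would also receive a signature, so this sketch is only meaningful when applied to titles that have already been extracted.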
Step two: input the initial title rule into a pre-constructed generative adversarial network (GAN) model for training to obtain a trained title rule.
Preferably, the generative adversarial network model comprises a generation model and a discrimination model. The generation model and the discrimination model obtain an optimal solution through mutual game learning, and the optimal solution comprises the trained title rule.
The initial title rule is input into the generation model for training to obtain virtual title rule data (a sample G(z)); the discrimination model D then judges the generated virtual title rule data, so that the data is trained to conform to the characteristics of the initial title rule and an optimal solution is obtained after training, where the optimal solution comprises the trained title rule.
For example, training the title rule in the constructed generative adversarial network model includes:
first, inputting the initial title rule as a variable z (hereinafter referred to as variable z) into the pre-constructed generative adversarial network model;
after the generation model G receives the input initial title rule, generating a sample G(z) that follows the real data distribution;
then taking the text content of the target document and the sample G(z) as an input data set, where the input data set may contain either or both of the text content of the target document and the sample G(z);
inputting the input data set into the discrimination model D, which judges whether the input data comes from the generation model G or from real data, namely the text content of the target document (real data means the actual text content of the target document, as opposed to the virtual data sample G(z) generated after learning). If the data in the current input data set comes from the sample G(z), it is labeled 0 and judged false; otherwise the data comes from real data, and it is labeled 1 and judged true. The goal of the generation model G is to make the generated virtual data, the sample G(z), behave on the discrimination model D in the same way as the real data (the text content of the target document) behaves on D.
The mutual game learning includes: the generation model G and the discrimination model D learn against each other and are optimized iteratively, so that the performance of both improves continuously; when the discrimination capability of D has improved and D still cannot tell the source of the data input to it, the generation model G is considered to have learned the real data distribution.
In summary, the initial title rule is input into the generation model for training to obtain the sample G(z), and the discrimination model D judges the generated sample G(z) so that it is trained to conform to the characteristics of the initial title rule. By inputting the initial title rule, the generation model G and the discrimination model D learn through the mutual game and reach an optimal solution, which comprises the trained title rule.
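The game between G and D described here is commonly written as a minimax objective. The source does not state a loss function, so the formula below is the textbook GAN objective, given as an assumption about what the mutual game learning optimizes. Here $p_{\text{data}}$ denotes the distribution of the real data (the text content of the target document) and $p_{z}$ the distribution of the input variable z; both symbols are introduced for illustration.

```latex
\min_{G}\,\max_{D}\; V(D,G)
  = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

Under this formulation, D(x) near 1 corresponds to the label "true" and D(G(z)) near 0 to the label "false". At the equilibrium of the standard objective, D outputs 1/2 for every input, which matches the condition that D can no longer tell the source of its input data.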
Step three: generate a regular expression based on the trained title rule.
Generating the regular expression according to the method includes:
acquiring the trained title rule;
performing syntactic analysis on the trained title rule, and extracting the sentence pattern body of the trained title rule;
obtaining the semantic slots of the words of the sentence pattern body;
and generating a regular expression according to the sentence pattern body, the semantic slots, and the remaining non-body part of the trained title rule.
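A minimal sketch of the last step might look as follows. It assumes, purely for illustration, that the trained title rule is already available as a template string in which literal words form the sentence pattern body and `{slot}` placeholders mark the semantic slots; the syntactic analysis itself is replaced by this hand-written template. The slot table `SLOTS` and the function `rule_to_regex` are hypothetical names, and each slot name may appear only once per rule.

```python
import re

# Hypothetical semantic slots: each slot name maps to the regex fragment
# that the words filling that slot must match.
SLOTS = {
    "number": r"\d+",
    "title": r"\S.*",
}


def rule_to_regex(rule):
    """Turn a rule such as 'Chapter {number} {title}' into a compiled regex.

    Literal text is escaped, spaces in the literal text match any run of
    whitespace, and each {slot} becomes a named group built from SLOTS.
    """
    parts = []
    for token in re.split(r"(\{\w+\})", rule):
        if token.startswith("{") and token.endswith("}"):
            name = token[1:-1]
            parts.append(f"(?P<{name}>{SLOTS[name]})")
        else:
            parts.append(r"\s+".join(re.escape(piece) for piece in token.split(" ")))
    return re.compile("^" + "".join(parts) + "$")
```

For example, `rule_to_regex("Chapter {number} {title}")` produces an expression that matches "Chapter 1 Introduction" and exposes the chapter number and the title text as named groups.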
In some embodiments of the present invention, before generating the regular expression, the method further includes:
generating a state machine based on the trained title rule.
Preferably, the state machine is generated to provide a stable means of configuration and storage during the generation of the regular expression. The state machine in this embodiment is a device that configures and stores regular expressions according to the corresponding title rules. The state machine matches the regular expressions it holds according to the characters it receives and the position information in the regular expressions.
Wherein the generating the state machine comprises:
S301, performing syntax analysis on the trained title rule to obtain a configuration file, wherein the configuration file describes the identifier of each state of the trained title rule, the response information and state transition information for each event, and the hierarchical relationship among the states.
S302, constructing the state machine according to the configuration file.
S303, converting the constructed state machine into the format required for generating the regular expression and storing it.
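Steps S301 to S303 can be sketched as a table-driven state machine. Everything in the example is an assumption made for illustration: the configuration is a Python dictionary rather than a file, the states and events are chosen to recognize headings of the form "1.2.1 Title", and the regular expression at the end is derived by hand for this one machine rather than by an automatic conversion.

```python
import re

# Hypothetical configuration (cf. S301): state identifiers, the event each
# state responds to, and the state each event leads to.
CONFIG = {
    "initial": "start",
    "accepting": {"text"},
    "transitions": {
        ("start", "digit"): "number",
        ("number", "digit"): "number",
        ("number", "dot"): "dot",
        ("dot", "digit"): "number",
        ("number", "space"): "gap",
        ("gap", "space"): "gap",
        ("gap", "digit"): "text",
        ("gap", "dot"): "text",
        ("gap", "other"): "text",
        ("text", "digit"): "text",
        ("text", "dot"): "text",
        ("text", "space"): "text",
        ("text", "other"): "text",
    },
}


def event_for(char):
    """Map a character to the event name used in the configuration."""
    if char in "0123456789":
        return "digit"
    if char == ".":
        return "dot"
    if char == " ":
        return "space"
    return "other"


class StateMachine:
    """Table-driven state machine built from a configuration (cf. S302)."""

    def __init__(self, config):
        self.initial = config["initial"]
        self.accepting = config["accepting"]
        self.transitions = config["transitions"]

    def accepts(self, text):
        state = self.initial
        for char in text:
            state = self.transitions.get((state, event_for(char)))
            if state is None:  # no transition defined for this event: reject
                return False
        return state in self.accepting


# Hand-derived regular expression for this particular machine (cf. S303).
EQUIVALENT_REGEX = re.compile(r"[0-9]+(\.[0-9]+)* +[^ ].*", re.DOTALL)
```

Keeping the transitions in a plain table is one way to make the machine easy to store and to translate into another format, which is the role the description assigns to the state machine.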
In some embodiments of the present invention, the state machine is composed of a state register and a combinational logic circuit; it performs state transitions among preset states according to control signals, and serves as a control center that coordinates the actions of related signals and completes specific operations.
Preferably, depending on the actual application carrier, the state machine may be represented by data table entries, linked lists, instruction table entries, state diagrams, and the like, which is not limited in this embodiment.
Step four: traverse all the contents of the target document, compare and analyze the contents of the target document with the regular expression, extract all the titles of the target document, arrange all the titles according to the traversal sequence, and generate a document directory. According to an embodiment of the present invention, reading the target document, comparing the content of the target document with the regular expression, and extracting the titles of the target document includes the following steps:
S401, extracting one or more points of interest (POI) from the target document.
This step includes: obtaining data information of a POI object based on the regular expression, wherein the data information of the POI object at least includes a sentence pattern body; comparing the obtained data information of the POI object with the contents of the target document one by one; and extracting, as a POI, each part of the target document that follows the same rule as the data information of the POI object. For example, if the obtained data information of the POI object is X, Y, it is compared with the content of the target document, and any part of the content found to follow the same rule (such as a logic rule or a grammar rule) is extracted as a POI.
The POI is an open-source function library of the Apache Software Foundation (Apache POI), which provides APIs for Java programs to read and write files in Microsoft Office formats.
In the preferred embodiment of the present invention, the POI technology is used to read the text paragraph content of the target Word document. A Word document contains multiple paragraphs, a paragraph contains a collection of runs (Runs), and each run (Run) in that collection is the minimum unit of the target document. For example, a paragraph contains several complete sentences, namely the Runs; the Runs in turn contain several phrases, namely the individual Run. Specifically, the steps of reading Word text content through POI are as follows:
(1) first, operating on the XWPFParagraph objects in the XWPFDocument through POI to obtain all paragraphs of the target document;
(2) obtaining all Runs in a paragraph through the getRuns() command;
(3) obtaining a Run from the Runs through XWPFRun;
(4) traversing the whole document, and obtaining the titles in the Word document through the getPPr().getOutlineLvl() command.
Based on the above process, the content of the Word document is traversed and all the titles in the Word document are extracted.
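Apache POI is a Java library, and the sketch below does not use it. As an illustration of the same traversal (paragraphs, then runs, then the outline level), it reads the WordprocessingML inside a .docx file directly with Python's standard library. The function name `read_paragraphs` is hypothetical, and the sketch only sees an outline level that is set directly on the paragraph; a level inherited from a heading style is not resolved.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def read_paragraphs(docx_bytes):
    """Return (text, outline_level) for each paragraph of a .docx file.

    outline_level is an int when the paragraph carries w:outlineLvl in its
    own properties, otherwise None.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    result = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is the concatenation of the text of its runs.
        text = "".join(t.text or "" for t in para.findall("w:r/w:t", NS))
        lvl = para.find("w:pPr/w:outlineLvl", NS)
        level = int(lvl.get(f"{{{W_NS}}}val")) if lvl is not None else None
        result.append((text, level))
    return result
```

Paragraphs whose level is not None are the candidates for titles; the remaining paragraphs still need the regular-expression comparison described in the following steps.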
S402, extracting the content of the target document through the points of interest (POI), and identifying the outline structure of the target document. Identifying the outline structure of the target document means arranging all the extracted titles hierarchically and in order, according to semantic logic and their order of appearance, to form the outline structure.
S403, comparing the outline structure of the target document with the regular expression and performing matching analysis: if content in the target document matches the regular expression, that content is determined to be a title; if it does not match, it is determined to be body text.
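The decision in S403 can be sketched as a single match per paragraph. The regular expression `TITLE_REGEX` below is a hypothetical stand-in for the expression produced in step three, and `classify_paragraphs` is an illustrative name.

```python
import re

# Hypothetical regular expression standing in for the output of step three.
TITLE_REGEX = re.compile(r"^(Chapter\s+\d+|Section\s+\d+|\d+(\.\d+)+)\s+\S.*$")


def classify_paragraphs(paragraphs):
    """Label each paragraph 'title' or 'text', preserving traversal order."""
    return [
        ("title" if TITLE_REGEX.match(p.strip()) else "text", p.strip())
        for p in paragraphs
    ]
```

A purely pattern-based test of this kind will also accept body sentences that happen to start like a heading (for example a sentence beginning with "1.5 million"), which is one reason the description combines the regular expression with the outline structure.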
Step five: traverse all contents of the target document, extract all the titles, arrange all the titles according to the traversal sequence, and generate a document directory.
In a preferred embodiment of the invention, all paragraph contents of the Word document are traversed, titles in the document content are identified by comparison against the regular expression, the titles are extracted one by one in the order of the document content, and the extracted titles are assembled into a new document, namely the complete chapter directory of the Word document.
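Assembling the extracted titles into a directory can be sketched as follows. The example assumes the titles are already available as (level, title) pairs in traversal order, with level 1 as the highest level, and it emits an XML tree because the claims describe the directory as Extensible Markup Language. The element and attribute names (`directory`, `entry`, `level`, `title`) are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_directory(entries):
    """Nest (level, title) pairs, given in document order, into an XML tree.

    Levels are assumed to be integers >= 1, where 1 is the highest level.
    """
    root = ET.Element("directory")
    stack = [(0, root)]  # chain of (level, element) from the root to the latest entry
    for level, title in entries:
        # Climb back up until the top of the stack is a shallower level.
        while stack[-1][0] >= level:
            stack.pop()
        node = ET.SubElement(
            stack[-1][1], "entry", {"level": str(level), "title": title}
        )
        stack.append((level, node))
    return root
```

Calling `ET.tostring(build_directory(entries), encoding="unicode")` serializes the resulting tree to a string; because entries are attached in traversal order, the directory preserves the order of the document.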
Alternatively, in other embodiments, the document directory automatic generation program may be further divided into one or more modules, and the one or more modules are stored in the memory 11 and executed by one or more processors (in this embodiment, the processor 12) to implement the present invention.
For example, fig. 3 is a schematic diagram of the program modules of the document directory automatic generation program in an embodiment of the document directory automatic generation apparatus of the present invention. In this embodiment, the document directory automatic generation program may be divided into a data receiving and processing module 10, a regular expression configuration module 20, a model training module 30, and a document directory output module 40. Exemplarily:
The data receiving and processing module 10 is configured to: receive an initial title in a target document, and determine a title rule of the target document based on the initial title.
The regular expression configuration module 20 is configured to: configure a regular expression based on the trained title rule.
The model training module 30 is configured to: input the initial title rule into a pre-constructed generative adversarial network model for training to obtain the trained title rule.
The document directory output module 40 is configured to: receive a target document input by a user, determine a title rule of the target document, input the trained title rule and the configured regular expression into the document directory automatic generation model to generate a document directory, and output the document directory.
The functions or operation steps implemented by the data receiving and processing module 10, the regular expression configuration module 20, the model training module 30, the document directory output module 40 and other program modules are substantially the same as those of the above embodiments, and are not repeated herein.
Furthermore, an embodiment of the present invention further provides a computer-readable storage medium, where a document directory automatic generation program is stored on the computer-readable storage medium, where the document directory automatic generation program is executable by one or more processors to implement the following operations:
extracting an initial title in a target document, and determining a title rule of the target document based on the initial title;
performing word vectorization and word vector coding on the primary article data set and the primary abstract data set to obtain a training set and a label set, respectively;
inputting the initial title rule into a pre-constructed generative adversarial network model for training to obtain the trained title rule;
configuring a regular expression based on the trained title rule;
reading the target document, comparing and analyzing the content of the target document with the regular expression, and extracting the titles;
traversing all contents of the target document, extracting all the titles, arranging all the titles according to the traversal sequence, and generating a document directory.
It should be noted that the above numbering of the embodiments of the present invention is for description only and does not represent the relative merits of the embodiments. The terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, apparatus, article, or method that includes the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. An automatic generation method of a document directory is characterized by comprising the following steps:
extracting an initial title in a target document, and determining an initial title rule of the target document based on the initial title;
inputting the initial title rule into a pre-constructed generative adversarial network model for training to obtain a trained title rule;
generating a regular expression based on the trained title rule;
traversing all the contents of the target document, comparing and analyzing the contents of the target document with the regular expression, extracting all the titles of the target document, arranging all the titles according to the traversal sequence, and generating a document directory.
2. The document directory automatic generation method according to claim 1, wherein the document directory automatic generation method further comprises: constructing the generative adversarial network model, comprising:
establishing a generation model and a discrimination model;
and obtaining an optimal solution through mutual game learning between the generation model and the discrimination model, wherein the optimal solution comprises the trained title rule.
3. The method of automatically generating a document directory of claim 2, wherein prior to the generating the regular expression, the method of automatically generating a document directory further comprises:
generating a state machine based on the trained title rule;
wherein the generating a state machine comprises:
carrying out grammar analysis on the trained title rule, and rewriting the trained title rule into a state machine rule required by the state machine construction;
constructing a state machine according to the state machine rule;
and converting the constructed state machine into a format required by generating a regular expression and storing the format.
4. The method for automatically generating a document directory according to claim 3, wherein traversing all contents of the target document, comparing and analyzing the contents of the target document with the regular expression, and extracting all the titles of the target document comprises:
traversing the whole content of the target document, and extracting one or more interest points from the target document;
extracting the content of the target document through the interest points, and identifying the outline structure of the target document;
and comparing the outline structure of the target document with the regular expression, performing matching analysis, if the content in the target document is matched with the regular expression, confirming that the content in the target document is the title, extracting the title, and if the content in the target document is not matched with the regular expression, confirming that the content in the target document is the text.
5. The document directory automatic generation method according to any one of claims 1 to 4, characterized in that:
the document directory is expressed in Extensible Markup Language (XML);
the file format of the target document is Microsoft Office Word.
6. An apparatus for automatically generating a document directory, the apparatus comprising a memory and a processor, the memory having stored thereon a document directory automatic generation program operable on the processor, the document directory automatic generation program when executed by the processor implementing the steps of:
extracting an initial title in a target document, and determining an initial title rule of the target document based on the initial title;
inputting the initial title rule into a pre-constructed generative adversarial network model for training to obtain a trained title rule;
generating a regular expression based on the trained title rule;
traversing all the contents of the target document, comparing and analyzing the contents of the target document with the regular expression, extracting all the titles of the target document, arranging all the titles according to the traversal sequence, and generating a document directory.
7. The apparatus for automatically generating a document directory according to claim 6, wherein the steps further comprise: constructing the generative adversarial network model, comprising:
establishing a generation model and a discrimination model;
and obtaining an optimal solution through mutual game learning between the generation model and the discrimination model, wherein the optimal solution comprises the trained title rule.
8. The apparatus for automatically generating a document directory according to claim 7, wherein before the generating of the regular expression, the steps further comprise:
generating a state machine based on the trained title rule;
wherein the generating a state machine comprises:
carrying out grammar analysis on the trained title rule, and rewriting the trained title rule into a state machine rule required by the state machine construction;
constructing a state machine according to the state machine rule;
and converting the constructed state machine into a format required by generating a regular expression and storing the format.
9. The apparatus for automatically generating a document directory according to claim 8, wherein traversing all the contents of the target document, comparing and analyzing the contents of the target document with the regular expression, and extracting all the titles of the target document comprises:
traversing the whole content of the target document, and extracting one or more interest points from the target document;
extracting the content of the target document through the interest points, and identifying the outline structure of the target document;
and comparing the outline structure of the target document with the regular expression, performing matching analysis, if the content in the target document is matched with the regular expression, confirming that the content in the target document is the title, extracting the title, and if the content in the target document is not matched with the regular expression, confirming that the content in the target document is the text.
10. A computer-readable storage medium having stored thereon a document directory auto-generation program executable by one or more processors to perform the steps of the document directory auto-generation method according to any one of claims 1 to 5.
CN201910965809.0A 2019-10-11 2019-10-11 Document directory automatic generation method and device and computer readable storage medium Pending CN110852079A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910965809.0A CN110852079A (en) 2019-10-11 2019-10-11 Document directory automatic generation method and device and computer readable storage medium
PCT/CN2020/112346 WO2021068684A1 (en) 2019-10-11 2020-08-31 Method and apparatus for automatically generating document directory, computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910965809.0A CN110852079A (en) 2019-10-11 2019-10-11 Document directory automatic generation method and device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN110852079A true CN110852079A (en) 2020-02-28

Family

ID=69597308

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910965809.0A Pending CN110852079A (en) 2019-10-11 2019-10-11 Document directory automatic generation method and device and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN110852079A (en)
WO (1) WO2021068684A1 (en)

Also Published As

Publication number Publication date
WO2021068684A1 (en) 2021-04-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination