CN112183077A

CN112183077A - Pattern recognition-based official document abstract extraction method and system

Info

Publication number: CN112183077A
Application number: CN202011091166.0A
Authority: CN
Inventors: 蓝建敏; 池沐霖
Original assignee: Excellence Information Technology Co ltd
Current assignee: Excellence Information Technology Co ltd
Priority date: 2020-10-13
Filing date: 2020-10-13
Publication date: 2021-01-05

Abstract

The invention relates to a method and a system for extracting an official document abstract based on pattern recognition. The method comprises the following steps: acquiring an official document text from which target content is to be extracted; judging, according to a writing pattern, whether the official document text contains the target content, to obtain a judgment result; and, if the judgment result is yes, extracting the target content from the official document text. Because the target content is extracted according to the writing pattern, the method is accurate, well targeted and practical. It reduces the time spent reading large volumes of official documents and improves working efficiency.

Description

Pattern recognition-based official document abstract extraction method and system

Technical Field

The invention relates to the technical field of text extraction, and in particular to a method and a system for extracting an official document abstract based on pattern recognition.

Background

In general, text summarization technology uses a computer to process a text rapidly and summarize its core content automatically. The task of automatic summarization is to extract from a text the words, phrases and sentences that best summarize it, so that a user can judge the value of the text from the automatically summarized core content and obtain accurate information faster. Abstract extraction draws on a range of techniques, including natural-language word segmentation, statistics, domain ontologies, text relation graphs and association models.

By how the abstract is produced, text summarization divides into extractive methods and generative methods. By document type, it divides into single-document and multi-document summarization. Graph-based algorithms are a commonly used extractive approach at present: the text is segmented into words, an association graph of words and sentences is built with sentences as the unit, and the important nodes are selected according to graph-node characteristics to form the abstract. The representative algorithm is TextRank. Among generative models, deep-learning-based abstract generation is relatively representative: a large number of texts and their corresponding abstracts are assembled into a training set for supervised training. The representative algorithm is seq2seq with attention.
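For reference, the graph-based extractive baseline described above can be sketched roughly as follows. This is a minimal illustration of a TextRank-style sentence ranking, not the method of the invention. The function names, the naive sentence splitter, the word-overlap similarity and the damping value of 0.85 are all assumptions chosen for the sketch.

```python
import math
import re


def split_sentences(text):
    # Naive splitter (assumption): break after ., !, ? or their full-width forms.
    parts = re.findall(r"[^.!?。！？]+[.!?。！？]?", text)
    return [p.strip() for p in parts if p.strip()]


def similarity(a, b):
    # Word-overlap similarity normalised by log sentence sizes (TextRank style).
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # avoids log(1) + log(1) == 0 in the denominator
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def textrank(sentences, damping=0.85, iterations=50):
    # Build the weighted sentence graph, then run a fixed number of
    # PageRank-style updates over it.
    n = len(sentences)
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping)
            + damping
            * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(text, top_k=1):
    # Keep the top_k highest-scoring sentences, in their original order.
    sentences = split_sentences(text)
    scores = textrank(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_k])]
```

A sentence that shares no words with any other sentence receives only the base score `1 - damping`, so it ranks last. This also shows why such a baseline can miss a purpose or basis sentence whose vocabulary differs from the rest of a long document.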

Official documents are written materials that statutory authorities and organizations produce and use in official business, in a prescribed format and through prescribed processing procedures. Compared with media reports, official documents are long and highly abstract. If an existing algorithmic approach is applied, extracting words and short sentences from a long text and then generating an abstract often fails to reflect the content as a whole. Analysis of existing official document data shows that the drafter of an official document writes the abstract content into the document itself. The task of abstracting an official document is therefore reframed as finding, within the document, the one or more sentences that reflect its abstract.

Disclosure of Invention

The invention aims to provide a method and a system for extracting an official document abstract based on pattern recognition, which quickly and accurately extract the purpose and the basis of an official document to serve as the content of its abstract.

In order to achieve the purpose, the invention provides the following scheme:

A method for extracting an official document abstract based on pattern recognition comprises the following steps:

acquiring an official document text from which target content is to be extracted;

judging, according to a writing pattern, whether the official document text contains the target content, to obtain a judgment result;

and if the judgment result is yes, extracting the target content from the official document text.

Optionally, the writing pattern consists of sentence-pattern rules of different categories, derived from the structure and paragraphs of historical official documents.

Optionally, the writing pattern comprises the purpose of the document, the basis of the document and the content of the document.

Optionally, extracting the target content from the official document text specifically includes: extracting the target content from the official document text according to an extraction rule.

Optionally, if the judgment result is negative, no extraction is performed.
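The control flow of the method can be sketched as below. This is a minimal illustration under stated assumptions: the source does not specify how the judgment or the extraction rule is implemented, so both are passed in as callables. The names `extract_abstract`, `has_marker`, `first_matching_paragraph` and the English marker `"In order to"` are hypothetical.

```python
from typing import Callable, Optional


def extract_abstract(
    document_text: str,
    contains_target: Callable[[str], bool],
    extraction_rule: Callable[[str], str],
) -> Optional[str]:
    # Judge whether the text contains target content according to the writing pattern.
    judgment_result = contains_target(document_text)
    if not judgment_result:
        return None  # negative judgment: no extraction is performed
    # Positive judgment: extract the target content according to the extraction rule.
    return extraction_rule(document_text)


# Hypothetical stand-ins for a writing-pattern check and an extraction rule.
MARKER = "In order to"


def has_marker(text: str) -> bool:
    return MARKER in text


def first_matching_paragraph(text: str) -> str:
    # Only called after a positive judgment, so a matching paragraph exists.
    return next(p for p in text.split("\n") if MARKER in p)
```

For the made-up text `"Notice\nIn order to improve safety, this notice is issued.\nDetails follow."`, calling `extract_abstract` with the two stand-ins returns the second line, and a text without the marker yields `None`.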

A system for extracting an official document abstract based on pattern recognition comprises:

a text acquisition module, configured to acquire an official document text from which target content is to be extracted;

a judging module, configured to judge, according to a writing pattern, whether the official document text contains the target content, to obtain a judgment result;

and an extraction module, configured to extract the target content from the official document text when the official document text contains the target content.

Optionally, the extraction module includes an extraction unit, and the extraction unit is configured to extract the target content from the official document text according to an extraction rule.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

The invention discloses a method and a system for extracting an official document abstract based on pattern recognition. The method comprises the following steps: acquiring an official document text from which target content is to be extracted; judging, according to a writing pattern, whether the official document text contains the target content, to obtain a judgment result; and, if the judgment result is yes, extracting the target content from the official document text. Because the target content is extracted according to the writing pattern, the method is accurate, well targeted and practical. It reduces the time spent reading large volumes of official documents and improves working efficiency.

Drawings

To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and a person skilled in the art can derive other drawings from them without inventive effort.

Fig. 1 is a flowchart of the official document abstract extraction method based on pattern recognition according to embodiment 1 of the present invention;

Fig. 2 is a process diagram of the official document abstract extraction method based on pattern recognition according to embodiment 1 of the present invention;

Fig. 3 is a system block diagram of the official document abstract extraction system based on pattern recognition according to embodiment 2 of the present invention.

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention aims to provide a method and a system for extracting a document abstract based on pattern recognition, so as to quickly and accurately extract a target text, reduce time cost and improve working efficiency.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

Example 1

Fig. 1 is a flowchart of the official document abstract extraction method based on pattern recognition according to embodiment 1 of the present invention. As shown in Fig. 1, the method includes:

Step 101: acquire the official document text from which target content is to be extracted.

Step 102: judge, according to the writing pattern, whether the official document text contains the target content, to obtain a judgment result. Preferably, the writing pattern in this step consists of sentence-pattern rules of different categories, derived from the structure and paragraphs of historical official documents. The writing pattern comprises the purpose of the document, the basis of the document and the content of the document.

Step 103: if the judgment result is yes, extract the target content from the official document text. Specifically, the target content is extracted from the official document text according to an extraction rule.

In this embodiment, the method further includes:

Step 104: if the judgment result is negative, perform no extraction.

In addition, when an official document has been forwarded one or more times, the method of this embodiment can also locate the originally forwarded official document and match it against the three preset sentence-pattern templates: the purpose of the document, the basis of the document and the main content. Fig. 2 illustrates the process of the official document abstract extraction method based on pattern recognition according to embodiment 1 of the present invention.
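Locating the originally forwarded document could look like the sketch below. The source does not say how forwarding is represented, so the nested `OfficialDocument` structure, its `forwarded` field and the `locate_original` helper are all hypothetical. The sketch assumes each forwarding document wraps exactly one forwarded document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OfficialDocument:
    # Hypothetical representation: `forwarded` is the document that this
    # document forwards, or None if this is the original document.
    body: str
    forwarded: Optional["OfficialDocument"] = None


def locate_original(doc: OfficialDocument) -> OfficialDocument:
    # Follow the forwarding chain until the originally issued document is reached.
    while doc.forwarded is not None:
        doc = doc.forwarded
    return doc
```

The sentence-pattern rules would then be matched against `locate_original(doc).body` rather than against the short forwarding notice itself.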

The purpose, basis and main content of a document are explained below for this embodiment:

The purpose of the document is the principal reason for writing it, that is, what the issuing authority intends to achieve. For this sentence pattern, taking a document as an example: if the document contains text that conforms to the purpose sentence-pattern rule, the document is judged to contain the target content, and the paragraph containing the text that matches the rule is extracted.

The basis of the document is the grounds on which the document is issued. For this sentence pattern, taking a document as an example: if the document contains text that conforms to the basis sentence-pattern rule, the document is judged to contain the target content, and the paragraph containing the text that matches the rule is extracted.

The main content is a summary of the content of each section of the document and is generally the main structure of the document. For this sentence pattern, taking a document as an example: "(1), (... main content). (2), (... main content)." If the document contains text that conforms to the main-content sentence-pattern rule, the document is judged to contain the target content, and the paragraph containing the text that matches the rule is extracted.
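The three sentence-pattern rules can be illustrated with regular expressions, as in the sketch below. The concrete patterns are assumptions: the source does not reproduce its purpose and basis patterns, and real rules would be written against the original Chinese text. The English markers, the `SENTENCE_PATTERN_RULES` table and the helper names are hypothetical stand-ins, not the rules of the invention.

```python
import re

# Hypothetical sentence-pattern rules, one per category (illustrative only).
SENTENCE_PATTERN_RULES = {
    "purpose": re.compile(r"^(In order to|For the purpose of)\b", re.IGNORECASE),
    "basis": re.compile(r"^(According to|In accordance with|Pursuant to)\b", re.IGNORECASE),
    "main_content": re.compile(r"^\(\d+\)\s*\S"),
}


def match_paragraphs(document_text):
    # Collect, per category, every paragraph that conforms to that category's rule.
    hits = {name: [] for name in SENTENCE_PATTERN_RULES}
    for paragraph in document_text.splitlines():
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        for name, rule in SENTENCE_PATTERN_RULES.items():
            if rule.search(paragraph):
                hits[name].append(paragraph)
    return hits


def contains_target(hits):
    # The document is judged to contain target content if any rule matched.
    return any(hits.values())
```

Keeping one rule per category makes it possible to report the purpose, the basis and the main content separately, and to add rules derived from further historical documents without changing the matching loop.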

Example 2

Fig. 3 is a system block diagram of the official document abstract extraction system based on pattern recognition according to embodiment 2 of the present invention. As shown in Fig. 3, the system includes:

The text acquisition module 201, configured to acquire the official document text from which target content is to be extracted.

The judging module 202, configured to judge, according to the writing pattern, whether the official document text contains the target content, to obtain a judgment result.

The extraction module 203, configured to extract the target content from the official document text when the official document text contains the target content.

In this embodiment, the extraction module 203 includes an extraction unit, which is configured to extract the target content from the official document text according to an extraction rule.
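The module structure of Fig. 3 might be composed as in the following sketch. The class names mirror modules 201 to 203 and the extraction unit, but their interfaces are assumptions. The wrapper class `AbstractExtractionSystem` and the choice to express rules as compiled regular expressions are also hypothetical.

```python
import re
from typing import List, Optional


class TextAcquisitionModule:
    """Module 201: acquires the official document text (assumed already plain text)."""

    def acquire(self, source: str) -> str:
        return source.strip()


class JudgingModule:
    """Module 202: judges whether the text contains target content."""

    def __init__(self, rules: List[re.Pattern]):
        self.rules = rules

    def judge(self, text: str) -> bool:
        return any(rule.search(text) for rule in self.rules)


class ExtractionUnit:
    """Extraction unit: applies the extraction rule paragraph by paragraph."""

    def __init__(self, rules: List[re.Pattern]):
        self.rules = rules

    def extract(self, text: str) -> List[str]:
        return [
            p.strip()
            for p in text.splitlines()
            if any(rule.search(p) for rule in self.rules)
        ]


class ExtractionModule:
    """Module 203: delegates extraction to its extraction unit."""

    def __init__(self, unit: ExtractionUnit):
        self.unit = unit

    def extract(self, text: str) -> List[str]:
        return self.unit.extract(text)


class AbstractExtractionSystem:
    """Hypothetical wrapper that wires modules 201, 202 and 203 together."""

    def __init__(self, rules: List[re.Pattern]):
        self.acquisition = TextAcquisitionModule()
        self.judging = JudgingModule(rules)
        self.extraction = ExtractionModule(ExtractionUnit(rules))

    def run(self, source: str) -> Optional[List[str]]:
        text = self.acquisition.acquire(source)
        if not self.judging.judge(text):
            return None  # extraction happens only when target content is present
        return self.extraction.extract(text)
```

The rules passed to the system should be compiled with `re.MULTILINE` if they use a `^` anchor, because the judging module searches the whole text while the extraction unit searches one paragraph at a time.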

(1) The method is accurate, well targeted and practical. It is suited to extracting the important content of official documents and meets staff needs for extracting the important content of texts.

(2) The method establishes a set of extraction rules and extraction methods from sentence patterns and writing patterns. This addresses the low efficiency with which staff extract important content from texts, helps them extract important content quickly and accurately from massive amounts of text, shortens the time spent extracting text data manually, and improves the efficiency of file searching and official document handling.

The embodiments in this description are presented progressively: each embodiment focuses on its differences from the others, and the parts they share can be cross-referenced. Because the system disclosed in the embodiment corresponds to the method disclosed in the embodiment, it is described only briefly; refer to the description of the method for the relevant details.

Specific examples have been used herein to explain the principles and embodiments of the present invention; they are provided only to help in understanding the method and its core idea. A person skilled in the art may, following the idea of the present invention, vary the specific embodiments and the scope of application. In summary, the content of this description should not be construed as limiting the invention.

Claims

1. A method for extracting an official document abstract based on pattern recognition, characterized by comprising the following steps:

acquiring an official document text from which target content is to be extracted;

2. The method for extracting an official document abstract based on pattern recognition according to claim 1, wherein the writing pattern consists of sentence-pattern rules of different categories, derived from the structure and paragraphs of historical official documents.

3. The method for extracting an official document abstract based on pattern recognition according to claim 1 or 2, wherein the writing pattern comprises the purpose of the document, the basis of the document and the content of the document.

4. The method for extracting an official document abstract based on pattern recognition according to claim 1, wherein extracting the target content from the official document text specifically comprises: extracting the target content from the official document text according to an extraction rule.

5. The method for extracting an official document abstract based on pattern recognition according to claim 1, wherein if the judgment result is negative, no extraction is performed.

6. A system for extracting an official document abstract based on pattern recognition, characterized by comprising:

7. The system for extracting an official document abstract based on pattern recognition according to claim 6, wherein the extraction module comprises an extraction unit configured to extract the target content from the official document text according to an extraction rule.