CN112183077A - Mode recognition-based official document abstract extraction method and system - Google Patents

Publication number
CN112183077A
Authority
CN
China
Prior art keywords
official document
target content
text
abstracting
pattern recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011091166.0A
Other languages
Chinese (zh)
Inventor
蓝建敏
池沐霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Excellence Information Technology Co ltd
Original Assignee
Excellence Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-10-13
Filing date
2020-10-13
Publication date
2021-01-05
Application filed by Excellence Information Technology Co ltd
Priority to CN202011091166.0A
Publication of CN112183077A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention relates to a method and a system for extracting an official document abstract based on pattern recognition. The method comprises the following steps: acquiring the official document text from which target content is to be extracted; judging, according to a writing pattern, whether the official document text contains the target content, to obtain a judgment result; and if the judgment result is yes, extracting the target content from the official document text. Because the method extracts the target content according to the writing pattern, it is accurate, well targeted and practical. It reduces the time spent reading large volumes of official documents and improves working efficiency.

Description

Pattern recognition-based official document abstract extraction method and system
Technical Field
The invention relates to the technical field of text extraction, and in particular to a method and a system for extracting an official document abstract based on pattern recognition.
Background
In general, text summarization technology uses a computer to process a text rapidly and summarize its core content automatically. The task of automatic summarization is to extract from the text the words, phrases and sentences that best summarize it, so that a user can judge the value of the text from the automatically summarized core content and obtain accurate information more quickly. Abstract extraction draws on a range of techniques, including natural language word segmentation, statistics, domain ontologies, text relation graphs and association models.
By how the summary is produced, text summarization methods divide into extractive and generative methods; by document type, they divide into single-document and multi-document summarization. Among extractive models, graph-based algorithms are currently in common use: the text is segmented into words, a graph of the associations between words and sentences is built with the sentence as the unit, and the important nodes are selected according to the characteristics of the graph nodes to form the abstract. The representative algorithm is TextRank. Among generative models, deep-learning-based abstract generation is relatively representative: a large number of texts and their corresponding abstracts are prepared as a training set for supervised training. The representative algorithm is seq2seq with attention.
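The graph-based extractive approach described above can be illustrated with a minimal sketch. It is not the reference TextRank implementation: sentence splitting is a naive punctuation rule, plain word-overlap (Jaccard) similarity is swapped in for the edge weights, and the function names `split_sentences`, `similarity`, `textrank` and `summarize` are hypothetical. Chinese text would additionally need a word segmenter, which is omitted here.

```python
import re
from itertools import combinations


def split_sentences(text):
    """Split on sentence-ending punctuation (a deliberately naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def similarity(a, b):
    """Word-overlap (Jaccard) similarity between two sentences."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by power iteration over a weighted similarity graph."""
    n = len(sentences)
    weights = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        weights[i][j] = weights[j][i] = similarity(sentences[i], sentences[j])
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n if n else []
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional
            # to the weight of the edge between j and i.
            rank = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores


def summarize(text, top_k=1):
    """Return the top_k highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    scores = textrank(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_k])]
```

A sentence that shares no words with the others receives only the baseline score, so it ranks last. This is the behaviour the background section criticizes for long official documents: the ranking reflects word overlap, not where the drafter stated the purpose or basis.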
Official documents are written materials that statutory authorities and organizations form and use in the course of official business, in a specific format and through a defined handling procedure. Compared with media reports, official documents are long and highly abstract. If an existing mathematical algorithm is used to extract words and short sentences from such a long text and then generate an abstract, the result often fails to reflect the content as a whole. Our analysis of existing official document data found that the drafter of an official document already writes the abstract content into the document itself. The approach to official document summarization therefore changes to finding, within the official document, the one or more sentences that reflect its abstract.
Disclosure of Invention
The invention aims to provide a method and a system for extracting an official document abstract based on pattern recognition, which quickly and accurately extract the purpose and the basis of an official document to serve as the content of its abstract.
To achieve this purpose, the invention provides the following scheme:
A method for extracting an official document abstract based on pattern recognition comprises the following steps:
acquiring the official document text from which target content is to be extracted;
judging, according to a writing pattern, whether the official document text contains the target content, to obtain a judgment result;
and if the judgment result is yes, extracting the target content from the official document text.
Optionally, the writing pattern is a set of sentence-pattern rules of different categories obtained from the structure and paragraphs of historical official documents.
Optionally, the writing pattern comprises a writing purpose, a writing basis and writing content.
Optionally, extracting the target content from the official document text specifically comprises: extracting the target content from the official document text according to an extraction rule.
Optionally, if the judgment result is no, no extraction is performed.
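The text states that the sentence-pattern rules are obtained from the structure and paragraphs of historical official documents, but does not say how. One possible first step, shown here purely as an assumption, is to count how paragraphs open across a set of historical documents so that a person can review the frequent openers and turn them into rules. The function name `candidate_openers` and its parameters are hypothetical.

```python
from collections import Counter


def candidate_openers(historical_docs, n_words=2, min_count=2):
    """Count the leading words of each paragraph across historical documents.

    Frequent openers are only candidates; a person would still review them
    and assign each one to a category before turning it into a rule.
    """
    counts = Counter()
    for doc in historical_docs:
        for paragraph in doc.split("\n"):
            words = paragraph.strip().lower().split()
            if len(words) >= n_words:
                counts[" ".join(words[:n_words])] += 1
    return [(opener, c) for opener, c in counts.most_common() if c >= min_count]
```

Called on a small collection of documents, it returns `(opener, count)` pairs sorted by frequency, which can then be grouped by hand into the categories named above.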
A system for extracting an official document abstract based on pattern recognition comprises:
a text acquisition module, configured to acquire the official document text from which target content is to be extracted;
a judging module, configured to judge, according to the writing pattern, whether the official document text contains the target content, to obtain a judgment result;
and an extraction module, configured to extract the target content from the official document text when the official document text contains the target content.
Optionally, the extraction module includes an extraction unit configured to extract the target content from the official document text according to an extraction rule.
According to the specific embodiments provided by the invention, the invention discloses the following technical effects:
The invention discloses a method and a system for extracting an official document abstract based on pattern recognition. The method comprises the following steps: acquiring the official document text from which target content is to be extracted; judging, according to a writing pattern, whether the official document text contains the target content, to obtain a judgment result; and if the judgment result is yes, extracting the target content from the official document text. Because the method extracts the target content according to the writing pattern, it is accurate, well targeted and practical. It reduces the time spent reading large volumes of official documents and improves working efficiency.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flowchart of the official document abstract extraction method based on pattern recognition according to embodiment 1 of the present invention;
Fig. 2 is a process diagram of the official document abstract extraction method based on pattern recognition according to embodiment 1 of the present invention;
Fig. 3 is a system block diagram of the official document abstract extraction system based on pattern recognition according to embodiment 2 of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort fall within the protection scope of the present invention.
The invention aims to provide a method and a system for extracting an official document abstract based on pattern recognition, so as to extract the target text quickly and accurately, reduce time cost and improve working efficiency.
To make the above objects, features and advantages of the present invention easier to understand, the invention is described in further detail below with reference to the drawings and specific embodiments.
Embodiment 1
Fig. 1 is a flowchart of the official document abstract extraction method based on pattern recognition according to embodiment 1 of the present invention. As shown in Fig. 1, the method includes:
Step 101: acquire the official document text from which target content is to be extracted.
Step 102: judge, according to the writing pattern, whether the official document text contains the target content, to obtain a judgment result. Preferably, the writing pattern in this step is a set of sentence-pattern rules of different categories obtained from the structure and paragraphs of historical official documents. The writing pattern comprises a writing purpose, a writing basis and writing content.
Step 103: if the judgment result is yes, extract the target content from the official document text. Specifically, the target content is extracted from the official document text according to an extraction rule.
In this embodiment, the method further includes:
Step 104: if the judgment result is no, do not extract.
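The control flow of steps 101 to 104 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: a rule is modeled as any callable that reports whether a paragraph matches a writing pattern, paragraphs are assumed to be separated by newlines, and all function names are hypothetical.

```python
from typing import Callable, List, Optional

# A rule is any callable that reports whether a paragraph matches a writing
# pattern. The concrete rules are not given in the text; these are stand-ins.
Rule = Callable[[str], bool]


def acquire_text(source: str) -> str:
    """Step 101: obtain the official document text (here, already a string)."""
    return source.strip()


def contains_target(text: str, rules: List[Rule]) -> bool:
    """Step 102: judge whether any paragraph matches any writing-pattern rule."""
    return any(rule(p) for p in text.split("\n") for rule in rules)


def extract_target(text: str, rules: List[Rule]) -> List[str]:
    """Step 103: extract the paragraphs that match a rule."""
    return [p for p in text.split("\n") if any(rule(p) for rule in rules)]


def extract_abstract(source: str, rules: List[Rule]) -> Optional[List[str]]:
    """Run steps 101 to 104 and return the extracted paragraphs, or None."""
    text = acquire_text(source)
    if contains_target(text, rules):  # judgment result is yes
        return extract_target(text, rules)
    return None  # Step 104: judgment result is no, so nothing is extracted
```

Returning `None` rather than an empty list keeps the "no extraction" outcome of step 104 distinct from an extraction that happened to yield nothing.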
In addition, when an official document has been forwarded one or more times, the method provided by this embodiment can also locate the originally forwarded official document and match it against the three sentence-pattern rules, using the three preset sentence-pattern templates for the writing purpose, the writing basis and the main content. Fig. 2 is a process diagram of the official document abstract extraction method based on pattern recognition according to embodiment 1 of the present invention.
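How the originally forwarded document is located is not described. The sketch below assumes, hypothetically, that each forwarding document holds a reference to the document it forwards, so that the original sits at the end of a chain. The class `OfficialDocument`, its `forwarded` field and the function `locate_original` are illustrative names only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OfficialDocument:
    text: str
    # Hypothetical field: the document that this one forwards, if any.
    forwarded: Optional["OfficialDocument"] = None


def locate_original(doc: OfficialDocument) -> OfficialDocument:
    """Follow the forwarding chain down to the originally forwarded document."""
    while doc.forwarded is not None:
        doc = doc.forwarded
    return doc
```

Sentence-pattern matching would then run on `locate_original(doc).text` rather than on the text of the outermost forwarding notice.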
The writing purpose, the writing basis and the writing content in this embodiment are explained as follows:
The writing purpose is the primary purpose for which a document is written, that is, what the issuing authority intends to achieve. For example, if a document contains text that conforms to the writing-purpose sentence-pattern rule, the official document is judged to contain the target content, and the paragraph containing the text that conforms to the rule is extracted.
The writing basis is the basis on which the document is issued. For example, if a document contains text that conforms to the writing-basis sentence-pattern rule, the official document is judged to contain the target content, and the paragraph containing the text that conforms to the rule is extracted.
The main content (the writing content) is a summary of the content of each section of the document, and generally corresponds to the document's main structure. For example, if a document contains text of the form "(1) (... main content). (2) (... main content)." that conforms to the main-content sentence-pattern rule, the official document is judged to contain the target content, and the paragraph containing the text that conforms to the rule is extracted.
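The three sentence patterns can be sketched as regular expressions. The actual templates are not given in the text, so the patterns below are hypothetical English stand-ins: real rules would be written for Chinese official documents and derived from historical data. The names `SENTENCE_PATTERNS` and `extract_by_pattern` are assumptions, and paragraphs are assumed to be separated by newlines.

```python
import re

# Hypothetical English stand-ins for the three sentence-pattern templates.
SENTENCE_PATTERNS = {
    "writing purpose": re.compile(r"^\s*(In order to|For the purpose of)\b", re.IGNORECASE),
    "writing basis": re.compile(r"^\s*(According to|In accordance with)\b", re.IGNORECASE),
    # Numbered items such as "(1) ..." and "(2) ..." mark the main content.
    "main content": re.compile(r"^\s*\(\d+\)\s*\S"),
}


def extract_by_pattern(document_text):
    """Return {category: [matching paragraphs]} for each pattern that matches."""
    found = {}
    for paragraph in document_text.split("\n"):
        for category, pattern in SENTENCE_PATTERNS.items():
            if pattern.search(paragraph):
                found.setdefault(category, []).append(paragraph.strip())
    return found
```

An empty result corresponds to a judgment that the document contains no target content; a non-empty result gives the extracted paragraphs grouped by category.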
Embodiment 2
Fig. 3 is a system block diagram of the official document abstract extraction system based on pattern recognition according to embodiment 2 of the present invention. As shown in Fig. 3, the system includes:
A text acquisition module 201, configured to acquire the official document text from which target content is to be extracted.
A judging module 202, configured to judge, according to the writing pattern, whether the official document text contains the target content, to obtain a judgment result.
An extraction module 203, configured to extract the target content from the official document text when the official document text contains the target content.
In this embodiment, the extraction module 203 includes an extraction unit, which is configured to extract the target content from the official document text according to an extraction rule.
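The module structure of embodiment 2 can be sketched as a set of small classes. This is one possible arrangement, not the disclosed system: the class names mirror modules 201 to 203 and the extraction unit, while `AbstractExtractionSystem`, the use of regular expressions as writing patterns and the newline-separated paragraphs are all assumptions.

```python
import re


class TextAcquisitionModule:
    """Module 201: obtains the official document text."""

    def acquire(self, source):
        return source.strip()


class JudgingModule:
    """Module 202: judges whether the text contains target content."""

    def __init__(self, patterns):
        # MULTILINE lets "^" match at the start of every paragraph line.
        self.patterns = [re.compile(p, re.MULTILINE) for p in patterns]

    def judge(self, text):
        return any(p.search(text) for p in self.patterns)


class ExtractionUnit:
    """Applies the extraction rule: keep paragraphs that match a pattern."""

    def __init__(self, patterns):
        self.patterns = [re.compile(p, re.MULTILINE) for p in patterns]

    def extract(self, text):
        return [
            para for para in text.split("\n")
            if any(p.search(para) for p in self.patterns)
        ]


class ExtractionModule:
    """Module 203: delegates the extraction to its extraction unit."""

    def __init__(self, unit):
        self.unit = unit

    def extract(self, text):
        return self.unit.extract(text)


class AbstractExtractionSystem:
    """Wires the three modules together in the order shown in Fig. 3."""

    def __init__(self, patterns):
        self.acquirer = TextAcquisitionModule()
        self.judge = JudgingModule(patterns)
        self.extractor = ExtractionModule(ExtractionUnit(patterns))

    def run(self, source):
        text = self.acquirer.acquire(source)
        if not self.judge.judge(text):
            return []
        return self.extractor.extract(text)
```

Keeping the extraction unit separate from the extraction module means the extraction rule can be replaced without touching the judging logic.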
According to the specific embodiments provided by the invention, the invention discloses the following technical effects:
(1) The method is accurate, well targeted and practical; it is suited to extracting the important content of official documents and meets the needs of staff who extract the important content of texts.
(2) The method establishes a set of extraction rules and extraction methods from sentence patterns and writing patterns. This addresses the low efficiency with which staff extract the important content of texts, helps them extract important content quickly and accurately from massive text information, shortens the time spent extracting text data manually, and improves the efficiency of file retrieval and official document handling.
The embodiments in this description are described progressively: each embodiment focuses on its differences from the others, and the parts that are the same or similar can be understood by cross-reference. Because the system disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief; for the relevant points, refer to the description of the method.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help in understanding the method and core concept of the present invention. A person skilled in the art may, following the idea of the present invention, change the specific embodiments and the scope of application. In view of the above, the content of this description should not be construed as limiting the invention.

Claims (7)

1. A method for extracting an official document abstract based on pattern recognition, characterized by comprising the following steps:
acquiring the official document text from which target content is to be extracted;
judging, according to a writing pattern, whether the official document text contains the target content, to obtain a judgment result;
and if the judgment result is yes, extracting the target content from the official document text.
2. The method for extracting an official document abstract based on pattern recognition of claim 1, wherein the writing pattern is a set of sentence-pattern rules of different categories obtained from the structure and paragraphs of historical official documents.
3. The method for extracting an official document abstract based on pattern recognition of claim 1 or 2, wherein the writing pattern comprises a writing purpose, a writing basis and writing content.
4. The method for extracting an official document abstract based on pattern recognition of claim 1, wherein extracting the target content from the official document text specifically comprises: extracting the target content from the official document text according to an extraction rule.
5. The method for extracting an official document abstract based on pattern recognition of claim 1, wherein if the judgment result is no, no extraction is performed.
6. A system for extracting an official document abstract based on pattern recognition, characterized by comprising:
a text acquisition module, configured to acquire the official document text from which target content is to be extracted;
a judging module, configured to judge, according to the writing pattern, whether the official document text contains the target content, to obtain a judgment result;
and an extraction module, configured to extract the target content from the official document text when the official document text contains the target content.
7. The system for extracting an official document abstract based on pattern recognition of claim 6, wherein the extraction module comprises an extraction unit configured to extract the target content from the official document text according to an extraction rule.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011091166.0A CN112183077A (en) 2020-10-13 2020-10-13 Mode recognition-based official document abstract extraction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011091166.0A CN112183077A (en) 2020-10-13 2020-10-13 Mode recognition-based official document abstract extraction method and system

Publications (1)

Publication Number Publication Date
CN112183077A true CN112183077A (en) 2021-01-05

Family

ID=73949554

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011091166.0A Pending CN112183077A (en) 2020-10-13 2020-10-13 Mode recognition-based official document abstract extraction method and system

Country Status (1)

Country Link
CN (1) CN112183077A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101458718A (en) * 2009-01-05 2009-06-17 北京大学 Search engine dynamic summarization extracting method
CN101692240A (en) * 2009-08-14 2010-04-07 北京中献电子技术开发中心 Rule-based method for patent abstract automatic extraction and keyword indexing
CN104503958A (en) * 2014-11-19 2015-04-08 百度在线网络技术(北京)有限公司 Method and device for generating document summarization
CN106407182A (en) * 2016-09-19 2017-02-15 国网福建省电力有限公司 A method for automatic abstracting for electronic official documents of enterprises
CN109522402A (en) * 2018-10-22 2019-03-26 国家电网有限公司 A kind of abstract extraction method and storage medium based on power industry characteristic key words
CN110069623A (en) * 2017-12-06 2019-07-30 腾讯科技(深圳)有限公司 Summary texts generation method, device, storage medium and computer equipment
CN110119444A (en) * 2019-04-23 2019-08-13 中电科大数据研究院有限公司 A kind of official document summarization generation model that extraction-type is combined with production
CN111460131A (en) * 2020-02-18 2020-07-28 平安科技(深圳)有限公司 Method, device and equipment for extracting official document abstract and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
苏海菊等: "中文科技文献文摘的自动编写", 《情报学报》 *

Similar Documents

Publication Publication Date Title
CN104408078B (en) A kind of bilingual Chinese-English parallel corpora base construction method based on keyword
CN104636466B (en) Entity attribute extraction method and system for open webpage
CN103020230A (en) Semantic fuzzy matching method
US20180357207A1 (en) Evaluating documents with embedded mathematical expressions
Cao et al. Machine learning based detection of clickbait posts in social media
CN104978332A (en) UGC label data generating method, UGC label data generating device, relevant method and relevant device
CN115186654B (en) Method for generating document abstract
Alsallal et al. Intrinsic plagiarism detection using latent semantic indexing and stylometry
CN114118053A (en) Contract information extraction method and device
CN111199151A (en) Data processing method and data processing device
CN101271448A (en) Chinese language fundamental noun phrase recognition, its regulation generating method and apparatus
Uddin et al. A study on text summarization techniques and implement few of them for Bangla language
CN117034327A (en) E-book content encryption protection method
Park et al. Automatic analysis of thematic structure in written English
CN104572628B (en) A kind of science based on syntactic feature defines automatic extraction system and method
CN112183077A (en) Mode recognition-based official document abstract extraction method and system
CN107818078B (en) Semantic association and matching method for Chinese natural language dialogue
Liu et al. Japanese named entity recognition for question answering system
Deshmukh et al. Sentiment analysis of Marathi language
Sidhu et al. Role of machine translation and word sense disambiguation in natural language processing
Jassem et al. Automatic summarization of polish news articles by sentence selection
Ba-Alwi et al. Arabic text summarization using latent semantic analysis
KR101240330B1 (en) System and method for mutidimensional document classification
Kumar et al. A comparative analysis of sarcasm detection
Yang et al. The construction of a kind of chat corpus in Chinese word segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210105