CN112686012A

CN112686012A - Document feature extraction method, device, equipment and medium

Info

Publication number: CN112686012A
Application number: CN202011253863.1A
Authority: CN
Inventors: 黄敬林; 庄莉; 梁懿; 林振天; 池少宁; 翁明东
Original assignee: State Grid Information and Telecommunication Co Ltd; Fujian Yirong Information Technology Co Ltd; Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd
Current assignee: State Grid Information and Telecommunication Co Ltd; Fujian Yirong Information Technology Co Ltd; Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd
Priority date: 2020-11-11
Filing date: 2020-11-11
Publication date: 2021-04-20
Anticipated expiration: 2040-11-11
Also published as: CN112686012B

Abstract

The invention discloses an official document feature extraction method comprising an official document extraction template definition process and an official document feature extraction process. An official document extraction template is customized through Extensible Markup Language (XML); the template comprises an attachment extraction identification tag, a sentence splitting rule tag, and at least one feature field extraction tag. Then, according to the attachment extraction identification tag in the template, either the official document body alone or the body together with its attachments is acquired as the official document to be extracted, and that document is converted into XML-format content. The document to be extracted is split into sentences according to the sentence splitting rule tag, and feature fields are extracted sentence by sentence according to the feature field extraction tags and output. By defining an official document feature extraction template, the method, device, equipment and medium realize feature extraction from unstructured official documents in a building-block manner and greatly reduce the difficulty of official document feature extraction.

Description

Document feature extraction method, device, equipment and medium

Technical Field

The invention relates to the technical field of official document management, in particular to an official document feature extraction method, device, equipment and medium.

Background

Official documents are written materials in a specific format that statutory authorities and organizations create and use in official business activities through prescribed processing procedures. In both professional work and administrative affairs, official documents are used to convey policies and handle business, so that relationships are coordinated and decisions are carried out correctly and efficiently. Official document feature extraction is a principal means of analyzing official document content in depth, and as artificial intelligence, natural language processing, text mining and related technologies continue to develop, the available extraction methods have become increasingly varied.

At present, official document feature extraction systems are largely limited to organizing existing metadata into a complete metadata standard, and no system is available for extracting features from the unstructured files of official documents. Existing methods form official document feature information by analyzing existing metadata and extracting content from unstructured paragraphs, relying mainly on techniques such as keyword extraction and word segmentation.

Existing official document feature extraction systems are characterized by complex code and low reusability. Separate code must be written to extract features from each kind of official document, which consumes a large amount of manpower. The extracted feature information is difficult to present visually, and the code is difficult to debug and modify, which creates great difficulty and hard-to-overcome business bottlenecks for intelligent applications of official documents.

Disclosure of Invention

The technical problem to be solved by the invention is to provide an official document feature extraction method, device, equipment and medium that, by defining an official document feature extraction template, realize feature extraction from unstructured official documents in a building-block manner and greatly reduce the difficulty of official document feature extraction.

In a first aspect, the present invention provides an official document feature extraction method, including: an official document extraction template definition process and an official document feature extraction process;

the official document extraction template definition process comprises the following steps:

customizing an official document extraction template through Extensible Markup Language (XML), wherein the official document extraction template comprises: an attachment extraction identification tag, a sentence splitting rule tag, and at least one feature field extraction tag; the attachment extraction identification tag defines whether the attachments of the official document are included when features are extracted; the sentence splitting rule tag defines the sentence splitting rule for the official document; the feature field extraction tag defines which official document feature field is extracted and how it is extracted;

the official document feature extraction process comprises the following steps:

acquiring, according to the attachment extraction identification tag in the official document extraction template, either the official document body or the official document body together with its attachments as the official document to be extracted, and converting the official document to be extracted into XML-format content;

splitting the official document to be extracted into sentences according to the sentence splitting rule tag;

and extracting and outputting the feature fields sentence by sentence according to the feature field extraction tags.
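The two processes above can be illustrated with a minimal Python sketch. It is not the patented implementation: the element and attribute names (`template`, `attachment`, `include`, `sentences`, `field`, `name`, `regular`), the `document`/`body` wrapper, and the sample text are all assumptions chosen for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical template: tag and attribute names are illustrative assumptions.
TEMPLATE = r"""
<template>
  <attachment include="false"/>
  <sentences regular="[;。]"/>
  <field name="doc_number" regular="number (\d{4}-\d+)"/>
  <field name="effective_date" regular="effect on (\d{4}-\d{2}-\d{2})"/>
</template>
"""

BODY = "Document number 2020-15 is issued by HeadOffice; it takes effect on 2021-01-01;"


def to_xml(body, attachments, include_attachments):
    """Wrap the body (and optionally the attachments) in an XML document."""
    root = ET.Element("document")
    ET.SubElement(root, "body").text = body
    if include_attachments:
        for text in attachments:
            ET.SubElement(root, "attachment").text = text
    return root


def extract(template_xml, body, attachments=()):
    """Run the three extraction steps driven by the template."""
    tpl = ET.fromstring(template_xml)
    include = tpl.find("attachment").get("include") == "true"
    splitter = re.compile(tpl.find("sentences").get("regular"))
    fields = [(f.get("name"), re.compile(f.get("regular")))
              for f in tpl.findall("field")]

    # Step 1: choose body or body plus attachments, then convert to XML.
    doc = to_xml(body, attachments, include)
    results = []
    for part in doc:  # body first, then any attachments
        # Step 2: split into sentences with the template's rule.
        for sentence in splitter.split(part.text or ""):
            sentence = sentence.strip()
            if not sentence:
                continue
            # Step 3: apply every field rule to the sentence.
            for name, pattern in fields:
                match = pattern.search(sentence)
                if match:
                    results.append((name, match.group(1)))
    return results


print(extract(TEMPLATE, BODY))
```

Because the rules live in the template rather than in the code, changing what is extracted in this sketch only requires editing the XML, which is the separation the method aims for.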

Further, the feature field extraction tag includes: a sentence-based feature field extraction tag, a grouped feature field extraction tag, and a paragraph-split extraction tag;

the sentence-based feature field extraction tag is used for extracting official document feature field information sentence by sentence, and is defined by a regular expression;

the grouped feature field extraction tag is used for extracting official document feature field information in groups, and may nest sentence-based feature field extraction rules;

the paragraph-split extraction tag is used for extracting official document feature field information by splitting paragraphs, and may nest sentence-based feature field extraction rules.
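A paragraph-level rule that nests sentence-level rules might be modelled as in the following sketch. The class names `SentenceRule` and `ParagraphRule`, the start and end marker patterns, and the choice to skip the marker sentences themselves are assumptions made for illustration, not details taken from the invention.

```python
import re
from dataclasses import dataclass, field


@dataclass
class SentenceRule:
    """Extracts one named field from a single sentence via a regular expression."""
    name: str
    pattern: str

    def apply(self, sentence):
        match = re.search(self.pattern, sentence)
        return [(self.name, match.group(1))] if match else []


@dataclass
class ParagraphRule:
    """Selects the sentences between two markers and applies nested rules to them."""
    start: str
    end: str
    nested: list = field(default_factory=list)

    def apply(self, sentences):
        results, inside = [], False
        for sentence in sentences:
            if not inside and re.search(self.start, sentence):
                inside = True
                continue  # the start marker itself is skipped in this sketch
            if inside and re.search(self.end, sentence):
                break  # stop at the end marker
            if inside:
                for rule in self.nested:
                    results.extend(rule.apply(sentence))
        return results


sentences = [
    "Preamble",
    "Recipients:",
    "Copy to Finance Department",
    "Copy to Legal Department",
    "End of recipients",
    "Copy to Nobody",
]
rule = ParagraphRule(
    start=r"^Recipients:",
    end=r"^End of recipients",
    nested=[SentenceRule("recipient", r"Copy to (.+)")],
)
print(rule.apply(sentences))
```

The last sentence is ignored even though it matches the nested pattern, because it lies outside the span delimited by the paragraph rule.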

Further, extracting and outputting the feature fields sentence by sentence according to the feature field extraction tags specifically includes: matching each sentence of the official document to be extracted against each feature field extraction tag; when a match succeeds, extracting the feature field according to that feature field extraction tag; and then sorting and outputting the extraction results.

In a second aspect, the present invention provides an official document feature extraction device, including: an official document extraction template definition module and an official document feature extraction module;

the official document extraction template definition module is used for customizing the official document extraction template through Extensible Markup Language (XML), and the official document extraction template comprises: an attachment extraction identification tag, a sentence splitting rule tag, and at least one feature field extraction tag; the attachment extraction identification tag defines whether the attachments of the official document are included when features are extracted; the sentence splitting rule tag defines the sentence splitting rule for the official document; the feature field extraction tag defines which official document feature field is extracted and how it is extracted;

the official document feature extraction module is used for acquiring, according to the attachment extraction identification tag in the official document extraction template, either the official document body or the official document body together with its attachments as the official document to be extracted, and converting the official document to be extracted into XML-format content.

In a third aspect, the present invention provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of the first aspect when executing the program.

In a fourth aspect, the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method of the first aspect.

One or more technical solutions provided in the embodiments of the present invention have at least the following technical effects or advantages:

1. by introducing a custom official document feature extraction template written in Extensible Markup Language (XML), the expression of official document feature extraction is separated from the code; templating the extraction method makes official document feature extraction visual and easy to operate;

2. by providing an attachment extraction identification tag, a sentence splitting rule tag, and at least one feature field extraction tag, feature extraction from unstructured official documents is realized in a building-block manner, which greatly reduces the difficulty of official document feature extraction;

3. machine feature extraction replaces manual feature extraction, which improves extraction quality, greatly lowers the threshold for feature extraction in the system, effectively improves the efficiency of official document feature extraction, and lays a solid foundation for intelligent applications of official documents.

The foregoing is only an overview of the technical solutions of the present invention. To make the technical means of the invention clearer, and to make the above and other objects, features and advantages easier to understand, specific embodiments of the invention are described below.

Drawings

The invention is further described below through embodiments with reference to the accompanying drawings.

FIG. 1 is a schematic flow chart of a method according to a first embodiment of the present invention;

FIG. 2 is a flowchart illustrating an official document extraction template definition process according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating a document feature extraction process according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of an apparatus according to a second embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a medium according to a fourth embodiment of the present invention.

Detailed Description

Example one

This embodiment provides an official document feature extraction method, as shown in FIG. 1, including: an official document extraction template definition process and an official document feature extraction process;

the official document feature extraction process comprises the following steps:

By introducing a custom official document feature extraction template written in Extensible Markup Language (XML), the expression of official document feature extraction is separated from the code; templating the extraction method makes official document feature extraction visual and easy to operate. By providing an attachment extraction identification tag, a sentence splitting rule tag, and at least one feature field extraction tag, feature extraction from unstructured official documents is realized in a building-block manner, which greatly reduces the difficulty of official document feature extraction.

In one possible implementation, as shown in FIG. 2, the feature field extraction tag includes: a sentence-based feature field extraction tag, a grouped feature field extraction tag, and a paragraph-split extraction tag;

The attachment extraction identification tag may be defined in the format <attachment num="0" interval="10"/>;

the sentence splitting rule tag may be defined in the format <sensens regular="regular expression"/>;

the sentence-based feature field extraction tag may be defined by the format of < regular name = x1, x2,. - > regular top-subject = x'/>;

the grouped feature field extraction tag may be defined by a regular rg-group type, a regular name, a regular no-checker-regular, a regular next-regular id, a regular top-paragraph, an x-unit/type format;

the paragraph-split extraction tag may be defined in the format <regular flag-regular end-regular="regular" name="sOrgs" regular="all" next-regular-id="c-fs-unit" merge="true" include-start="true" include-end="true"/>.
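Tags of this kind can be read with any standard XML parser. The sketch below parses two of the tag formats with Python's `xml.etree.ElementTree`, assuming ordinary XML attribute syntax (`name="value"`). The values given to `flag-regular` and `end-regular` are hypothetical placeholders, the helper `parse_tag` is illustrative, and no particular meaning is asserted for `num`, `interval`, `merge`, `include-start` or `include-end`.

```python
import xml.etree.ElementTree as ET

ATTACHMENT_TAG = '<attachment num="0" interval="10"/>'

# The flag-regular and end-regular values are hypothetical placeholders.
PARAGRAPH_TAG = (
    '<regular flag-regular="^Recipients" end-regular="^End" name="sOrgs" '
    'regular="all" next-regular-id="c-fs-unit" merge="true" '
    'include-start="true" include-end="true"/>'
)


def parse_tag(xml_text):
    """Return the tag name and its attributes, converting obvious literals."""
    element = ET.fromstring(xml_text)
    attrs = {}
    for key, value in element.attrib.items():
        if value in ("true", "false"):
            attrs[key] = value == "true"   # boolean flag
        elif value.isdigit():
            attrs[key] = int(value)        # non-negative integer
        else:
            attrs[key] = value             # leave patterns and ids as strings
    return element.tag, attrs


print(parse_tag(ATTACHMENT_TAG))
print(parse_tag(PARAGRAPH_TAG))
```

Converting the attribute strings once at load time keeps the later extraction code free of string comparisons such as `== "true"`.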

In one possible implementation, as shown in FIG. 3, extracting and outputting the feature fields sentence by sentence according to the feature field extraction tags specifically includes: matching each sentence of the official document to be extracted against each feature field extraction tag; when a match succeeds, extracting the feature field according to that feature field extraction tag; and then sorting and outputting the extraction results.

When the feature field extraction tag is a grouped feature field extraction tag or a paragraph-split extraction tag that nests sentence-based feature field extraction rules, the nested sentence-based rules are invoked in sequence to extract features, and the results are sorted and output.
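The dispatch between plain and nested rules might look like the following sketch. The dictionary layout of a rule (`name`, `regular`, `nested`), the use of a group-level pattern as a gate, and the decision to order the output by field name are all assumptions made for illustration.

```python
import re


def apply_sentence_rule(rule, sentence):
    """Apply one sentence-level rule; return {} when it does not match."""
    match = re.search(rule["regular"], sentence)
    return {rule["name"]: match.group(1)} if match else {}


def apply_rule(rule, sentence):
    """Apply a plain sentence rule, or every nested rule of a group in order."""
    if "nested" not in rule:
        return apply_sentence_rule(rule, sentence)
    if not re.search(rule["regular"], sentence):
        return {}  # the group does not apply to this sentence
    merged = {}
    for child in rule["nested"]:  # nested rules are invoked in sequence
        merged.update(apply_sentence_rule(child, sentence))
    return merged


def extract(rules, sentences):
    """Match every sentence against every rule and collect values per field."""
    collected = {}
    for sentence in sentences:
        for rule in rules:
            for name, value in apply_rule(rule, sentence).items():
                collected.setdefault(name, []).append(value)
    return dict(sorted(collected.items()))  # output ordered by field name


rules = [
    {"name": "title", "regular": r"^Subject: (.+)$"},
    {"regular": r"^Meeting", "nested": [
        {"name": "date", "regular": r"on (\d{4}-\d{2}-\d{2})"},
        {"name": "place", "regular": r"\bin (\w+)"},
    ]},
]
sentences = [
    "Subject: Annual safety review",
    "Meeting held on 2020-11-11 in Fuzhou",
]
result = extract(rules, sentences)
print(result)
```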

Based on the same inventive concept, the application also provides a device corresponding to the method in the first embodiment, which is detailed in the second embodiment.

Example two

This embodiment provides an official document feature extraction device, as shown in FIG. 4, including: an official document extraction template definition module and an official document feature extraction module;

In one possible implementation, the feature field extraction tag includes: a sentence-based feature field extraction tag, a grouped feature field extraction tag, and a paragraph-split extraction tag;

In one possible implementation, extracting and outputting the feature fields sentence by sentence according to the feature field extraction tags specifically includes: matching each sentence of the official document to be extracted against each feature field extraction tag; when a match succeeds, extracting the feature field according to that feature field extraction tag; and then sorting and outputting the extraction results.

Since the device described in the second embodiment of the present invention is the device used to implement the method of the first embodiment, a person skilled in the art can understand its specific structure and variations from the method described in the first embodiment, and the details are therefore not repeated here. All devices used to carry out the method of the first embodiment of the present invention fall within the protection scope of the present invention.

Based on the same inventive concept, the application provides an electronic device embodiment corresponding to the first embodiment, which is detailed in the third embodiment.

Example three

This embodiment provides an electronic device, as shown in FIG. 5, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, any implementation of the first embodiment can be realized.

Since the electronic device described in this embodiment is the device used to implement the method of the first embodiment of the present application, a person skilled in the art can understand its specific implementation and variations from the method described in the first embodiment, so how the electronic device implements that method is not described in detail here. Any equipment used by those skilled in the art to implement the methods in the embodiments of the present application falls within the protection scope of the present application.

Based on the same inventive concept, the application provides a storage medium corresponding to the first embodiment, which is described in detail in the fourth embodiment.

Example four

This embodiment provides a computer-readable storage medium, as shown in FIG. 6, on which a computer program is stored; when the computer program is executed by a processor, any implementation of the first embodiment can be realized.

According to the technical solutions provided by the embodiments of the application, by introducing a custom official document feature extraction template written in Extensible Markup Language (XML), the expression of official document feature extraction is separated from the code; templating the extraction method makes official document feature extraction visual and easy to operate; by providing an attachment extraction identification tag, a sentence splitting rule tag, and at least one feature field extraction tag, feature extraction from unstructured official documents is realized in a building-block manner, which greatly reduces the difficulty of official document feature extraction; machine feature extraction replaces manual feature extraction, which improves extraction quality, greatly lowers the threshold for feature extraction in the system, effectively improves the efficiency of official document feature extraction, and lays a solid foundation for intelligent applications of official documents.

Although specific embodiments of the invention have been described above, it will be understood by those skilled in the art that the specific embodiments described are illustrative only and are not limiting upon the scope of the invention, and that equivalent modifications and variations can be made by those skilled in the art without departing from the spirit of the invention, which is to be limited only by the appended claims.

Claims

1. An official document feature extraction method, characterized by comprising: an official document extraction template definition process and an official document feature extraction process;

the official document feature extraction process comprises the following steps:

2. The method of claim 1, wherein the feature field extraction tag includes: a sentence-based feature field extraction tag, a grouped feature field extraction tag, and a paragraph-split extraction tag;

3. The method according to claim 1 or 2, wherein extracting and outputting the feature fields sentence by sentence according to the feature field extraction tags specifically comprises: matching each sentence of the official document to be extracted against each feature field extraction tag; when a match succeeds, extracting the feature field according to that feature field extraction tag; and then sorting and outputting the extraction results.

4. An official document feature extraction device, characterized by comprising: an official document extraction template definition module and an official document feature extraction module;

5. The device of claim 4, wherein the feature field extraction tag includes: a sentence-based feature field extraction tag, a grouped feature field extraction tag, and a paragraph-split extraction tag;

6. The device of claim 4 or 5, wherein extracting and outputting the feature fields sentence by sentence according to the feature field extraction tags specifically comprises: matching each sentence of the official document to be extracted against each feature field extraction tag; when a match succeeds, extracting the feature field according to that feature field extraction tag; and then sorting and outputting the extraction results.

7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 3 when executing the program.

8. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 3.