CN112668316A

CN112668316A - word document key information extraction method

Info

Publication number: CN112668316A
Application number: CN202011290565.XA
Authority: CN
Inventors: 张丽; 董雨辰; 张翔宇; 杜慧; 解峥; 钟习; 陈志鹏; 俞晓明; 刘悦
Original assignee: Institute of Computing Technology of CAS; National Computer Network and Information Security Management Center
Current assignee: Institute of Computing Technology of CAS; National Computer Network and Information Security Management Center
Priority date: 2020-11-17
Filing date: 2020-11-17
Publication date: 2021-04-16

Abstract

The invention discloses a word document key information extraction method, which comprises the following steps: step one, obtaining a source word document, traversing paragraphs of the word document, judging whether any paragraph has a template style attribute, if so, entering a step two, otherwise, entering a step three; step two, acquiring paragraph information categories according to template style attributes of paragraphs, matching the paragraph information categories with a preset key information category list to be extracted, extracting the paragraphs and inputting the paragraphs into a region corresponding to the information categories in the first output file; and step three, identifying the information category of the paragraph based on a preset neural network model, matching the information category with a preset key information category list to be extracted, extracting the paragraph and inputting the paragraph into the region corresponding to the information category. The invention utilizes the template style attribute information in the word document, thereby greatly improving the efficiency of extracting the key information from the word document.

Description

word document key information extraction method

Technical Field

The invention relates to the technical field of information content processing. More specifically, the invention relates to a method for extracting key information of a word document.

Background

Existing methods for extracting key information from MS Word documents mainly rely on programmers writing special-purpose extraction programs; the methods differ widely from one another, and no fixed standard exists. Existing key information extraction cannot effectively extract styled paragraphs in MS Word documents; the prior art is poorly customizable, and in many cases the user cannot choose which categories of key information to extract; no effective extraction scheme exists for paragraphs without a style; and the format of the extracted output file is not standardized.

Disclosure of Invention

An object of the present invention is to solve at least the above problems and to provide at least the advantages described later.

The invention also aims to provide a method for extracting the key information of the word document, which classifies the paragraphs of the word document according to whether the paragraphs have the template style attributes by utilizing the template style attribute information of the paragraphs of the word document, adopts different key information extraction methods for the paragraphs of different types, and greatly improves the extraction efficiency of the key information of the word document; the invention outputs the extracted key information by adopting files with uniform format, so that the result of the program is clearer.

To achieve these objects and other advantages in accordance with the present invention, there is provided a word document key information extraction method, comprising:

step one, obtaining a source Word document, traversing the paragraphs of the Word document, and judging whether each paragraph has a template style attribute; if the paragraph has a template style attribute, entering step two; if the paragraph does not have a template style attribute, entering step three;

step two, acquiring paragraph information categories according to template style attributes of paragraphs, matching the paragraph information categories with a preset key information category list to be extracted, extracting the paragraphs and inputting the paragraphs into a region corresponding to the information categories in the first output file;

and step three, identifying the information category of the paragraph based on a preset neural network model, matching the information category with a preset key information category list to be extracted, extracting the paragraph, and inputting the paragraph into an area corresponding to the information category in the first output file.
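The three steps above can be sketched as a simple dispatch loop. This is a minimal illustration rather than the patented implementation: the `Paragraph` class, the `STYLE_TO_CATEGORY` mapping, the style names, and `toy_classifier` (a stand-in for the neural network model) are all hypothetical, and the sketch assumes that a paragraph whose style is missing or unknown is routed to step three.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from template style names to information categories.
STYLE_TO_CATEGORY = {"Heading 1": "title", "Normal": "body", "Table Grid": "table"}


@dataclass
class Paragraph:
    text: str
    style: Optional[str] = None  # None models a paragraph without a template style


def extract(paragraphs, wanted, classify):
    """Route each paragraph through step two or step three and collect matches."""
    output = {category: [] for category in wanted}
    for p in paragraphs:
        if p.style in STYLE_TO_CATEGORY:      # step two: template style attribute present
            category = STYLE_TO_CATEGORY[p.style]
        else:                                  # step three: fall back to the model
            category = classify(p)
        if category in output:                 # match against the preset category list
            output[category].append(p.text)
    return output


def toy_classifier(p):
    # Stand-in for the neural network model: short text is treated as a title.
    return "title" if len(p.text) < 20 else "body"


doc = [
    Paragraph("Annual Report", "Heading 1"),
    Paragraph("Revenue grew steadily over the reporting period.", "Normal"),
    Paragraph("Appendix"),
]
result = extract(doc, ["title", "table"], toy_classifier)
print(result)
```

Because "body" is not in the requested list here, the second paragraph is classified but not extracted, which mirrors the matching against the preset list in steps two and three.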

Preferably, in the Word document key information extraction method, the preset list of key information categories to be extracted includes at least categories such as title, body text, and table.

Preferably, in the Word document key information extraction method, identifying the information category of the paragraph based on the preset neural network model in step three specifically comprises: preprocessing the paragraph according to a preset format attribute rule and extracting a feature vector M, inputting the feature vector M into the preset neural network model, obtaining the output result of the neural network model, and determining the information category of the paragraph according to the output result;

wherein $M = [m_1, m_2, \ldots, m_n]$, where each $m_i$ represents a format attribute;

The neural network model comprises three fully-connected layers, wherein the output dimension of the first fully-connected layer is 50, the output dimension of the second fully-connected layer is 20, and the output dimension of the third fully-connected layer is n, where n is equal to the number of categories in the preset list of key information categories to be extracted.

Preferably, in the Word document key information extraction method, the format attributes include at least one of font size, font style, text length, paragraph spacing, whether the text is bold, whether the text is italic, and the like.

Preferably, the Word document key information extraction method further comprises step four: performing format processing on all paragraphs of the Word document according to preset format attributes to form a new Word document as a second output file.

Preferably, in the Word document key information extraction method, the first output file is in JSON format.

Preferably, in the Word document key information extraction method, obtaining the Word document in step one specifically comprises: filling in a configuration file, wherein the configuration file comprises a to-be-processed file name field and a to-be-processed file storage path field; and reading these two fields, parsing the file name and the file storage path, and acquiring the file;

wherein the file is a Word document or a folder; if the file is a Word document, the Word document is obtained and all of its paragraphs are traversed; if the file is a folder, a plurality of threads are started, each thread acquiring at least one Word document in the folder and traversing all paragraphs of that document.

Preferably, in the Word document key information extraction method, the configuration file further comprises a to-be-extracted key information category field; in step one, while the Word document is obtained, this field is read and the key information categories to be extracted are set, forming the preset list of key information categories to be extracted.

The invention also provides a word document key information extraction device, which comprises:

a processor;

a memory storing executable instructions;

wherein the processor is configured to execute the executable instructions so as to perform the Word document key information extraction method.

The invention at least comprises the following beneficial effects:

1. the invention uses the template style attribute information of word document paragraphs to classify the paragraphs of the word document according to whether the paragraphs have the template style attribute, and adopts different key information extraction methods for different types of paragraphs, thereby greatly improving the extraction efficiency of the key information of the word document; the invention outputs the extracted key information by adopting the file with the uniform format, so that the result of the program is clearer;

2. the key information categories to be extracted are stored in advance in a configuration file; the program reads the configuration information from the configuration file and then extracts from the target Word document, which improves the flexibility and customizability of the program;

3. for paragraphs without a style, a preset neural network model is used to compute and identify the information category of each paragraph before targeted extraction is performed, which greatly improves the processing efficiency for such paragraphs and in turn the efficiency of extracting key information from the Word document.

Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention.

Drawings

FIG. 1 is a flowchart of a word document key information extraction method according to the present invention.

Detailed Description

The present invention is further described in detail below with reference to the drawings and examples so that those skilled in the art can practice the invention with reference to the description.

It will be understood that terms such as "having," "including," and "comprising," as used herein, do not preclude the presence or addition of one or more other elements or groups thereof.

In the description of the present invention, the terms "lateral", "longitudinal", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are only for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention.

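The invention provides a Word document key information extraction method, which comprises the following steps:

step one, obtaining a source Word document, traversing the paragraphs of the Word document, and judging whether each paragraph has a template style attribute; if the paragraph has a template style attribute, entering step two; if the paragraph does not have a template style attribute, entering step three;

step two, acquiring the information category of the paragraph according to its template style attribute, matching the information category with a preset list of key information categories to be extracted, extracting the paragraph, and inputting the paragraph into the region corresponding to the information category in the first output file;

and step three, identifying the information category of the paragraph based on a preset neural network model, matching the information category with the preset list of key information categories to be extracted, extracting the paragraph, and inputting the paragraph into the region corresponding to the information category in the first output file.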

In the technical scheme, the method and the device utilize the template style attribute information of the word document paragraphs, classify the paragraphs of the word document according to whether the paragraphs have the template style attribute, adopt different key information extraction methods for different types of paragraphs, and greatly improve the extraction efficiency of the key information of the word document; the invention outputs the extracted key information by adopting files with uniform format, so that the result of the program is clearer.

In another technical scheme, in the Word document key information extraction method, the preset list of key information categories to be extracted includes at least categories such as title, body text, and table. The three kinds of key information (titles, body text, and tables) are extracted from the document and gathered by category in the first output file, so that a user can quickly obtain the main content of the document and grasp its key information.

In another technical scheme, in the Word document key information extraction method, identifying the information category of the paragraph based on the preset neural network model in step three specifically comprises: preprocessing the paragraph according to a preset format attribute rule and extracting a feature vector M, inputting the feature vector M into the preset neural network model, obtaining the output result of the neural network model, and determining the information category of the paragraph according to the output result;

wherein $M = [m_1, m_2, \ldots, m_n]$, where each $m_i$ represents a format attribute;

For paragraphs without a style, a preset neural network model is used to compute and identify the information category of each paragraph before targeted extraction is performed, which greatly improves the processing efficiency for such paragraphs and in turn the efficiency of extracting key information from the Word document.

Preprocessing the paragraph to obtain a feature vector corresponding to the paragraph, taking the feature vector as an input value of a neural network model, outputting a result after calculation of the neural network model, and determining the information category of the paragraph according to the output result; the neural network has better generalization performance and can quickly and accurately extract each key information in the document;

the invention adopts a neural network model with three fully-connected layers:

the input dimension of the first fully-connected layer is 100 (the dimension of the input features) and its output dimension is 50; this layer processes the original features to obtain the hidden-layer features;

the input dimension of the second fully-connected layer is 50 and its output dimension is 20; this layer processes the hidden-layer features, that is, multiplies them by a weight matrix W, thereby changing their dimension;

the input dimension of the third fully-connected layer is 20, and its output dimension is determined by the number of key information categories to be extracted, with the output values arranged in the same order as those categories. For example, if the key information categories to be extracted are title, body text, and table, the output dimension of the third fully-connected layer is 3, and softmax converts the 3 values into three probabilities that form the output result of the neural network model. If the final output for a certain paragraph is (0.2, 0.1, 0.7), the three probabilities sum to 1: the paragraph is a title with probability 0.2, body text with probability 0.1, and a table with probability 0.7. The information category of the paragraph is therefore determined to be a table, and the paragraph is extracted and input into the table region of the first output file.
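The layer structure described above can be illustrated with a forward pass in plain Python. This is a sketch under stated assumptions: the source does not name an activation function, so the ReLU between layers is an assumption; the weights are random placeholders rather than trained parameters; and the helper names (`dense`, `init_layer`, `forward`) are hypothetical.

```python
import math
import random


def dense(x, weights, bias):
    """Fully-connected layer: y = W x + b, one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    # Assumed activation; the source does not specify one.
    return [max(0.0, v) for v in x]


def softmax(z):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    # Random placeholder weights and zero biases (untrained).
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
CATEGORIES = ["title", "body", "table"]  # output order follows the category list
layers = [
    init_layer(100, 50, rng),              # first layer: 100 -> 50
    init_layer(50, 20, rng),               # second layer: 50 -> 20
    init_layer(20, len(CATEGORIES), rng),  # third layer: 20 -> number of categories
]


def forward(features):
    h = relu(dense(features, *layers[0]))
    h = relu(dense(h, *layers[1]))
    return softmax(dense(h, *layers[2]))


features = [rng.random() for _ in range(100)]  # placeholder 100-dimensional input
probs = forward(features)
predicted = CATEGORIES[probs.index(max(probs))]
print(probs, predicted)
```

With untrained weights the predicted category is meaningless; the point is only the shape of the computation, 100 to 50 to 20 to 3, ending in probabilities that sum to 1.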

Because this is a classification problem, a cross-entropy loss function is selected. The loss function measures the difference between the output of the third fully-connected layer and the true category to obtain an error, and the parameters of the third fully-connected layer are then optimized using back propagation and gradient descent. The parameters of the second and first fully-connected layers are optimized in the same way, with their gradients obtained through the chain rule.
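As a worked illustration of the loss, the snippet below computes the cross-entropy for the example output (0.2, 0.1, 0.7), assuming the true category is the table (index 2). It also shows the standard result that, for softmax followed by cross-entropy, the gradient with respect to the pre-softmax scores is the predicted probability minus the one-hot label. The function names are hypothetical.

```python
import math


def cross_entropy(probs, true_index):
    """Negative log-probability assigned to the true category."""
    return -math.log(probs[true_index])


def grad_wrt_logits(probs, true_index):
    """For softmax followed by cross-entropy, dL/dz_i = p_i - y_i."""
    return [p - (1.0 if i == true_index else 0.0) for i, p in enumerate(probs)]


probs = [0.2, 0.1, 0.7]           # example output from the description
loss = cross_entropy(probs, 2)    # assumed true category: table
grad = grad_wrt_logits(probs, 2)
print(round(loss, 4), [round(g, 2) for g in grad])
```

The gradient at the third layer is what back propagation passes down to the second and first layers through the chain rule.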

Extracting documents with a neural network model of this structure gives good results on both the validation set and the test set; the model is simple in structure, fast to run, and low in error, so the key information of a document can be extracted accurately and quickly.

In another technical scheme, the format attributes include at least one of font size, font style, text length, paragraph spacing, whether the text is bold, whether the text is italic, and the like. Through feature processing of the format attributes, each style-free paragraph can be represented by a vector of fixed dimension; each dimension of the vector represents one feature of the paragraph, and a feature may be discrete or continuous. For example, the font is a discrete feature: 1 for Song (SimSun), 2 for Hei (SimHei), 3 for Li (clerical script), and so on. The line spacing is a continuous feature whose value is the line spacing itself. If a paragraph uses the Song font at font size two, with bold, underline, italic and similar attributes recorded as flags, processing the paragraph according to the format attributes yields a numeric vector of the form [1, 2, 1, 0, 0, x].
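The feature processing described above might look like the following sketch. The ordering of the vector components (font, size, bold, underline, italic, line spacing), the dictionary keys, the English font names, and the use of 0 for an unknown font are all assumptions made for illustration; the source only gives the example vector [1, 2, 1, 0, 0, x].

```python
# Hypothetical discrete codes for fonts, following the 1/2/3 numbering in the text.
FONT_CODES = {"Song": 1, "Hei": 2, "Li": 3}


def encode(fmt):
    """Turn a paragraph's format attributes into a fixed-length numeric vector."""
    return [
        FONT_CODES.get(fmt["font"], 0),  # discrete feature; 0 for unknown fonts (assumed)
        fmt["size"],                     # font size number
        int(fmt["bold"]),                # boolean flags become 0 or 1
        int(fmt["underline"]),
        int(fmt["italic"]),
        fmt["line_spacing"],             # continuous feature, used as-is
    ]


vector = encode({
    "font": "Song",
    "size": 2,
    "bold": True,
    "underline": False,
    "italic": False,
    "line_spacing": 1.5,
})
print(vector)
```

In this sketch the trailing x of the example vector is the line spacing, here 1.5.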

In another technical scheme, the Word document key information extraction method further comprises step four: performing format processing on all paragraphs of the Word document according to preset format attributes to form a new Word document as a second output file. After the key information of the document has been extracted, the document is formatted according to the preset format attributes (the template style attribute characteristics); that is, the titles, body text, tables and so on are typeset in a unified, standard format and output as the second output file, which makes later management and lookup of the document easier for the user.

In another technical scheme, in the Word document key information extraction method, the first output file is in JSON format. A JSON file is convenient for data transmission and parsing.
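A first output file in JSON format, with one region per information category, could be produced along the following lines. The key names, the sample paragraphs, and the `to_json` helper are hypothetical; the source does not specify the JSON schema.

```python
import json

# Hypothetical (category, text) pairs produced by the extraction steps.
extracted = [
    ("title", "Annual Report"),
    ("body", "Revenue grew."),
    ("table", "Q1 | Q2"),
    ("title", "Appendix"),
]


def to_json(items, categories):
    """Group extracted paragraphs into one region (list) per category."""
    regions = {category: [] for category in categories}
    for category, text in items:
        if category in regions:
            regions[category].append(text)
    return json.dumps(regions, ensure_ascii=False, indent=2)


output = to_json(extracted, ["title", "body", "table"])
print(output)
```

Passing `ensure_ascii=False` keeps non-ASCII text, such as Chinese paragraphs, readable in the output file.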

In another technical scheme, in the Word document key information extraction method, obtaining the Word document in step one specifically comprises: filling in a configuration file, wherein the configuration file comprises a to-be-processed file name field and a to-be-processed file storage path field; and reading these two fields, parsing the file name and the file storage path, and acquiring the file.

The to-be-processed file name field and the to-be-processed file storage path field are the first type of configuration information in the configuration file, "file_to_extract". For example, if a file on the computer's F drive needs to be processed, the configuration file is filled in with "file_to_extract": "F/data/". Before document extraction, the program first reads the configuration file, reads and parses the "file_to_extract" field, and obtains the document to be processed according to the file storage path and file name.

The invention can process not only a single document but also a folder in which a plurality of documents are stored. If the target is a single document, the program directly obtains the document and traverses all of its paragraphs. If the target is a folder, a plurality of threads are started, each thread handling at least one document in the folder; each thread recursively processes and extracts its assigned documents, traversing all paragraphs of each document, and finally the extraction results of all documents in the folder are merged and output through one file.
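The single-file versus folder handling can be sketched with a thread pool. This is an illustration under assumptions: `extract_document` is a placeholder for the real per-document extraction, the `.docx` extension and the worker count are assumed, and the recursive folder walk is one reading of the "recursive processing" mentioned above.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract_document(path):
    # Placeholder for per-document extraction (traversing all paragraphs).
    return {"file": path.name, "title": [], "body": [], "table": []}


def collect_documents(target):
    """Return the single document, or every .docx found recursively in a folder."""
    if target.is_dir():
        return sorted(target.rglob("*.docx"))
    return [target]


def process(target, max_workers=4):
    documents = collect_documents(target)
    # Each worker thread handles at least one document; results are merged in order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_document, documents))


# Demonstration on a temporary folder containing two empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("a.docx", "sub/b.docx"):
        (root / name).touch()
    merged = process(root)              # folder: several documents
    single = process(root / "a.docx")   # single document

print(merged)
```

`pool.map` returns results in input order, so the merged list is deterministic even though the documents are processed concurrently.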

In another technical scheme, in the Word document key information extraction method, the configuration file further comprises a to-be-extracted key information category field; in step one, while the Word document is obtained, this field is read and the key information categories to be extracted are set, forming the preset list of key information categories to be extracted. The to-be-extracted key information category field is the second type of configuration information, "class_to_extract". For example, to extract the title, body text, and table key information in a document, the configuration file may be filled in with "class_to_extract": [0, 1, 2, 3, 4, 5], wherein 0 represents a title, 1 represents body text, 2 represents a table, 3 represents a title in a style-free paragraph, 4 represents body text in a style-free paragraph, and 5 represents a table in a style-free paragraph. The key information to be extracted is stored in advance in the configuration file, and the program reads this information from the configuration file, applies the settings, and then performs extraction, which improves the flexibility and customizability of the program.

All documents referred to in the present invention are Word documents.
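Reading the two configuration fields and decoding the category codes can be sketched as follows. The code-to-category table restates the meaning of codes 0 to 5 given above; the tuple representation (category, has_style) and the function name `load_config` are assumptions made for illustration.

```python
import json

# Meaning of each code as described in the text: (category, paragraph has a style).
CATEGORY_CODES = {
    0: ("title", True),
    1: ("body", True),
    2: ("table", True),
    3: ("title", False),
    4: ("body", False),
    5: ("table", False),
}

config_text = '{"file_to_extract": "F/data/", "class_to_extract": [0, 1, 2, 3, 4, 5]}'


def load_config(text):
    """Parse the configuration and build the list of categories to extract."""
    config = json.loads(text)
    wanted = [CATEGORY_CODES[code] for code in config["class_to_extract"]]
    return config["file_to_extract"], wanted


path, wanted = load_config(config_text)
print(path, wanted)
```

Restricting extraction to, say, styled titles only would then be a matter of writing "class_to_extract": [0] in the configuration file.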

The invention also provides a Word document key information extraction device, comprising:

a processor;

a memory storing executable instructions;

wherein the processor is configured to execute the executable instructions so as to perform the Word document key information extraction method.

This technical scheme is based on the same inventive concept as the Word document key information extraction method, and reference can be made to the description of the method. The device of this technical scheme includes, but is not limited to, a PC, a terminal, or a server. For example, the device can be arranged in a server to acquire files and extract their key information.

The following is a specific example: extracting three kinds of key information (title, body text, and table) from a file on a computer;

as shown in fig. 1, the method for extracting the key information of the word document comprises the following steps:

step 100, filling in a configuration file: {"file_to_extract": "F/data/", "class_to_extract": [0, 1, 2, 3, 4, 5]};

file_to_extract: this field is the name of the file or folder to be extracted;

class_to_extract: this field is the list of key information categories to be extracted, where 0 represents a title, 1 represents body text, 2 represents a table, 3 represents a title in a style-free paragraph, 4 represents body text in a style-free paragraph, and 5 represents a table in a style-free paragraph.

step 200, running the program, which reads and parses the configuration file: first, the file_to_extract field filled in by the user is read and parsed to obtain the file to be processed, and the program judges whether it is a single file or a folder. If it is a single file, the program obtains the file directly. If it is a folder containing a plurality of files, a plurality of threads are started, each thread handling at least one file in the folder; each thread recursively processes and extracts its assigned files, traversing all paragraphs of each file, and finally the extraction results of all files in the folder are merged and output through one file;

then, the class_to_extract field is read and the categories of key information to be extracted are parsed, and the program sets these categories to form the list of key information categories to be extracted, [0, 1, 2, 3, 4, 5];

step 300, for each paragraph of the acquired document, judging whether the paragraph has a template style attribute; if it has a template style attribute, entering step 301; if it does not, entering step 302;

step 301, obtaining the information category of the paragraph according to its template style attribute, matching the information category with the preset list of key information categories to be extracted, and judging whether the paragraph belongs to one of the categories in the list; if so, the paragraph is extracted and input into the region corresponding to its information category in the first output file; otherwise, the paragraph is not extracted;

step 302, preprocessing the paragraph without a template style attribute according to a preset format attribute rule and extracting a feature vector M, taking the feature vector M as the input of the preset neural network model, and obtaining the output result of the neural network model, $[P_1, P_2, \ldots, P_n]$; the information category corresponding to the largest $P_i$ is the information category to which the paragraph belongs; the information category is matched with the preset list of key information categories to be extracted, and it is judged whether the paragraph belongs to one of the categories in the list; if so, the paragraph is extracted and input into the region corresponding to its information category in the first output file; otherwise, the paragraph is not extracted;

and step 400, performing format processing on all paragraphs of the Word document according to preset format attributes to form a new Word document as a second output file.
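The selection rule in step 302 (take the category with the largest probability, then check it against the configured list) can be sketched as below. It assumes that the model's three outputs correspond, in order, to the style-free codes 3, 4, and 5 from the configuration example; that correspondence and the function name are assumptions, not something the source states.

```python
# Assumed correspondence between model output positions and category codes:
# title, body text, and table in style-free paragraphs.
STYLE_FREE_CODES = [3, 4, 5]


def select_category(probs, wanted_codes):
    """Return the code of the most probable category, or None if it is not wanted."""
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest P
    code = STYLE_FREE_CODES[best]
    return code if code in wanted_codes else None


chosen = select_category([0.2, 0.1, 0.7], [0, 1, 2, 3, 4, 5])  # table is wanted
skipped = select_category([0.2, 0.1, 0.7], [0, 1, 2, 3])        # table is not wanted
print(chosen, skipped)
```

A return value of `None` corresponds to the "otherwise, the paragraph is not extracted" branch.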

The number of apparatuses and the scale of the process described herein are intended to simplify the description of the present invention. Applications, modifications and variations of the present invention will be apparent to those skilled in the art.

While embodiments of the invention have been described above, it is not limited to the applications set forth in the description and the embodiments, which are fully applicable in various fields of endeavor to which the invention pertains, and further modifications may readily be made by those skilled in the art, it being understood that the invention is not limited to the details shown and described herein without departing from the general concept defined by the appended claims and their equivalents.

Claims

1. A Word document key information extraction method, characterized by comprising the following steps:

step one, obtaining a source Word document, traversing the paragraphs of the Word document, and judging whether each paragraph has a template style attribute; if the paragraph has a template style attribute, entering step two; if the paragraph does not have a template style attribute, entering step three;

step two, acquiring paragraph information categories according to template style attributes of paragraphs, matching the paragraph information categories with a preset key information category list to be extracted, extracting the paragraphs and inputting the paragraphs into a region corresponding to the information categories in the first output file;

and step three, identifying the information category of the paragraph based on a preset neural network model, matching the information category with a preset key information category list to be extracted, extracting the paragraph, and inputting the paragraph into an area corresponding to the information category in the first output file.
2. The Word document key information extraction method according to claim 1, wherein the preset list of key information categories to be extracted includes at least categories such as title, body text, and table.
3. The Word document key information extraction method according to claim 2, wherein identifying the information category of the paragraph based on the preset neural network model in step three specifically comprises: preprocessing the paragraph according to a preset format attribute rule and extracting a feature vector M, inputting the feature vector M into the preset neural network model, obtaining the output result of the neural network model, and determining the information category of the paragraph according to the output result;

wherein $M = [m_1, m_2, \ldots, m_n]$, where each $m_i$ represents a format attribute;

the neural network model comprises three fully-connected layers, wherein the output dimension of the first fully-connected layer is 50, the output dimension of the second fully-connected layer is 20, and the output dimension of the third fully-connected layer is n, where n is equal to the number of categories in the preset list of key information categories to be extracted.
4. The Word document key information extraction method according to claim 3, wherein the format attributes include at least one of font size, font style, text length, paragraph spacing, whether the text is bold, whether the text is italic, and the like.
5. The Word document key information extraction method according to claim 4, further comprising step four: performing format processing on all paragraphs of the Word document according to preset format attributes to form a new Word document as a second output file.
6. The Word document key information extraction method according to claim 5, wherein the first output file is in JSON format.
7. The Word document key information extraction method according to claim 6, wherein obtaining the Word document in step one specifically comprises: filling in a configuration file, wherein the configuration file comprises a to-be-processed file name field and a to-be-processed file storage path field; and reading these two fields, parsing the file name and the file storage path, and acquiring the file;

wherein the file is a Word document or a folder; if the file is a Word document, the Word document is obtained and all of its paragraphs are traversed; if the file is a folder, a plurality of threads are started, each thread acquiring at least one Word document in the folder and traversing all paragraphs of that document.
8. The Word document key information extraction method according to claim 7, wherein the configuration file further comprises a to-be-extracted key information category field; in step one, while the Word document is obtained, this field is read and the key information categories to be extracted are set, forming the preset list of key information categories to be extracted.
9. A Word document key information extraction device, characterized by comprising:

a processor;

a memory storing executable instructions;

wherein the processor is configured to execute the executable instructions to execute the method for extracting the key information of the word document according to any one of claims 1 to 8.