CN113095061B - Method, system, device and storage medium for extracting document header - Google Patents

Publication number
CN113095061B
Authority
CN
China
Prior art keywords: document, hidden Markov model, header, text
Prior art date
Legal status
Active
Application number
CN202110344640.4A
Other languages
Chinese (zh)
Other versions
CN113095061A
Inventor
蓝建敏
李观春
Current Assignee
Excellence Information Technology Co ltd
Original Assignee
Excellence Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Excellence Information Technology Co ltd
Priority to CN202110344640.4A
Publication of CN113095061A
Application granted
Publication of CN113095061B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/226 Validation
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a method, a system, a device and a storage medium for extracting a document header based on a hidden Markov model. The extraction method comprises: obtaining a document text, wherein the document text comprises a document title, a document genre and document content; extracting the document header from the document text by using a trained hidden Markov model; and obtaining the document header output by the trained hidden Markov model. By using the trained hidden Markov model to extract the document header from the document text, the application reduces the cost of manually compiling extraction rules and improves the accuracy of document header extraction; at the same time, errors in the document header can be automatically corrected. The application can be widely applied in the technical field of document header extraction.

Description

Method, system, device and storage medium for extracting document header
Technical Field
The application relates to the technical field of document header extraction, and in particular to a method, a system, a device and a storage medium for extracting document headers based on a hidden Markov model.
Background
An official document is written material produced by legal authorities and organizations in the course of official activities. A common official document format generally consists of attributes such as copy number, security level, secrecy period, degree of urgency, issuing-authority mark, document number, signer, title, main recipient, body text, attachment description, issuing-authority signature, issuing date, seal, remarks, attachments, copied-to organs, printing date and page number; these attributes constitute the document header. Document header extraction refers to using information extraction technology to convert an unstructured document into structured data according to these document attribute fields.
At present, technologies such as rule matching and position locating are generally adopted to extract document attributes such as the title and secrecy period from document files. However, extraction based on rule matching relies on a large number of rules, and compiling rules manually involves a heavy workload, narrow coverage of examples and a high manual learning cost. Moreover, in practice many documents do not strictly follow the national document standard, resulting in low quality of the extracted document attributes.
Disclosure of Invention
The present application aims to solve at least one of the technical problems existing in the prior art. Therefore, the application provides a method, a system, a device and a storage medium for extracting a document head based on a hidden Markov model.
The technical scheme adopted by the application is as follows:
in one aspect, an embodiment of the present application provides a method for extracting a document header based on a hidden Markov model, the method including:
obtaining a document text, wherein the document text comprises a document title, a document genre and document content;
extracting the document head of the document text by using the trained hidden Markov model;
and obtaining the document head output by the trained hidden Markov model.
Further, the extraction method further includes:
performing error correction processing on the extracted document header contents one by one by utilizing a trained document header correction model.
Further, the step of performing error correction processing on the extracted document header contents one by one by using the trained document header correction model includes:
checking whether the document header content is erroneous by using the trained document header correction model;
and if it is erroneous content, calculating the correct document header content by using a cosine similarity algorithm according to the erroneous content.
Further, the step of extracting the document header from the document text by using the trained hidden Markov model specifically includes:
obtaining the corresponding document header extraction attributes according to the genre of the document text;
taking the document text as the observation input;
mapping the document text into words in a word set by using a maximum path similarity method;
outputting the mapped word sequence by using a Viterbi algorithm;
and obtaining the optimal state-labeling sequence from the word sequence and outputting it.
Further, the optimal state-labeling sequence is the document header attribute sequence.
Further, the extraction method further comprises training the hidden Markov model, including:
constructing a training sample set;
inputting the training sample set into a hidden Markov model for training;
learning and training by adopting an ML algorithm to obtain parameters of a hidden Markov model;
and outputting a trained hidden Markov model according to the parameters.
Further, the step of constructing a training sample set specifically includes:
collecting document corpus of various genres and analyzing document header attribute;
determining the document head extraction attribute of documents of various genres;
and carrying out sequence labeling on the document head extraction attribute to obtain a training sample set.
On the other hand, the embodiment of the application also comprises a document header extraction system based on a hidden Markov model, which comprises:
the first acquisition module, used for obtaining a document text, wherein the document text comprises a document title, a document genre and document content;
the extraction module, used for extracting the document header of the document text by using the trained hidden Markov model;
and the second acquisition module, used for acquiring the document header output by the trained hidden Markov model.
On the other hand, the embodiment of the application also comprises a document head extraction device based on a hidden Markov model, which comprises:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the extraction method.
In another aspect, embodiments of the present application further include a computer readable storage medium having stored thereon a processor executable program for implementing the extraction method when executed by a processor.
The beneficial effects of the application are as follows:
according to the application, the trained hidden Markov model is utilized to extract the document header of the document text, so that the manual learning cost can be reduced, and the accuracy of extracting the document header can be improved; meanwhile, errors in the document header can be automatically revised.
Additional aspects and advantages of the application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the application will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:
FIG. 1 is a flowchart illustrating a method for extracting a document header based on a hidden Markov model according to an embodiment of the present application;
FIG. 2 is a flow chart of the document header content correction according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for extracting a document header based on a hidden Markov model according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a hidden Markov model-based document head extraction procedure according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a document head extraction device based on a hidden markov model according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the application.
In the description of the present application, it should be understood that references to orientation descriptions such as upper, lower, front, rear, left, right, etc. are based on the orientation or positional relationship shown in the drawings, are merely for convenience of description of the present application and to simplify the description, and do not indicate or imply that the apparatus or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus should not be construed as limiting the present application.
In the description of the present application, "a number of" means one or more and "a plurality of" means two or more; greater than, less than, exceeding, etc. are understood to exclude the stated number, while above, below, within, etc. are understood to include it. The descriptions "first" and "second" are only for distinguishing technical features and should not be construed as indicating or implying relative importance, the number of technical features indicated, or the precedence of the technical features indicated.
In the description of the present application, unless explicitly defined otherwise, terms such as arrangement, installation, connection, etc. should be construed broadly and the specific meaning of the terms in the present application can be reasonably determined by a person skilled in the art in combination with the specific contents of the technical scheme.
Embodiments of the present application will be further described below with reference to the accompanying drawings.
Referring to fig. 1, an embodiment of the present application includes a method for extracting a document header based on a hidden markov model, including:
s1, acquiring a document text, wherein the document text comprises a document title, a document genre and document content;
s2, extracting a document header of the document text by using the trained hidden Markov model;
s3, acquiring the document head output by the trained hidden Markov model.
Regarding step S1, the text from which document header information is to be extracted is obtained; its content includes a document title, a document genre and document content.
Further, after step S1, that is, after the step of obtaining the document text, the extraction method further includes:
S101, preprocessing the document text by removing blank lines and merging broken lines and split sentences.
In this embodiment, after preprocessing is completed, the preprocessed document text is input into the trained hidden Markov model.
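The preprocessing in step S101 might be sketched as follows. The application does not specify the exact merging heuristic, so the sentence-terminator check below is an illustrative assumption:

```python
def preprocess_document(text):
    """Minimal preprocessing sketch: drop blank lines and merge lines
    that were broken mid-sentence (previous line has no terminal
    punctuation). Chinese and ASCII terminators are both checked;
    segments are joined without a space, as is natural for Chinese."""
    terminators = ("。", "：", "！", "？", ".", ":", "!", "?", ";", "；")
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    merged = []
    for ln in lines:
        if merged and not merged[-1].endswith(terminators):
            merged[-1] += ln  # continue the broken sentence
        else:
            merged.append(ln)
    return merged
```

The merged line list can then be fed to the hidden Markov model as the observation sequence.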
Specifically, step S2, that is, the step of extracting the document header from the document text by using the trained hidden Markov model, specifically includes:
S201, obtaining the corresponding document header extraction attributes according to the genre of the document text;
S202, taking the document text as the observation input;
S203, mapping the document text into words in the word set by using a maximum path similarity method;
S204, outputting the mapped word sequence by using the Viterbi algorithm;
S205, obtaining the optimal state-labeling sequence from the word sequence and outputting it.
In this embodiment, the trained hidden Markov model is used to extract the document header from the input document text. First, according to the document genre, the corresponding document header extraction attributes are obtained as the determined states; then the input text is taken as the observation input and mapped to words in the word set by the maximum path similarity method; finally, the Viterbi algorithm decodes the mapped word sequence and outputs the optimal state-labeling sequence, namely the document header attribute sequence.
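As a concrete sketch of the decoding in steps S204 and S205, the standard Viterbi algorithm recovers the most likely state (header-attribute) sequence for an observation (word) sequence. The dictionary-based probability tables and the two-state toy model in the test are illustrative assumptions, not the patent's actual parameters:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Standard Viterbi decoding: return the most likely hidden-state
    sequence for an observation sequence.  start_p[s], trans_p[s1][s2]
    and emit_p[s][o] are the HMM probabilities pi, A and B as dicts."""
    # delta[s]: probability of the best path ending in state s so far
    delta = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    backpointers = []
    for o in obs[1:]:
        back, new_delta = {}, {}
        for s in states:
            # Best predecessor state for s at this step.
            prev = max(states, key=lambda p: delta[p] * trans_p[p][s])
            back[s] = prev
            new_delta[s] = delta[prev] * trans_p[prev][s] * emit_p[s][o]
        backpointers.append(back)
        delta = new_delta
    # Trace back from the most probable final state.
    path = [max(states, key=lambda s: delta[s])]
    for back in reversed(backpointers):
        path.append(back[path[-1]])
    path.reverse()
    return path
```

For long documents a production implementation would work in log-space to avoid underflow; the plain products above keep the sketch readable.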
Optionally, after the obtaining the document header output by the trained hidden markov model, the extracting method further includes:
s4, utilizing the trained document header correction model to perform error correction processing on the extracted document header content one by one.
In this embodiment, performing error correction processing on the extracted document header contents one by one by using the trained document header correction model specifically includes the following steps:
s401, checking whether the content of the document header is wrong or not by using a trained document header correction model;
s402, if the content is the error content, calculating to obtain the correct document header content by using a cosine similarity algorithm according to the error content.
Referring to fig. 2, the document header content correction flow is as follows:
1) Inputting the document header content to be corrected;
After the document header extraction model extracts the document header information, the extracted contents are input one by one for error checking and correction.
2) Applying the trained document header correction model to judge whether the content is erroneous;
The document header correction model is applied to check whether the input content is erroneous document header information.
3) Correcting the erroneous content;
If the verification in the previous step finds erroneous content, the cosine distance similarity algorithm is applied to compute the similarity between the erroneous content and candidate correct contents, and the most similar correct content is taken as the correction.
For different texts or short texts, a good way to calculate similarity is to map the words in the texts into a vector space, forming a mapping between the words and vector data, and to measure text similarity by the difference between the resulting vectors. The cosine distance, also called cosine similarity, uses the cosine of the angle between two vectors in the vector space as a measure of the difference between two individuals. The closer the cosine value is to 1, the closer the angle is to 0 degrees, that is, the more similar the two vectors are; hence the name "cosine similarity".
4) Outputting the correct document header content, namely outputting the corrected text content.
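The cosine-similarity correction above can be sketched as follows. Representing each string as a bag-of-characters vector is an illustrative choice, since the application does not specify the vectorization:

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two bag-of-character vectors;
    1.0 means identical direction, 0.0 means no shared characters."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[ch] * vb[ch] for ch in set(va) & set(vb))
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def correct_header(candidate, known_correct_contents):
    """Return the known correct content most similar to the
    (possibly erroneous) candidate header content."""
    return max(known_correct_contents,
               key=lambda w: cosine_similarity(candidate, w))
```

A real correction model would draw `known_correct_contents` from the per-attribute vocabulary learned in training; the list in the test below is a stand-in.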
In this embodiment, the extraction method further includes training a hidden markov model, including:
p1, constructing a training sample set;
p2, inputting the training sample set into a hidden Markov model for training;
p3, learning and training by adopting an ML algorithm to obtain parameters of a hidden Markov model;
and P4, outputting a trained hidden Markov model according to the parameters.
In this embodiment, the hidden Markov model is trained. First, a brief description of the model: a hidden Markov model (Hidden Markov Model, HMM) is a probabilistic model of a time series, describing a Markov chain of hidden states together with the observations generated by each state, which form the observation sequence. The states in the model are not directly visible; only the observations generated by the states can be observed. The values of the states are determined by training the model parameters and then decoding with a recognition algorithm to obtain the state sequence.
In a hidden Markov model, the state sequence is determined by the state transition probability distribution A and the initial state probability distribution pi, and the observation sequence is determined by the generated state sequence and the observation probability distribution B: each state emits an observation through B. The parameters A, B and pi can be learned from training data.
The document attributes to be extracted are distributed in the document text with a certain order relation. In the hidden Markov model, the document header attributes form the state set and the values to be extracted form the observable text sequence; the text sequence is labeled to form a training set, and an ML algorithm is used to learn the trained model parameters A, B and pi from it.
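Assuming, as the context suggests, that the "ML algorithm" refers to maximum-likelihood estimation on fully labeled sequences, the parameters pi, A and B reduce to normalized counts. The function below is a minimal sketch under that assumption:

```python
from collections import Counter, defaultdict

def estimate_hmm_params(labeled_sequences):
    """Maximum-likelihood HMM parameter estimation from labeled
    sequences (each a list of (word, state) pairs): pi, A and B
    are obtained by counting occurrences and normalizing."""
    pi, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for seq in labeled_sequences:
        pi[seq[0][1]] += 1                        # initial state count
        for word, state in seq:
            emit[state][word] += 1                # emission count
        for (_, s1), (_, s2) in zip(seq, seq[1:]):
            trans[s1][s2] += 1                    # transition count

    def normalize(counter):
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()}

    return (normalize(pi),
            {s: normalize(c) for s, c in trans.items()},
            {s: normalize(c) for s, c in emit.items()})
```

In practice the counts would be smoothed so unseen words and transitions do not get zero probability; smoothing is omitted here for clarity.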
Step P1, namely the step of constructing a training sample set, specifically includes:
p101 collecting document corpus of various genres and analyzing document header attribute;
p102. determining the document head extraction attribute of the documents of various genres;
and P103, carrying out sequence labeling on the document head extraction attribute to obtain a training sample set.
In this embodiment, document corpora of various genres are collected and the document header attributes are analyzed to determine the document header extraction attributes, such as title, security level, secrecy period, degree of urgency and issuing-authority mark; then, sequence labeling is performed on a certain scale of document header attributes to obtain the training sample set.
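A sequence-labeled training sample then pairs each observed text segment with its header-attribute state. The English state names below are hypothetical stand-ins for the attributes listed above:

```python
# Hypothetical state set: English stand-ins for title, security level,
# secrecy period, degree of urgency and issuing-authority mark.
STATES = ["title", "security_level", "secrecy_period",
          "urgency", "issuing_authority"]

def make_sample(segments, tags):
    """Pair each observed text segment with its header-attribute state,
    yielding one sequence-labeled training sample for the HMM."""
    if len(segments) != len(tags):
        raise ValueError("segments and tags must align one-to-one")
    unknown = set(tags) - set(STATES)
    if unknown:
        raise ValueError(f"unknown states: {unknown}")
    return list(zip(segments, tags))
```

A corpus of such samples is exactly the input the parameter-estimation step consumes.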
Specifically, the hidden Markov model-based document header extraction method may refer to fig. 3, where the initial states include title, security level, secrecy period, degree of urgency, issuing-authority mark, and so on. Inputting the preprocessed training samples into the hidden Markov model constitutes the training part. For extraction, the text to be extracted is preprocessed and input into the trained hidden Markov model, which finally outputs the optimal state-labeling sequence, namely the document header attribute sequence. With this hidden Markov model-based method, an unstructured document is extracted into structured data according to the document header attribute fields.
Referring to fig. 4, this embodiment further provides a hidden Markov model-based document header extraction program for implementing the document header extraction method shown in fig. 1. The program specifically includes a labeling module, a learning and training module, a model library, a text preprocessing module, and a document header extraction module. The labeling module is used to label the attributes of the document corpus; the learning and training module is used to learn from the labeled corpus to obtain the document header extraction model, namely the hidden Markov model; the model library is used to store the trained model; the text preprocessing module is used to remove blank lines and merge broken lines and sentences in the input document text; and the document header extraction module is used to extract the document header attributes from the preprocessed text.
The document head extraction method based on the hidden Markov model has the following technical effects:
according to the embodiment of the application, the trained hidden Markov model is utilized to extract the document header of the document text, so that the manual learning cost can be reduced, and the accuracy of extracting the document header can be improved; meanwhile, errors in the document header can be automatically revised.
Referring to fig. 5, the embodiment of the present application further includes a document head extraction device 200 based on a hidden markov model, which specifically includes:
at least one processor 210;
at least one memory 220 for storing at least one program;
the at least one program, when executed by the at least one processor 210, causes the at least one processor 210 to implement the method as shown in fig. 1.
The memory 220 is used as a non-transitory computer readable storage medium for storing non-transitory software programs and non-transitory computer executable programs. Memory 220 may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some implementations, memory 220 may optionally include remote memory located remotely from processor 210, which may be connected to processor 210 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
It will be appreciated that the device structure shown in fig. 5 is not limiting of the device 200 and may include more or fewer components than shown, or may be combined with certain components, or a different arrangement of components.
In the apparatus 200 shown in fig. 5, the processor 210 may retrieve the program stored in the memory 220 and perform, but is not limited to, the steps of the embodiment shown in fig. 1.
The above-described embodiment of the apparatus 200 is merely illustrative, in which the units illustrated as separate components may or may not be physically separate, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiment.
The embodiment of the present application also provides a computer-readable storage medium storing a processor-executable program for implementing the method shown in fig. 1 when executed by a processor.
Embodiments of the present application also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the method shown in fig. 1.
It is to be understood that all or some of the steps, systems, and methods disclosed above may be implemented in software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
The embodiments of the present application have been described in detail with reference to the accompanying drawings, but the present application is not limited to the above embodiments, and various changes can be made within the knowledge of one of ordinary skill in the art without departing from the spirit of the present application.

Claims (4)

1. A document header extraction method based on a hidden Markov model, characterized by comprising the following steps:
obtaining a document text, wherein the document text comprises a document title, a document genre and document content;
extracting the document head of the document text by using the trained hidden Markov model;
obtaining a document head output by the trained hidden Markov model;
wherein the step of extracting the document header of the document text by using the trained hidden Markov model comprises the following steps:
obtaining the corresponding document header extraction attributes according to the genre of the document text;
taking the document text as observation input;
mapping the document text into words in a word set by using a maximum path similarity method;
outputting the mapped word sequence by using a Viterbi algorithm;
obtaining and outputting an optimal state-labeling sequence from the word sequence, wherein the optimal state-labeling sequence is a document header attribute sequence;
the extraction method further comprises training the hidden Markov model, and comprises the following steps:
constructing a training sample set;
inputting the training sample set into a hidden Markov model for training;
learning and training by adopting an ML algorithm to obtain parameters of a hidden Markov model;
outputting a trained hidden Markov model according to the parameters;
wherein the constructing a training sample set includes:
collecting document corpus of various genres and analyzing document header attribute;
determining the document head extraction attribute of documents of various genres, including title, security level, security period, emergency degree and document issuing authority mark;
performing sequence labeling on the document head extraction attribute to obtain a training sample set;
the extraction method further comprises the step of performing error correction processing on extracted document header contents one by using a trained document header correction model, and comprises the following steps:
checking whether the content of the document header is wrong or not by using a trained document header correction model;
and if the content is the error content, calculating to obtain the correct document header content by using a cosine similarity algorithm according to the error content.
2. A hidden Markov model-based document header extraction system, characterized by comprising:
the first acquisition module is used for acquiring a document text, wherein the document text comprises a document title, a document genre and document content;
the extraction module is used for extracting the document head of the document text by using the trained hidden Markov model;
the second acquisition module acquires a document head output by the trained hidden Markov model;
the implementation process of the extraction module comprises the following steps:
obtaining the corresponding document header extraction attributes according to the genre of the document text;
taking the document text as observation input;
mapping the document text into words in a word set by using a maximum path similarity method;
outputting the mapped word sequence by using a Viterbi algorithm;
obtaining and outputting an optimal state-labeling sequence from the word sequence, wherein the optimal state-labeling sequence is a document header attribute sequence;
the extraction system further comprises a training module for: constructing a training sample set; inputting the training sample set into a hidden Markov model for training; learning and training by adopting an ML algorithm to obtain parameters of a hidden Markov model; outputting a trained hidden Markov model according to the parameters;
wherein the constructing a training sample set includes:
collecting document corpus of various genres and analyzing document header attribute;
determining the document head extraction attribute of documents of various genres, including title, security level, security period, emergency degree and document issuing authority mark;
performing sequence labeling on the document head extraction attribute to obtain a training sample set;
the extraction system further comprises a module for performing error correction, item by item, on the extracted document header contents using a trained document header correction model, the implementation process of this module comprising: checking, with the trained document header correction model, whether the document header content is erroneous; and if the content is erroneous, computing the correct document header content from the erroneous content using a cosine similarity algorithm.
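By way of illustration only (not part of the claimed subject matter; the character-frequency vectorization and the candidate vocabulary below are assumptions made for the sketch), a cosine-similarity correction step can pick the reference value closest to the erroneous content:

```python
# Illustrative sketch: correct a suspect header value by choosing the
# candidate whose character-frequency vector is most cosine-similar to it.
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two strings as character-frequency vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[ch] * cb[ch] for ch in set(ca) & set(cb))
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def correct(value, vocabulary):
    """Return the vocabulary entry most similar to the erroneous value."""
    return max(vocabulary, key=lambda cand: cosine(value, cand))
```

For example, a misspelled security-level value would map to the nearest entry of a reference vocabulary of valid header values.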
3. A hidden-Markov-model-based document header extraction device, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the extraction method as claimed in claim 1.
4. A computer-readable storage medium, characterized in that it stores a processor-executable program which, when executed by a processor, implements the extraction method according to claim 1.
CN202110344640.4A 2021-03-31 2021-03-31 Method, system, device and storage medium for extracting document header Active CN113095061B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110344640.4A CN113095061B (en) 2021-03-31 2021-03-31 Method, system, device and storage medium for extracting document header

Publications (2)

Publication Number Publication Date
CN113095061A CN113095061A (en) 2021-07-09
CN113095061B true CN113095061B (en) 2023-08-29

Family

ID=76671834

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110344640.4A Active CN113095061B (en) 2021-03-31 2021-03-31 Method, system, device and storage medium for extracting document header

Country Status (1)

Country Link
CN (1) CN113095061B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572849A (en) * 2014-12-17 2015-04-29 西安美林数据技术股份有限公司 Automatic standardized filing method based on text semantic mining
CN105912570A (en) * 2016-03-29 2016-08-31 北京工业大学 English resume key field extraction method based on hidden Markov model
CN108345672A (en) * 2018-02-09 2018-07-31 平安科技(深圳)有限公司 Intelligent response method, electronic device and storage medium
CN108804414A (en) * 2018-05-04 2018-11-13 科沃斯商用机器人有限公司 Text modification method, device, smart machine and readable storage medium storing program for executing
CN111209865A (en) * 2020-01-06 2020-05-29 中科鼎富(北京)科技发展有限公司 File content extraction method and device, electronic equipment and storage medium
CN112101032A (en) * 2020-08-31 2020-12-18 广州探迹科技有限公司 Named entity identification and error correction method based on self-distillation
CN112364172A (en) * 2020-10-16 2021-02-12 上海晏鼠计算机技术股份有限公司 Method for constructing knowledge graph in government official document field
CN112445915A (en) * 2021-01-28 2021-03-05 京华信息科技股份有限公司 Document map extraction method and device based on machine learning and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7188064B2 (en) * 2001-04-13 2007-03-06 University Of Texas System Board Of Regents System and method for automatic semantic coding of free response data using Hidden Markov Model methodology
US7469251B2 (en) * 2005-06-07 2008-12-23 Microsoft Corporation Extraction of information from documents
US10740545B2 (en) * 2018-09-28 2020-08-11 International Business Machines Corporation Information extraction from open-ended schema-less tables

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Information Extraction from Chinese Scientific Research Papers Based on Hidden Markov Models; Yu Jiangde et al.; Computer Engineering (Issue 19); pp. 190-192 *

Also Published As

Publication number Publication date
CN113095061A (en) 2021-07-09

Similar Documents

Publication Publication Date Title
US10650192B2 (en) Method and device for recognizing domain named entity
JP4829920B2 (en) Form automatic embedding method and apparatus, graphical user interface apparatus
KR101114194B1 (en) Assisted form filling
CN110232340B (en) Method and device for establishing video classification model and video classification
CN110136747A (en) A kind of method, apparatus, equipment and storage medium for evaluating phoneme of speech sound correctness
US20150213333A1 (en) Method and device for realizing chinese character input based on uncertainty information
CA3048356A1 (en) Unstructured data parsing for structured information
US20190114313A1 (en) User interface for contextual document recognition
CN110968730B (en) Audio mark processing method, device, computer equipment and storage medium
CN110147545B (en) Method and system for structured output of text, storage medium and computer equipment
CN110737770B (en) Text data sensitivity identification method and device, electronic equipment and storage medium
CN113095061B (en) Method, system, device and storage medium for extracting document header
US20170351661A1 (en) System and method for understanding text using a translation of the text
US11431472B1 (en) Automated domain language parsing and data extraction
CN115168345A (en) Database classification method, system, device and storage medium
CN115294593A (en) Image information extraction method and device, computer equipment and storage medium
CN114065762A (en) Text information processing method, device, medium and equipment
CN115270768A (en) Method and equipment for determining target key words to be corrected in text
CN114510925A (en) Chinese text error correction method, system, terminal equipment and storage medium
CA3156204A1 (en) Domain based text extraction
US10944569B2 (en) Comparison and validation of digital content using contextual analysis
US20240020473A1 (en) Domain Based Text Extraction
CN114049528B (en) Brand name identification method and equipment
CN116341554B (en) Training method of named entity recognition model for biomedical text
CN114282518A (en) Document field element extraction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant