CN113486651A - Method and device for extracting official document relation - Google Patents

Method and device for extracting official document relation

Publication number
CN113486651A
Authority
CN
China
Prior art keywords
official document
entity
sequence
relation
text file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110756360.4A
Other languages
Chinese (zh)
Inventor
聂砂
刘海
贾国琛
罗奕康
崔震
戴菀庭
师文宝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp
Priority to CN202110756360.4A
Publication of CN113486651A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Abstract

The invention discloses a method and a device for extracting official document relations, and relates to the technical field of artificial intelligence. One embodiment of the method comprises: searching for at least one official document entity appearing in an original text file, and, according to a set screening rule, screening out from the at least one official document entity the official document entities whose official document relation needs to be extracted, as target official document entities; replacing the target official document entities in the original text file with a set first character string to obtain a new text file; inputting the new text file into a pre-trained sequence labeling model, which labels the characters in the new text file and outputs a label sequence; and determining the official document relation corresponding to each entity type in the label sequence according to the association between official document relations and entity types. In this embodiment, the entity type corresponding to each official document entity is identified by the sequence labeling model, and the official document relation is determined from it; the text file is shortened before recognition, which safeguards the recognition performance of the model.

Description

Method and device for extracting official document relation
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method and a device for extracting official document relations.
Background
An official document is a written instrument that has legal effect and a standardized format and is used by a statutory body or other social organization in the course of its official business. An official document relation is the relation between a text file itself and an official document that appears in the text file. For example, consider the text file: "According to Regulation A issued in 2020, the relevant department revised some of the provisions of Scheme B." Extracting the official document relations means extracting the relations between this text file and Regulation A and Scheme B, respectively.
In the prior art, the relation between a text file and an official document is usually determined by matching the words immediately before and after the title of the official document. For example, by matching "according to" before "Regulation A", it can be determined that an "according to" relation holds between "Regulation A" and the text file.
In the course of implementing the invention, the prior art was found to have at least the following problems:
Extracting official document relations by matching requires the keywords to be enumerated in advance, such as "according to". However, other keywords, such as "according", "……" and the like, may appear in a text file, so the keywords cannot be listed exhaustively. As a result, this approach cannot correctly match all official document relations, and its accuracy is low. Moreover, several official documents may appear in one text file, and the matching approach must be applied to each of them separately, so all official document relations cannot be extracted in a single pass.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for extracting official document relations. The method converts the official document relation extraction problem into a named entity recognition problem: entity types are used to represent the relation between a text file and an official document entity, and a sequence labeling model identifies which entity type each official document entity appearing in the text file corresponds to. The text file is also processed before recognition so as to reduce its length and highlight the relatively important information, which safeguards the recognition performance of the sequence labeling model.
To achieve the above object, according to an aspect of an embodiment of the present invention, a method for extracting a document relation is provided.
The method for extracting official document relations comprises the following steps: searching for at least one official document entity appearing in an original text file, and, according to a set screening rule, screening out from the at least one official document entity the official document entities whose official document relation needs to be extracted, as target official document entities; replacing each target official document entity in the original text file with a set first character string to obtain a new text file, wherein the length of the first character string is smaller than the text length of the target official document entity; inputting the new text file into a pre-trained sequence labeling model, which labels the characters in the new text file and outputs a label sequence, wherein the labels include the entity types of official document entities defined according to the categories of official document relations; and determining the official document relation corresponding to each entity type in the label sequence according to the association between official document relations and entity types.
Optionally, the screening rule is: the official document entity is related to policy issuing and making; the screening of the official document entities needing to extract the official document relationship from the at least one official document entity as target official document entities comprises the following steps: inputting the at least one official document entity into a pre-trained text classification model, judging whether the at least one official document entity needs to extract an official document relation or not by the text classification model, and outputting a classification result; and taking the official document entity needing to extract the official document relation in the classification result as a target official document entity.
Optionally, the inputting the new text file into a pre-trained sequence labeling model includes: extracting a sentence containing the first character string from the new text file, and inputting the sentence into a pre-trained sequence labeling model; the labeling the characters in the new text file by the sequence labeling model, and outputting a label sequence, including: and labeling the characters in the sentence by the sequence labeling model, and outputting a label sequence.
Optionally, labeling the characters in the sentence by the sequence labeling model and outputting a label sequence comprises: encoding the characters in the sentence with the sequence labeling model to obtain the word vector corresponding to each character; inputting the word vectors into a fully connected layer for dimensionality reduction, and fitting the dimensionality reduction results against the set labels to obtain the probability that each result belongs to each label; and determining the label of each character according to the probabilities, generating a label sequence, and outputting the label sequence.
Optionally, the generating a tag sequence includes: determining an identifier corresponding to the label according to the corresponding relation between the label and a set identifier; and forming the identifiers into a label sequence according to the sequence of the characters in the sentence.
Optionally, the method further comprises: labeling the sentence samples in a first training set to obtain the labels corresponding to the characters in the sentence samples; and inputting the labeled sentence samples into a pre-trained semantic model to obtain the word vectors corresponding to the characters in the labeled sentence samples, and processing the word vectors with a fully connected layer and an activation function to obtain the sequence labeling model.
Optionally, labeling the sentence samples in the first training set comprises: labeling the sentence samples in the first training set with the set identifiers.
Optionally, the tags include an unrelated tag, a start tag corresponding to the entity type, and an end tag.
Optionally, the method further comprises: labeling the official document entity samples in a second training set to obtain the classification result corresponding to each official document entity sample, wherein the classification results include "official document relation needs to be extracted" and "official document relation does not need to be extracted"; and inputting the labeled official document entity samples into a pre-trained semantic model and training the pre-trained semantic model to obtain the text classification model.
Optionally, the method further comprises: and replacing the official document entity which does not need to extract the official document relation in the original text file by using the set second character string.
Optionally, the searching for at least one document entity appearing from the original text file includes: and searching a sub-text containing the designated punctuation marks from the original text file, and taking the part of the sub-text except the punctuation marks as an official document entity.
Optionally, the method further comprises: and defining the entity type of the official document entity according to the type of the official document relation.
Optionally, the categories of official document relations include one or more of: according to, abolish, revise, mention, reply, implement, forward, and print; the entity types include one or more of: according-to official document entity, abolish official document entity, revise official document entity, mention official document entity, reply official document entity, implement official document entity, forward official document entity, and print official document entity.
To achieve the above object, according to another aspect of the embodiments of the present invention, an apparatus for extracting a document relation is provided.
An apparatus for extracting official document relations according to an embodiment of the present invention includes: a screening module, configured to search for at least one official document entity appearing in an original text file and, according to a set screening rule, screen out from the at least one official document entity the official document entities whose official document relation needs to be extracted, as target official document entities; a replacing module, configured to replace each target official document entity in the original text file with a set first character string to obtain a new text file, wherein the length of the first character string is smaller than the text length of the target official document entity; a labeling module, configured to input the new text file into a pre-trained sequence labeling model, which labels the characters in the new text file and outputs a label sequence, wherein the labels include the entity types of official document entities defined according to the categories of official document relations; and a determining module, configured to determine the official document relation corresponding to each entity type in the label sequence according to the association between official document relations and entity types.
Optionally, the screening rule is: the official document entity is related to policy issuing and making; the screening module is further used for inputting the at least one official document entity into a pre-trained text classification model, judging whether the official document entity needs to extract an official document relation or not by the text classification model, and outputting a classification result; and taking the official document entity needing to extract the official document relation in the classification result as a target official document entity.
Optionally, the marking module is further configured to extract a sentence including the first character string from the new text file, and input the sentence into a pre-trained sequence marking model; and labeling the characters in the sentence by the sequence labeling model, and outputting a label sequence.
Optionally, the labeling module is further configured to encode the characters in the sentence with the sequence labeling model to obtain the word vector corresponding to each character; input the word vectors into a fully connected layer for dimensionality reduction, and fit the dimensionality reduction results against the set labels to obtain the probability that each result belongs to each label; and determine the label of each character according to the probabilities, generate a label sequence, and output the label sequence.
Optionally, the marking module is further configured to determine an identifier corresponding to the tag according to a correspondence between the tag and a set identifier; and forming the identifiers into a label sequence according to the sequence of the characters in the sentence.
Optionally, the apparatus further comprises: a first model training module, configured to label the sentence samples in a first training set to obtain the labels corresponding to the characters in the sentence samples; and input the labeled sentence samples into a pre-trained semantic model to obtain the word vectors corresponding to the characters in the labeled sentence samples, and process the word vectors with a fully connected layer and an activation function to obtain the sequence labeling model.
Optionally, the first model training module is further configured to label the sentence samples in the first training set with the set identifiers.
Optionally, the tags include an unrelated tag, a start tag corresponding to the entity type, and an end tag.
Optionally, the apparatus further comprises: a second model training module, configured to label the official document entity samples in a second training set to obtain the classification result corresponding to each official document entity sample, wherein the classification results include "official document relation needs to be extracted" and "official document relation does not need to be extracted"; and input the labeled official document entity samples into a pre-trained semantic model and train the pre-trained semantic model to obtain the text classification model.
Optionally, the apparatus further comprises: and the processing module is used for replacing the official document entity which does not need to extract the official document relation in the original text file by using the set second character string.
Optionally, the screening module is further configured to search a sub-text containing a specified punctuation mark from the original text file, and use a part of the sub-text except the punctuation mark as an official document entity.
Optionally, the apparatus further comprises: and the definition module is used for defining the entity type of the official document entity according to the type of the official document relation.
Optionally, the categories of official document relations include one or more of: according to, abolish, revise, mention, reply, implement, forward, and print; the entity types include one or more of: according-to official document entity, abolish official document entity, revise official document entity, mention official document entity, reply official document entity, implement official document entity, forward official document entity, and print official document entity.
To achieve the above object, according to still another aspect of embodiments of the present invention, there is provided an electronic apparatus.
An electronic device of an embodiment of the present invention includes: one or more processors; the storage device is used for storing one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors realize the official document relation extraction method of the embodiment of the invention.
To achieve the above object, according to still another aspect of embodiments of the present invention, there is provided a computer-readable medium.
A computer-readable medium of an embodiment of the present invention stores thereon a computer program that, when executed by a processor, implements a method of extracting an official document relation of an embodiment of the present invention.
One embodiment of the above invention has the following advantages or benefits: the official document relation extraction problem is converted into a named entity recognition problem; entity types represent the relation between the text file and an official document entity; the sequence labeling model identifies which entity type each official document entity appearing in the text file corresponds to; and the text file is processed before recognition, so that its length is reduced, the relatively important information is highlighted, and the recognition performance of the sequence labeling model is safeguarded.
Further effects of the optional implementations described above are explained below in connection with the specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of the main steps of a document relation extraction method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a main flow of a method for extracting official document relations according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the training principle of the sequence labeling model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the extraction effect of the official document relation extraction method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the main modules of a document relation extraction apparatus according to an embodiment of the present invention;
FIG. 6 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
FIG. 7 is a block diagram of a computer system suitable for use with the electronic device to implement an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The terms related to the embodiments are explained below.
Named entity: a person name, organization name, place name, or any other entity identified by a name.
Named entity recognition: Named Entity Recognition, NER for short, refers to identifying entities with specific meanings in text, mainly including person names, place names, organization names, proper nouns, and the like.
Sequence labeling: the input of a sequence labeling model is an observation sequence (X = x1, x2, x3, x4, ……) and its output is a label sequence (Y = y1, y2, y3, y4, ……). The observation sequence is typically a sentence, and the process of mapping the observation sequence to the label sequence is called sequence labeling.
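The mapping from an observation sequence to a label sequence can be illustrated with a small sketch. It assumes that some model has already produced one score per label for each element of the sequence; the label names (`O`, `B-DOC`, `E-DOC`) and the scores are made up for illustration and are not taken from the embodiment.

```python
import math

# Hypothetical label set: "O" = unrelated, "B-DOC"/"E-DOC" = start/end of an entity.
LABELS = ["O", "B-DOC", "E-DOC"]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def label_sequence(score_rows):
    """Map one row of label scores per observation to the most probable label."""
    result = []
    for scores in score_rows:
        probs = softmax(scores)
        result.append(LABELS[probs.index(max(probs))])
    return result

# Made-up scores for a four-element observation sequence x1..x4.
scores_for_x = [
    [2.0, 0.1, 0.1],
    [0.2, 3.0, 0.1],
    [0.1, 0.3, 2.5],
    [1.5, 0.2, 0.2],
]
print(label_sequence(scores_for_x))  # ['O', 'B-DOC', 'E-DOC', 'O']
```

Each position is labeled independently here; this is only one simple way to realize the mapping, and a trained model may decode differently.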
FIG. 1 is a diagram illustrating the main steps of a document relation extraction method according to an embodiment of the present invention. As shown in fig. 1, the method for extracting a document relation according to the embodiment of the present invention mainly includes the following steps:
Step S101: Search for at least one official document entity appearing in the original text file, and, according to a set screening rule, screen out from the at least one official document entity the official document entities whose official document relation needs to be extracted, as target official document entities. An official document entity usually appears together with particular punctuation marks, such as book title marks (《》). Therefore, when searching for official document entities, these punctuation marks can be matched in the original text file to locate the entities. Taking book title marks as an example, the marks are matched in the original text, and the text enclosed by them is an official document entity.
In actual business, attention is paid to official document entities related to policy issuance and formulation; that is, the official document relation between the original text file and such official document entities needs to be extracted. A screening rule can therefore be configured as required. The screening rule is used to screen out, from the official document entities found, those related to policy issuance and formulation, and the screened official document entities are taken as the target official document entities.
Step S102: Replace the target official document entity in the original text file with the set first character string to obtain a new text file. The length of the first character string is smaller than the text length of the target official document entity. The text of an official document entity is usually long, and if the sequence labeling model were applied to it directly, model prediction would be slow and recognition would be poor. This embodiment therefore replaces the target official document entity with the first character string, for example with the two characters "official document" (公文), so as to reduce the length of the text input to the sequence labeling model.
Step S103: Input the new text file into a pre-trained sequence labeling model, which labels the characters in the new text file and outputs a label sequence. The sequence labeling model must be trained in advance; it labels the characters of an input text file and outputs a label sequence. The labels include the entity types of official document entities, defined according to the categories of official document relations.
The new text file is input into the trained sequence labeling model, which processes it. Specifically, the model labels each character of every sentence in the new text file that contains the first character string, generates a label sequence, and outputs it.
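Selecting the sentences that contain the first character string could look like the following minimal sketch. The placeholder `DOC`, the function name, and the sentence-splitting rule are assumptions for illustration; the embodiment does not specify how sentences are delimited.

```python
import re

# Split after Chinese sentence-ending marks, or after Western ones followed by
# whitespace. Splitting on an empty match requires Python 3.7 or later.
SENTENCE_BREAK = re.compile(r"(?<=[。！？])|(?<=[.!?])\s+")

def sentences_with_placeholder(text, first_string="DOC"):
    """Return only the sentences that contain the placeholder string."""
    sentences = SENTENCE_BREAK.split(text)
    return [s.strip() for s in sentences if first_string in s]

new_text = (
    "The department issued 《DOC》 today. "
    "The meeting was held on Monday. "
    "According to 《DOC》, the rules were revised."
)
for sentence in sentences_with_placeholder(new_text):
    print(sentence)
```

Only the two sentences that mention the placeholder would be passed on to the sequence labeling model. A plain substring test also matches the placeholder inside longer words, so a real first character string would need to be chosen with that in mind.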
Step S104: Determine the official document relation corresponding to each entity type in the label sequence according to the association between official document relations and entity types. The entity types of official document entities are defined in advance according to the categories of official document relations, and the association between official document relations and entity types is stored. The entity type of each official document entity is identified in the label sequence output in step S103, and the stored association is then used to determine the official document relation corresponding to that entity type, thereby extracting the official document relation.
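The lookup in step S104 can be sketched as follows. The label spelling (`O`, `B-<TYPE>`, `E-<TYPE>`) and the dictionary `RELATION_OF_TYPE` are hypothetical; the embodiment only states that an association between official document relations and entity types is stored.

```python
# Hypothetical association between the entity-type keys used in labels and
# the official document relations they stand for.
RELATION_OF_TYPE = {
    "ACCORDING_TO": "according to", "ABOLISH": "abolish", "REVISE": "revise",
    "MENTION": "mention", "REPLY": "reply", "IMPLEMENT": "implement",
    "FORWARD": "forward", "PRINT": "print",
}

def extract_relations(labels):
    """Return one relation per well-formed start/end label pair, in order."""
    relations = []
    open_type = None  # entity type of a start label still waiting for its end label
    for label in labels:
        if label.startswith("B-"):
            open_type = label[2:]
        elif label.startswith("E-") and label[2:] == open_type:
            if open_type in RELATION_OF_TYPE:
                relations.append(RELATION_OF_TYPE[open_type])
            open_type = None
        else:
            # An unrelated label or a mismatched end label cancels any open entity.
            open_type = None
    return relations

label_seq = ["O", "B-ACCORDING_TO", "E-ACCORDING_TO", "O",
             "O", "B-REVISE", "E-REVISE", "O"]
print(extract_relations(label_seq))  # ['according to', 'revise']
```

Because every target entity in the sentence is labeled in the same pass, all relations of a sentence are read off in one traversal of the label sequence.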
Fig. 2 is a schematic diagram of a main flow of a method for extracting a document relation according to an embodiment of the present invention. As shown in fig. 2, the method for extracting a document relation according to the embodiment of the present invention mainly includes the following steps:
Step S201: Define the entity types of official document entities according to the categories of official document relations. In this embodiment, there are 8 kinds of official document relations to be extracted: according to, abolish, revise, mention, reply, implement, forward, and print. The entity types of official document entities are accordingly defined as 8 corresponding types: according-to official document entity, abolish official document entity, revise official document entity, mention official document entity, reply official document entity, implement official document entity, forward official document entity, and print official document entity.
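As a minimal sketch, the eight relation kinds and their entity types can be kept in plain dictionaries, together with a label set and numeric identifiers. The names `ENTITY_TYPE_OF`, `LABELS`, and `LABEL_TO_ID`, the `O`/`B-`/`E-` spelling, and the choice of one start label and one end label per entity type are assumptions for illustration.

```python
RELATIONS = ["according to", "abolish", "revise", "mention",
             "reply", "implement", "forward", "print"]

# One entity type per relation kind, and the reverse lookup used after labeling.
ENTITY_TYPE_OF = {rel: f"{rel} official document entity" for rel in RELATIONS}
RELATION_OF = {etype: rel for rel, etype in ENTITY_TYPE_OF.items()}

def build_labels(relations):
    """Build an unrelated label plus a start and an end label per entity type."""
    labels = ["O"]
    for rel in relations:
        key = rel.upper().replace(" ", "_")
        labels += [f"B-{key}", f"E-{key}"]
    return labels

LABELS = build_labels(RELATIONS)
# Numeric identifiers for the labels, in the order the labels were built.
LABEL_TO_ID = {label: i for i, label in enumerate(LABELS)}

print(len(LABELS))                                    # 17
print(RELATION_OF["revise official document entity"])  # revise
```

With this layout there are 17 labels in total: one unrelated label and two labels for each of the eight entity types.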
Step S202: Search for at least one official document entity appearing in the original text file. When searching for official document entities, sub-texts containing the designated punctuation marks can be located in the original text file, and the part of each sub-text other than the punctuation marks is taken as an official document entity. In an embodiment, the original text file may be a policy document file.
Since official documents cited in a policy document are generally enclosed in book title marks (《》), the punctuation marks here are book title marks. Searching for the book title marks in the original text file means matching them in the original text file and taking the part enclosed by them as an official document entity. For example, if the original text file reads "According to the relevant provisions of 《Trade White Paper》", the official document entity is: Trade White Paper.
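Matching the book title marks can be sketched with a regular expression. The pattern and function name below are illustrative assumptions rather than the embodiment's implementation, and the sample sentence is made up.

```python
import re

# Matches the text enclosed by a pair of book title marks 《 》.
TITLE_PATTERN = re.compile(r"《([^《》]+)》")

def find_document_entities(text):
    """Return the text inside each pair of book title marks, in order of appearance."""
    return TITLE_PATTERN.findall(text)

sample = "According to 《Trade White Paper》, the department revised 《Scheme B》."
print(find_document_entities(sample))  # ['Trade White Paper', 'Scheme B']
```

The pattern excludes the marks themselves from the captured text. It does not handle nested book title marks; for a title that itself quotes another title, only the innermost title would be returned.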
Step S203: According to a set screening rule, screen out from the at least one official document entity the official document entities whose official document relation needs to be extracted, as target official document entities. In actual business, attention is paid to official document entities related to policy issuance and formulation; that is, the official document relation between such entities and the original text file needs to be extracted. The original text file may, however, also contain official document entities unrelated to policy issuance and formulation, whose official document relation does not need to be extracted.
Therefore, the official document entities found can be divided by the screening rule into two classes: those whose official document relation needs to be extracted and those whose official document relation does not. The screening rule can be customized by the user, as long as it screens out the official document entities whose official document relation needs to be extracted. In an embodiment, the screening rule is that the official document entity relates to policy issuance and formulation. Illustratively, assume the contents of the original text file are as follows:
(1) The relevant department issued 《Requirement on Enhanced Market Supervision》 (No. 203)
(2) The notice explains …… about 《National Civilization Cultivation Rules》
(3) According to the relevant provisions …… of 《Trade White Paper》
In the above example, the "Requirement on Enhanced Market Supervision" in item (1) is a fairly standard official document that needs attention: it relates to policy issuance and formulation, so its official document relation needs to be extracted. The "National Civilization Cultivation Rules" in item (2) is unrelated to policy issuance and formulation, so its official document relation does not need to be extracted. The "Trade White Paper" in item (3) relates to policy issuance and formulation, so its official document relation needs to be extracted.
To screen out, accurately and quickly, the official document entities whose official document relation needs to be extracted from those found in step S202, a text classification model may be trained in advance on labeled official document entity data. The text classification model is then used to screen the official document entities whose official document relation needs to be extracted.
In an embodiment, the text classification model may be fastText (a fast text classification algorithm), TextCNN (a convolutional neural network for text classification), a pre-trained semantic model (BERT), or the like. A specific implementation of the screening may be: input the official document entities found in step S202 into the pre-trained text classification model, which determines for each official document entity whether its official document relation needs to be extracted, and outputs the classification results.
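The screening step can be sketched independently of the concrete classifier. In the sketch below, `needs_extraction` stands for any trained text classification model wrapped as a function; `keyword_classifier` is only a hypothetical stand-in with made-up keywords so that the example runs, and it is not the model described in the embodiment.

```python
def screen_entities(entities, needs_extraction):
    """Split entities into (targets, remaining) using a yes/no classifier."""
    targets, remaining = [], []
    for entity in entities:
        if needs_extraction(entity):
            targets.append(entity)
        else:
            remaining.append(entity)
    return targets, remaining

def keyword_classifier(title):
    """Stand-in for a trained model: flags titles containing policy-like keywords."""
    keywords = ("requirement", "white paper", "regulation")  # made-up keyword list
    return any(k in title.lower() for k in keywords)

found = [
    "Requirement on Enhanced Market Supervision",
    "National Civilization Cultivation Rules",
    "Trade White Paper",
]
targets, remaining = screen_entities(found, keyword_classifier)
print(targets)    # ['Requirement on Enhanced Market Supervision', 'Trade White Paper']
print(remaining)  # ['National Civilization Cultivation Rules']
```

Passing the classifier in as a function keeps the screening logic unchanged when the stand-in is swapped for a fastText, TextCNN, or BERT-based model.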
Step S204: replacing the target official document entity and the remaining official document entities in the original text file with the set first character string and the set second character string, respectively, to obtain a new text file. After step S203, the official document entities whose official document relation needs to be extracted have been identified. These are taken as the target official document entities, and the official document entities whose official document relation does not need to be extracted are taken as the remaining official document entities.
Because the titles of official document entities are usually long, training the sequence labeling model on the unmodified text takes a long time, trains poorly (exploding gradients, vanishing gradients, and the like), and yields slow model prediction. In addition, in a sentence containing an official document entity, the official document relation corresponding to that entity is judged from the context of the entity, and the title text of the entity itself has little influence on the relation. This embodiment therefore simplifies the official document entities.
Specifically, the target official document entity is replaced with the first character string, and the remaining official document entities are replaced with the second character string. The first character string and the second character string may be Chinese characters, letters, numbers, symbols, or any combination of these. The length of the first character string is smaller than the text length of the target official document entity, and the length of the second character string is smaller than the text length of the remaining official document entities. It can be understood that, in this embodiment, replacing only the target official document entity already reduces the length of the original text file, so the remaining official document entities may be left unreplaced.
In an embodiment, the target official document entity may be replaced with the two characters "official document", and the remaining official document entities may be replaced with "*". For example, for the original text file 'According to the relevant provisions of the "Industry White Paper", the "Notification about Safe Production Work" is formulated', the new text file generated after replacement is: 'According to the relevant provisions of the official document, * is formulated'. This processing shortens the sentence, highlights the relatively important information, and prevents overlong official document entities from affecting the fitting of the sequence labeling model.
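The replacement in step S204 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the Chinese sentence is a hypothetical reconstruction of the example above, entities are assumed to be enclosed in book title marks, and `demo_is_target` is a hypothetical stand-in for the text classification model of step S203.

```python
import re

# Assumed placeholders: a two-character word for target entities and "*"
# for the remaining entities, following the example in the text.
TARGET_PLACEHOLDER = "公文"
OTHER_PLACEHOLDER = "*"

# Assumption: each official document entity is enclosed in book title marks.
TITLE_PATTERN = re.compile(r"《[^《》]+》")


def simplify_entities(text, is_target):
    """Replace every title-marked entity with a short placeholder string."""
    def substitute(match):
        title = match.group(0)[1:-1]  # strip the surrounding title marks
        return TARGET_PLACEHOLDER if is_target(title) else OTHER_PLACEHOLDER
    return TITLE_PATTERN.sub(substitute, text)


def demo_is_target(title):
    # Hypothetical stand-in for the trained text classification model.
    return "白皮书" in title


# Hypothetical reconstruction of the example sentence.
original = "根据《行业白皮书》的相关规定，制定《关于安全生产工作的通知》"
shortened = simplify_entities(original, demo_is_target)
print(shortened)
```

Passing the classifier in as a function keeps the replacement step independent of whichever screening model is actually used.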
Step S205: inputting the new text file into a pre-trained sequence labeling model, which labels the characters in the new text file and outputs a label sequence. Specifically, a sentence containing the first character string is extracted from the new text file and input into the pre-trained sequence labeling model; the model labels the characters in the sentence and outputs a label sequence. The labels comprise an irrelevant label, start labels corresponding to the entity types, end labels corresponding to the entity types, and the entity types of the official document entities.
It can be understood that, if the new text file is a single sentence, it is input directly into the pre-trained sequence labeling model without extraction. The sequence labeling model can process each character in the sentence as follows:
First, the characters in the sentence are encoded by the sequence labeling model to obtain the word vectors corresponding to the characters. The word vectors are then input into a fully connected layer for dimensionality reduction, and the dimensionality reduction results are fitted to the set labels to obtain the probability that each result belongs to each label. Finally, the label of each character is determined according to the probabilities, and the label sequence is generated and output. In an embodiment, the sequence labeling model may be obtained from a BERT model.
Fig. 3 is a schematic diagram of a training principle of the sequence annotation model according to the embodiment of the present invention. The training process of the sequence annotation model is described below with reference to fig. 3.
Firstly, sentence samples in a first training set are labeled to obtain labels corresponding to characters in the sentence samples, then the labeled sentence samples are input into a BERT model to obtain word vectors corresponding to the characters in the labeled sentence samples, and then the word vectors are processed by using a full connection layer (Dense layer) and an activation function (such as softmax), so that a sequence labeling model can be obtained. Wherein, the sentence samples in the first training set are sentences containing the official document relation to be extracted.
When the sentence samples are labeled, the sentence samples in the first training set may be labeled using set identifiers. In this embodiment, the irrelevant label is represented by "O", the start label of an official document entity of the m-th entity type is represented by "Bm", the end label of an official document entity of the m-th entity type is represented by "Em", and the number m (m takes 1 to 8) represents the entity type among the 8 entity types of official document entities.
For a sequence labeling task, a start mark (CLS) and an end mark (SEP) need to be added to the sentence sample before it is processed by the BERT model. A token encoder (Token Encoder) and a segment encoder (Segment Encoder) then encode each character in the sentence sample to obtain the encoding identifier corresponding to each character. The encoding identifiers are input into the BERT model to obtain the corresponding word vectors, the word vectors are input into a Dense layer for dimensionality reduction, and the dimensionality reduction results are fitted to the labels using an activation function (such as softmax).
Specifically, the output of the BERT model is: <CLS> v1, v2, v3, …, vn <SEP>, where vn represents the word vector corresponding to the n-th character in the sentence sample, and n is the length of the sentence sample. The resulting word vectors have dimension n × 768 (768 dimensions per word vector). The n × 768 word vectors are input into the fully connected layer for dimensionality reduction, and the dimensionality reduction results are fitted to the labels; the output dimension at this point is n × (number of labels).
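The shape change from n × 768 to n × (number of labels) can be illustrated with a small pure-Python sketch. It assumes random numbers in place of real BERT outputs and trained weights, and it uses the label count of 25 from this embodiment; the helper names `dense` and `softmax` are illustrative only.

```python
import math
import random

HIDDEN_SIZE = 768   # dimension of each character vector, as stated above
NUM_LABELS = 25     # label count used in this embodiment


def dense(vector, weights, bias):
    """Fully connected layer: project one hidden vector to one score per label."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
n = 6  # sentence length in characters (illustrative)

# Random stand-ins for the BERT output vectors and the trained layer weights.
char_vectors = [[rng.uniform(-1, 1) for _ in range(HIDDEN_SIZE)] for _ in range(n)]
weights = [[rng.uniform(-0.05, 0.05) for _ in range(HIDDEN_SIZE)]
           for _ in range(NUM_LABELS)]
bias = [0.0] * NUM_LABELS

# n x 768 in, n x 25 out; the most probable label index is chosen per character.
probabilities = [softmax(dense(v, weights, bias)) for v in char_vectors]
label_ids = [max(range(NUM_LABELS), key=row.__getitem__) for row in probabilities]
print(len(probabilities), len(probabilities[0]))
```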
There are 8 entity types corresponding to the official document relations to be extracted. Each entity type corresponds to three labels (a start label, an end label, and the entity type label itself, as listed in step S205), and there is one irrelevant label, so the number of labels is: 8 × 3 + 1 = 25. The probability that a word vector corresponds to each label is calculated using the softmax function, and the label with the highest probability is taken as the label of the word vector (that is, the word vector is fitted to the label).
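The label count can be checked by enumerating the label set. The sketch below assumes that the third label of each entity type is the bare type number m, which is one reading of the labels listed in step S205; the function name is illustrative.

```python
NUM_ENTITY_TYPES = 8  # number of entity types stated in the embodiment


def build_label_set(num_types=NUM_ENTITY_TYPES):
    """Enumerate the irrelevant label plus three labels per entity type."""
    labels = ["O"]  # irrelevant label
    for m in range(1, num_types + 1):
        # Start label, end label, and (assumed) bare entity type label.
        labels.extend([f"B{m}", f"E{m}", str(m)])
    return labels


labels = build_label_set()
print(len(labels))
```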
After the label corresponding to each character in the statement sample is obtained, the identifier corresponding to the label is determined according to the corresponding relation between the label and the set identifier, and then the identifiers form a label sequence according to the sequence of the characters in the statement sample.
For example, suppose the input of the sequence labeling model is the sentence "According to the official document of the department, it is decided to make corrections", and the labeled data gives the official document relation as the "according to" relation. Assuming that the number "1" denotes the "according to" official document entity type, the two characters of "official document" correspond to entity type "1" and are labeled "B1" and "E1", and all other characters are labeled "O".
Step S206: determining the official document relation corresponding to the entity type in the label sequence according to the association relation between official document relations and entity types. In this embodiment, the official document relation extraction problem is converted into a named entity recognition problem: the input of the sequence labeling model is a processed sentence containing an official document entity, and the output is a label sequence that identifies the entity type of the official document entity. The numbers 1 to 8 in the label sequence are the entity types to be extracted, and they correspond to the 8 official document relations to be extracted.
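The lookup in step S206 can be sketched as a mapping from the type number in each start label to a relation name. In this hedged example the association table contains only type 1 ("according to"), the one relation named in the text; the Chinese sentence is a hypothetical reconstruction of the earlier example, and the names `RELATION_BY_TYPE` and `relations_in` are illustrative.

```python
import re

# Hypothetical association table; only type 1 is named in the text.
RELATION_BY_TYPE = {1: "according to"}


def relations_in(label_sequence):
    """Return one relation per entity, read from the start labels Bm."""
    found = []
    for label in label_sequence:
        match = re.fullmatch(r"B(\d+)", label)
        if match:
            entity_type = int(match.group(1))
            # Fall back to a generic name for types not in the table.
            found.append(RELATION_BY_TYPE.get(entity_type, f"type {entity_type}"))
    return found


# Hypothetical shortened sentence and the label sequence described above:
# the two placeholder characters carry B1 and E1, everything else is O.
characters = list("根据部门公文决定整改")
label_sequence = ["O", "O", "O", "O", "B1", "E1", "O", "O", "O", "O"]
print(relations_in(label_sequence))
```

Because every entity in the sentence contributes its own start label, a single pass over the label sequence yields all relations in that sentence.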
Fig. 4 is a schematic diagram illustrating the extraction effect of the official document relation extraction method according to the embodiment of the present invention. In fig. 4, precision represents the precision rate; recall represents the recall rate; f1-score represents the F1 value, which is the harmonic mean of precision and recall; support represents the number of actual samples in each category; macro avg represents the macro average, obtained by taking the unweighted mean of the precision rate, recall rate, and F1 value over the categories; micro avg represents the micro average, which does not distinguish sample categories and calculates the overall precision rate, recall rate, and F1 value; weighted avg represents a weighted average in which each category is weighted by its share of the total samples; valid represents the performance on the validation set.
According to these data, all metrics of the official document relation extraction method meet the requirements; the method saves labor cost and improves accuracy.
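The metrics named above can be computed as in the following sketch. It uses a small made-up label set rather than the data of fig. 4, and the function name `classification_report` is chosen for illustration only.

```python
from collections import Counter


def classification_report(y_true, y_pred):
    """Per-class precision/recall/F1 plus macro, micro, and weighted averages."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1   # predicted class gains a false positive
            fn[truth] += 1   # actual class gains a false negative

    def prf(t, p, n):
        precision = t / (t + p) if t + p else 0.0
        recall = t / (t + n) if t + n else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        return precision, recall, f1

    per_class = {c: prf(tp[c], fp[c], fn[c]) for c in classes}
    support = Counter(y_true)  # number of actual samples per class
    total = len(y_true)
    macro = tuple(sum(per_class[c][i] for c in classes) / len(classes)
                  for i in range(3))
    weighted = tuple(sum(per_class[c][i] * support[c] for c in classes) / total
                     for i in range(3))
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return per_class, support, macro, micro, weighted


# Made-up labels for illustration only.
y_true = ["B1", "E1", "O", "O", "B2", "O"]
y_pred = ["B1", "E1", "O", "B2", "O", "O"]
per_class, support, macro, micro, weighted = classification_report(y_true, y_pred)
print(per_class["O"], macro, micro, weighted)
```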
In an alternative embodiment, the training process of the text classification model is as follows. First, the official document entity samples of a second training set are labeled to obtain the classification result corresponding to each official document entity sample. The classification results comprise "official document relation needs to be extracted" and "official document relation does not need to be extracted", and each official document entity sample is a sentence containing an official document entity. The labeled official document entity samples are then input into a BERT model, and the BERT model is trained (that is, a CLS mark is inserted in front of the official document entity sample, and text classification is performed using the CLS vector corresponding to the CLS mark), thereby obtaining the text classification model.
In this embodiment, the official document relation extraction problem is converted into a named entity recognition problem. The official document entities appearing in a government official document are classified so that one entity type represents one relation between the government official document and an official document entity (for example, the "according to" official document entity type represents the "according to" relation). Sequence labeling then identifies the entity types of the official document entities appearing in each sentence of the government official document that contains a target official document entity, so that all official document relations in the sentence are extracted in a single pass and accuracy is improved. Meanwhile, replacing each official document entity with a short character string shortens the model training and fitting time.
Fig. 5 is a schematic diagram of main modules of a document relation extraction apparatus according to an embodiment of the present invention. As shown in fig. 5, the apparatus 500 for extracting document relation according to the embodiment of the present invention mainly includes:
The screening module 501 is configured to search for at least one official document entity appearing in the original text file, and to screen out, from the at least one official document entity and according to a set screening rule, an official document entity whose official document relation needs to be extracted as a target official document entity. An official document entity typically appears together with a particular punctuation mark, such as book title marks. Therefore, when searching for official document entities, the punctuation mark can be matched in the original text file to determine the official document entities. Taking book title marks as an example, the marks are matched in the original text, and the text they enclose is an official document entity.
In actual business, official document entities related to policy issuance and formulation need attention; that is, the official document relation between the original text file and such an official document entity needs to be extracted. A screening rule can therefore be configured as required. The screening rule screens out, from the official document entities found, those related to policy issuance and formulation, and the screened official document entities are taken as target official document entities.
The replacing module 502 is configured to replace the target official document entity in the original text file with the set first character string to obtain a new text file, where the length of the first character string is smaller than the text length of the target official document entity. The text of an official document entity is usually long; if the sequence labeling model is applied to it directly, model prediction is slow and the recognition effect is poor. This embodiment therefore replaces the target official document entity with the first character string, for example with the two characters "official document", so as to reduce the length of the text input into the sequence labeling model.
The marking module 503 is configured to input the new text file into a pre-trained sequence labeling model, which labels the characters in the new text file and outputs a label sequence. The sequence labeling model needs to be trained in advance; it labels the characters in an input text file and outputs a label sequence. The labels include the entity types of official document entities, defined according to the kinds of official document relations.
The new text file is input into the trained sequence labeling model, which processes it. Specifically, the model labels each character in the sentence of the new text file that contains the first character string, then generates and outputs the label sequence.
The determining module 504 is configured to determine the official document relation corresponding to the entity type in the label sequence according to the association relation between official document relations and entity types. The entity types of official document entities are defined in advance according to the kinds of official document relations, and the association relation between official document relations and entity types is stored. The label sequence output by the marking module 503 identifies the entity type of the official document entity; combined with the association relation, the official document relation corresponding to that entity type can be determined, thereby realizing official document relation extraction.
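The entity search described for the screening module 501 can be sketched with a regular expression over book title marks. This is a minimal sketch under stated assumptions: the sample sentence is hypothetical, and real official documents may need handling for nested or unbalanced marks, which this pattern simply skips.

```python
import re

# Match the text enclosed by one pair of book title marks (no nesting).
TITLE_PATTERN = re.compile(r"《([^《》]+)》")


def find_document_entities(text):
    """Return (title, start, end) for every title-marked span in the text."""
    return [(m.group(1), m.start(), m.end()) for m in TITLE_PATTERN.finditer(text)]


# Hypothetical sample sentence containing two official document entities.
sample = "根据《行业白皮书》的相关规定，制定《关于安全生产工作的通知》"
entities = find_document_entities(sample)
for title, start, end in entities:
    print(title, start, end)
```

Returning character offsets alongside each title lets a later replacement step locate the entity in the original text without searching again.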
In addition, the official document relation extraction apparatus 500 according to the embodiment of the present invention may further include: a first model training module, a second model training module, a processing module, and a definition module (not shown in FIG. 5). The first model training module is configured to label the sentence samples in a first training set to obtain the labels corresponding to the characters in the sentence samples; and to input the labeled sentence samples into a pre-trained semantic model to obtain the word vectors corresponding to the characters in the labeled sentence samples, and process the word vectors using a fully connected layer and an activation function to obtain the sequence labeling model.
The second model training module is configured to label the official document entity samples of a second training set to obtain the classification result corresponding to each official document entity sample, where the classification results comprise "official document relation needs to be extracted" and "official document relation does not need to be extracted"; and to input the labeled official document entity samples into a pre-trained semantic model and train the pre-trained semantic model to obtain the text classification model.
The processing module is configured to replace, with the set second character string, the official document entities in the original text file whose official document relation does not need to be extracted. The definition module is configured to define the entity types of official document entities according to the kinds of official document relations.
From the above description, it can be seen that the official document relation extraction problem is converted into a named entity identification problem, the entity type represents the relation between the text file and the official document entity, which entity type the official document entity appearing in the text file corresponds to is identified through the sequence marking model, and meanwhile, the text file is processed before identification, so that the length of the text file is shortened, relatively important information is highlighted, and the identification effect of the sequence marking model is ensured.
Fig. 6 shows an exemplary system architecture 600 to which the method or apparatus for extracting official document relations according to the embodiments of the present invention can be applied.
As shown in fig. 6, the system architecture 600 may include terminal devices 601, 602, 603, a network 604, and a server 605. The network 604 serves as a medium for providing communication links between the terminal devices 601, 602, 603 and the server 605. The network 604 may include various types of connections, such as wired links, wireless communication links, or fiber optic cables.
A user may use the terminal devices 601, 602, 603 to interact with the server 605 via the network 604 to receive or send messages or the like. The terminal devices 601, 602, 603 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 605 may be a server that provides various services, such as a background management server that provides support for original text files input by users using the terminal devices 601, 602, 603. The background management server can search the received original text file for the official document entity, filter the official document entity, replace the official document entity, label the characters in the new text file, determine the official document relationship and the like, and feed back the processing result (such as the official document relationship) to the terminal equipment.
It should be noted that the method for extracting the document relationship provided by the embodiment of the present invention is generally executed by the server 605, and accordingly, the document relationship extracting apparatus is generally disposed in the server 605.
It should be understood that the number of terminal devices, networks, and servers in fig. 6 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The invention also provides an electronic device and a computer readable medium according to the embodiment of the invention.
The electronic device of the present invention includes: one or more processors; the storage device is used for storing one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors realize the official document relation extraction method of the embodiment of the invention.
The computer readable medium of the present invention has stored thereon a computer program which, when executed by a processor, implements a document relation extraction method of an embodiment of the present invention.
Referring now to FIG. 7, shown is a block diagram of a computer system 700 suitable for use with the electronic device implementing an embodiment of the present invention. The terminal device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU)701, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read from it is installed into the storage section 708 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 701.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a screening module, a replacement module, a marking module, and a determination module. The names of the modules do not form a limitation on the modules themselves in some cases, for example, the screening module may also be described as a module that searches at least one document entity appearing in the original text file, and screens out a document entity needing to extract document relationships from the at least one document entity as a target document entity according to a set screening rule.
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to perform: searching for at least one official document entity appearing in the original text file, and screening out, from the at least one official document entity and according to a set screening rule, an official document entity whose official document relation needs to be extracted as a target official document entity; replacing the target official document entity in the original text file with a set first character string to obtain a new text file, wherein the length of the first character string is smaller than the text length of the target official document entity; inputting the new text file into a pre-trained sequence labeling model, labeling the characters in the new text file by the sequence labeling model, and outputting a label sequence, wherein the labels comprise the entity types of official document entities defined according to the kinds of official document relations; and determining the official document relation corresponding to the entity type in the label sequence according to the association relation between official document relations and entity types.
According to the technical scheme of the embodiment of the invention, the official document relation extraction problem is converted into a named entity identification problem, the entity type represents the relation between the text file and the official document entity, which entity type corresponds to the official document entity appearing in the text file is identified through the sequence marking model, and meanwhile, the text file is processed before identification, so that the length of the text file is shortened, relatively important information is highlighted, and the identification effect of the sequence marking model is ensured.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (17)

1. A method for extracting official document relation, comprising:
searching at least one official document entity appearing in the original text file, and screening the official document entity needing to extract the official document relation from the at least one official document entity as a target official document entity according to a set screening rule;
replacing a target official document entity in the original text file by using a set first character string to obtain a new text file; wherein the length of the first character string is smaller than the text length of the target official document entity;
inputting the new text file into a pre-trained sequence labeling model, labeling characters in the new text file by the sequence labeling model, and outputting a label sequence; wherein the label comprises an entity type of the official document entity defined according to the category of the official document relation;
and determining the official document relation corresponding to the entity type in the label sequence according to the incidence relation between the official document relation and the entity type.
2. The method of claim 1, wherein the screening rule is: the official document entity is related to policy issuance and formulation;
the screening of the official document entities needing to extract the official document relationship from the at least one official document entity as target official document entities comprises the following steps:
inputting the at least one official document entity into a pre-trained text classification model, judging whether the at least one official document entity needs to extract an official document relation or not by the text classification model, and outputting a classification result;
and taking the official document entity needing to extract the official document relation in the classification result as a target official document entity.
3. The method of claim 1, wherein inputting the new text file into a pre-trained sequence annotation model comprises:
extracting a sentence containing the first character string from the new text file, and inputting the sentence into a pre-trained sequence labeling model;
the labeling the characters in the new text file by the sequence labeling model, and outputting a label sequence, including:
and labeling the characters in the sentence by the sequence labeling model, and outputting a label sequence.
4. The method of claim 3, wherein labeling the characters in the sentence by the sequence labeling model and outputting a label sequence comprises:
encoding the characters in the sentence with the sequence labeling model to obtain a word vector corresponding to each character;
inputting the word vectors into a fully connected layer for dimensionality reduction, and fitting the dimensionality reduction results to the set labels to obtain the probability that each result belongs to each label;
and determining the label of each character according to the probabilities, and generating and outputting a label sequence.
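The projection and label selection described in claim 4 can be sketched in plain Python as below. This is a minimal sketch under stated assumptions: the character vectors are taken as given (the encoder is not shown), softmax is assumed as the function that turns scores into probabilities, and the label names, `weights`, and `bias` are hypothetical values rather than anything disclosed in the claims.

```python
import math

# Hypothetical label set: one irrelevant label plus start/end labels for one entity type.
LABELS = ["O", "B-basis", "E-basis"]


def fully_connected(vector, weights, bias):
    """Project one character vector down to one score per label."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def softmax(scores):
    """Turn raw scores into probabilities (assumed activation function)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_characters(char_vectors, weights, bias, labels=LABELS):
    """Pick the most probable label for each character vector."""
    sequence = []
    for vector in char_vectors:
        probs = softmax(fully_connected(vector, weights, bias))
        best = max(range(len(labels)), key=probs.__getitem__)
        sequence.append(labels[best])
    return sequence
```

Here `weights` has one row per label, so the layer reduces each character vector to as many dimensions as there are labels.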
5. The method of claim 4, wherein generating the label sequence comprises:
determining the identifier corresponding to each label according to the correspondence between labels and set identifiers;
and arranging the identifiers into a label sequence in the order of the characters in the sentence.
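The label-to-identifier lookup in claim 5 amounts to a table lookup applied in character order. In the sketch below, the table `LABEL_TO_ID` and its integer identifiers are hypothetical; the claims only state that such a correspondence is set in advance.

```python
# Hypothetical correspondence between labels and set identifiers.
LABEL_TO_ID = {"O": 0, "B-basis": 1, "E-basis": 2}


def to_identifier_sequence(char_labels, table=LABEL_TO_ID):
    """Map per-character labels to identifiers, preserving character order."""
    # An unknown label raises KeyError, which surfaces a mismatch with the table.
    return [table[label] for label in char_labels]
```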
6. The method of claim 4, further comprising:
labeling the sentence samples in a first training set to obtain labels corresponding to the characters in the sentence samples;
and inputting the labeled sentence samples into a pre-trained semantic model to obtain word vectors corresponding to the characters in the labeled sentence samples, and processing the word vectors with a fully connected layer and an activation function to obtain the sequence labeling model.
7. The method of claim 6, wherein labeling the sentence samples in the first training set comprises:
labeling the sentence samples in the first training set with set identifiers.
8. The method of any one of claims 3 to 7, wherein the labels comprise an irrelevant label, and a start label and an end label corresponding to the entity type.
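One way to picture the irrelevant, start, and end labels of claim 8 is the encode/decode pair sketched below. The `O`, `B-`, and `E-` spellings are assumptions borrowed from common sequence labeling practice, as is the choice to leave characters between a start and an end label marked as irrelevant; the claims do not fix either detail.

```python
IRRELEVANT = "O"  # assumed spelling of the irrelevant label


def encode_spans(length, spans):
    """Build per-character labels from (start, end, entity_type) spans.

    `end` is inclusive. Spans must cover at least two characters so the
    start and end labels do not collide on the same position.
    """
    labels = [IRRELEVANT] * length
    for start, end, entity_type in spans:
        if end <= start:
            raise ValueError("span must cover at least two characters")
        labels[start] = f"B-{entity_type}"
        labels[end] = f"E-{entity_type}"
    return labels


def decode_spans(labels):
    """Recover (start, end, entity_type) spans from a label sequence."""
    spans = []
    open_start, open_type = None, None
    for i, label in enumerate(labels):
        if label.startswith("B-"):
            open_start, open_type = i, label[2:]
        elif label.startswith("E-") and label[2:] == open_type:
            spans.append((open_start, i, open_type))
            open_start, open_type = None, None
    return spans
```

An end label without a matching start label is ignored by `decode_spans`, which is a design choice for the sketch rather than behavior stated in the source.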
9. The method of claim 2, further comprising:
labeling the official document entity samples of a second training set to obtain a classification result corresponding to each official document entity sample, the classification result being either that official document relation extraction is required or that it is not required;
and inputting the labeled official document entity samples into a pre-trained semantic model and training the pre-trained semantic model to obtain the text classification model.
10. The method according to any one of claims 1 to 7 and 9, further comprising:
replacing the official document entities in the original text file that do not require official document relation extraction with a set second character string.
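The two replacements, target entities by the first character string and the remaining entities by the second character string, could look roughly like the sketch below. The placeholder values `[DOC]` and `[OTHER]` and the function name `replace_entities` are hypothetical. The length check reflects the stated condition that the first character string is shorter than the target entity it replaces.

```python
FIRST_STRING = "[DOC]"     # hypothetical first character string (targets)
SECOND_STRING = "[OTHER]"  # hypothetical second character string (non-targets)


def replace_entities(text, entities, targets,
                     first=FIRST_STRING, second=SECOND_STRING):
    """Replace target entities with `first` and all other entities with `second`."""
    # Longest entities first, so an entity nested inside a longer one
    # does not break up the longer match.
    for entity in sorted(set(entities), key=len, reverse=True):
        if entity in targets:
            if len(first) >= len(entity):
                raise ValueError("first string must be shorter than the target entity")
            text = text.replace(entity, first)
        else:
            text = text.replace(entity, second)
    return text
```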
11. The method according to any one of claims 1 to 7 and 9, wherein searching for at least one official document entity appearing in the original text file comprises:
searching the original text file for a sub-text containing designated punctuation marks, and taking the part of the sub-text other than the punctuation marks as an official document entity.
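A punctuation-based search of this kind is straightforward to express with a regular expression. The sketch below assumes the designated punctuation marks are the Chinese title marks 《 and 》, which is an assumption for illustration; the claim only says the marks are designated. The function name `find_document_entities` is likewise hypothetical.

```python
import re

# Assumed designated punctuation: Chinese title marks.
_TITLE = re.compile(r"《([^《》]+)》")


def find_document_entities(text):
    """Return the distinct strings enclosed in title marks, in order of first appearance."""
    found = []
    for match in _TITLE.finditer(text):
        name = match.group(1)  # the sub-text without the punctuation marks
        if name not in found:
            found.append(name)
    return found
```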
12. The method according to any one of claims 1 to 7 and 9, further comprising:
defining the entity type of the official document entity according to the category of the official document relation.
13. The method of claim 12, wherein the categories of the official document relation comprise one or more of: according to, abolish, revise, mention, reply, implement, forward and print;
the entity types comprise one or more of: a basis official document entity, an abolished official document entity, a revised official document entity, a mentioned official document entity, a replied-to official document entity, an implemented official document entity, a forwarded official document entity and a printed official document entity.
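Because each entity type is defined from one relation category, the association between them can be held in a simple lookup table, as sketched below. The relation names follow the translated list in claim 13; the `_entity` suffix used for the entity type keys and the function name `relations_for` are naming assumptions made for the example.

```python
# Relation categories as listed in claim 13 (translated wording).
RELATIONS = ["according_to", "abolish", "revise", "mention",
             "reply", "implement", "forward", "print"]

# Hypothetical naming: one entity type per relation category.
ENTITY_TYPE_TO_RELATION = {f"{name}_entity": name for name in RELATIONS}


def relations_for(entity_types, table=ENTITY_TYPE_TO_RELATION):
    """Look up the relation for each recognized entity type, skipping unknown types."""
    return [table[t] for t in entity_types if t in table]
```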
14. An official document relation extraction device, comprising:
a screening module, configured to search for at least one official document entity appearing in an original text file, and to screen, from the at least one official document entity and according to a set screening rule, the official document entities requiring official document relation extraction as target official document entities;
a replacing module, configured to replace the target official document entity in the original text file with a set first character string to obtain a new text file, wherein the length of the first character string is smaller than the text length of the target official document entity;
a labeling module, configured to input the new text file into a pre-trained sequence labeling model, the sequence labeling model labeling characters in the new text file and outputting a label sequence, wherein the label comprises an entity type of the official document entity defined according to the category of the official document relation;
and a determining module, configured to determine the official document relation corresponding to the entity type in the label sequence according to the association between official document relations and entity types.
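To show how the four modules of claim 14 might fit together, the sketch below wires them into one class with the models passed in as plain callables. Everything here is illustrative: the class name `RelationExtractor`, the field names, the `[DOC]` placeholder, and the assumption that start labels are spelled `B-<entity type>` are not taken from the source, and the two models are stand-ins supplied by the caller rather than trained networks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RelationExtractor:
    find_entities: Callable[[str], List[str]]  # search part of the screening module
    needs_relation: Callable[[str], bool]      # stand-in for the text classification model
    label_text: Callable[[str], List[str]]     # stand-in for the sequence labeling model
    type_to_relation: Dict[str, str]           # association used by the determining module
    first_string: str = "[DOC]"                # hypothetical first character string

    def extract(self, original_text: str) -> List[str]:
        # Screening module: find entities, keep those needing relation extraction.
        entities = self.find_entities(original_text)
        targets = [e for e in entities if self.needs_relation(e)]

        # Replacing module: swap each target for the shorter placeholder.
        new_text = original_text
        for target in sorted(set(targets), key=len, reverse=True):
            new_text = new_text.replace(target, self.first_string)

        # Labeling module: obtain a label sequence for the new text.
        labels = self.label_text(new_text)

        # Determining module: map entity types in start labels to relations.
        types = [label[2:] for label in labels if label.startswith("B-")]
        return [self.type_to_relation[t] for t in types if t in self.type_to_relation]
```

Passing the models in as callables keeps the sketch independent of any particular machine learning framework.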
15. The apparatus of claim 14, wherein the screening rule is: the official document entity is related to policy issuance and formulation;
the screening module is further configured to input the at least one official document entity into a pre-trained text classification model, the text classification model determining whether each official document entity requires official document relation extraction and outputting a classification result;
and to take the official document entities that the classification result indicates require official document relation extraction as the target official document entities.
16. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-13.
17. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-13.
CN202110756360.4A 2021-07-05 2021-07-05 Method and device for extracting official document relation Pending CN113486651A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110756360.4A CN113486651A (en) 2021-07-05 2021-07-05 Method and device for extracting official document relation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110756360.4A CN113486651A (en) 2021-07-05 2021-07-05 Method and device for extracting official document relation

Publications (1)

Publication Number Publication Date
CN113486651A (en) 2021-10-08

Family

ID=77939987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110756360.4A Pending CN113486651A (en) 2021-07-05 2021-07-05 Method and device for extracting official document relation

Country Status (1)

Country Link
CN (1) CN113486651A (en)

Similar Documents

Publication Publication Date Title
US20210157984A1 (en) Intelligent system that dynamically improves its knowledge and code-base for natural language understanding
US11734328B2 (en) Artificial intelligence based corpus enrichment for knowledge population and query response
CN109697291B (en) Text semantic paragraph recognition method and device
US10423649B2 (en) Natural question generation from query data using natural language processing system
CN106919542B (en) Rule matching method and device
CA3048356A1 (en) Unstructured data parsing for structured information
CN113064964A (en) Text classification method, model training method, device, equipment and storage medium
US20230206670A1 (en) Semantic representation of text in document
US11954173B2 (en) Data processing method, electronic device and computer program product
CN110008807B (en) Training method, device and equipment for contract content recognition model
CN114003725A (en) Information annotation model construction method and information annotation generation method
CN111046627A (en) Chinese character display method and system
CN112487138A (en) Information extraction method and device for formatted text
CN111783424A (en) Text clause dividing method and device
CN111555960A (en) Method for generating information
CN114880520B (en) Video title generation method, device, electronic equipment and medium
CN112100364A (en) Text semantic understanding method and model training method, device, equipment and medium
CN114528851B (en) Reply sentence determination method, reply sentence determination device, electronic equipment and storage medium
CN111708819B (en) Method, apparatus, electronic device, and storage medium for information processing
CN113486651A (en) Method and device for extracting official document relation
CN115455416A (en) Malicious code detection method and device, electronic equipment and storage medium
CN110457436B (en) Information labeling method and device, computer readable storage medium and electronic equipment
CN110083817B (en) Naming disambiguation method, device and computer readable storage medium
CN113515949A (en) Weakly supervised semantic entity recognition using general and target domain knowledge
Szegedi et al. Context-based Information Classification on Hungarian Invoices.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination