CN113158674A - Method for extracting key information of document in field of artificial intelligence

Info

Publication number
CN113158674A
Authority
CN
China
Prior art keywords
model, layer, text, input, subject
Prior art date
Legal status
Granted
Application number
CN202110353610.XA
Other languages
Chinese (zh)
Other versions
CN113158674B (en)
Inventor
曲晨帆
金连文
林上港
马骏
刘振鑫
谭濯
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202110353610.XA priority Critical patent/CN113158674B/en
Publication of CN113158674A publication Critical patent/CN113158674A/en
Application granted granted Critical
Publication of CN113158674B publication Critical patent/CN113158674B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Abstract

The invention discloses a method for extracting key information from documents in the field of artificial intelligence, comprising the following steps: S1, collecting document data in the artificial intelligence field and annotating it for key information extraction; S2, further pre-training the pre-trained RoBERTa model; S3, constructing an information extraction model; S4, initializing the parameters of the backbone network with the further pre-trained RoBERTa model; S5, training with the labeled data, applying random replacement and data augmentation to the labeled data during training, and computing the back-propagated error with a squared binary cross-entropy loss; S6, extracting information from unstructured text in the artificial intelligence field with the trained information extraction model to obtain result triples. By treating information extraction as a machine reading comprehension task and predicting the start and end positions of each piece of key information in the text, the method overcomes the sharp performance drop that sequence labeling models suffer on long-span knowledge text.

Description

Method for extracting key information of document in field of artificial intelligence
Technical Field
The invention belongs to the technical field of artificial intelligence natural language processing, and particularly relates to a method for extracting key information of a document in the field of artificial intelligence.
Background
Massive unstructured text documents in the field of artificial intelligence contain abundant knowledge. Structuring them would greatly enrich the ways in which people can acquire related knowledge and lower the difficulty of doing so. However, the traditional, largely manual structuring approach consumes considerable human resources and is inefficient, so it is not an optimal solution. In contrast, using machines for key information extraction and knowledge structuring is an efficient and economical approach.
At present, more and more key information extraction methods based on deep learning have been proposed, but certain shortcomings remain. Extraction methods based on sequence labeling suit texts with short spans, but struggle to produce complete results when subjects and objects span long stretches of text. The machine-reading-comprehension-based information extraction model HBT can alleviate this problem, but applying it directly is not effective. In addition, knowledge texts in natural science fields such as artificial intelligence contain many kinds of knowledge, and exhaustively enumerating all relationship types is unrealistic. Open information extraction can address this, but much of the existing research focuses on extraction within a single sentence, and most methods rely on syntactic analysis with rules predefined by human experts. In practice, the ways knowledge is expressed in the relevant texts vary widely and extraction must operate at the level of whole paragraphs, so defining rules with broad coverage and good extensibility is very difficult, while machine learning approaches face weak model generalization caused by the difficulty of the task and the scarcity of labeled data.
Disclosure of Invention
The main purpose of the invention is to overcome the shortcomings of the prior art and provide a method for extracting key information from documents in the field of artificial intelligence. By treating information extraction as a machine reading comprehension task and predicting the start and end positions of each piece of key information in the text, the method overcomes the sharp performance drop that sequence labeling models suffer on long-span knowledge text.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method for extracting document key information in the field of artificial intelligence comprises the following steps:
s1, collecting document data in the artificial intelligence field, and then performing key information extraction data annotation by using the collected data;
s2, performing further pre-training on the pre-training model RoBERTA in unstructured texts in the field of artificial intelligence;
s3, constructing an information extraction model;
s4, initializing backbone network parameters of the information extraction model by using the RoBERTA model obtained by further pre-training;
s5, training by using the marked data, carrying out random replacement and data enhancement on the marked data in the training process, and calculating the error of back propagation by using the square cross entropy loss;
and S6, extracting information in the unstructured text in the field of artificial intelligence by using the trained information extraction model to obtain result triples, and integrating the result triples.
Further, the step S1 specifically includes:
S11, collecting unstructured text paragraphs from scientific publications, papers, and popular-science web content related to the field of artificial intelligence, limiting each text paragraph to at most 510 characters;
s12, defining the type of the key information triple to be extracted, specifically:
the general relationship definition method is adopted to define 5 triple types:
entity - description content, entity - proposer name, entity - inclusion content, entity - application content, and entity - alias name;
defining 4 triple types by adopting a pseudo relation definition method:
entity attribute-pseudo relationship 1-entity, entity attribute-pseudo relationship 2-description content, entity attribute-pseudo relationship 3-application content, and entity attribute-pseudo relationship 4-inclusion content;
s13, labeling the defined triplet type, specifically:
open the text to be annotated in the open-source text annotation tool brat; with the mouse cursor, select a span of characters in the text as the starting entity (subject) of a triple and click to choose the subject's entity category in the pop-up selection window; select the ending entity (object) of the triple and its category in the same way; finally, generate a relation link by selecting the subject of the triple with the mouse and dragging it onto the object, and choose the category of the relation link in the pop-up selection window to complete the annotation of that triple. Repeat these steps until all triples in all texts to be annotated have been labeled.
Further, the RoBERTa model specifically comprises three Embedding layers with feature dimension 756, twelve Transformer layers with feature dimension 756, and a fully-connected layer whose number of input channels is 756 and whose number of output channels is the total number of character types in all training text data;
the three Embedding layers are a Token Embedding layer, a Position Embedding layer and a Segment Embedding layer respectively;
each of the three Embedding layers maps the text data input to the model into a feature vector of shape (number of input text segments) × 512 × 756; the sum of the three output feature vectors, again of shape (number of input text segments) × 512 × 756, serves as the overall output of the Embedding layers and as the input of the twelve Transformer layers of the RoBERTa model. The twelve Transformer layers output a feature vector of shape (number of input text segments) × 512 × 756, which is fed to the fully-connected layer. The output of the fully-connected layer is the model's probability prediction, for each character of each word replaced by the preset mark in the input text segment, over each character in the dictionary, where the dictionary is the set of all characters in all input training text data.
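For concreteness, the backbone described above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the patent's code: the class and parameter names are invented, and the 756 feature dimension simply follows the text.

```python
import torch
import torch.nn as nn

class RoBERTaBackbone(nn.Module):
    """Minimal sketch of the backbone described above; dimensions follow the text."""
    def __init__(self, vocab_size: int, hidden: int = 756, max_len: int = 512, n_layers: int = 12):
        super().__init__()
        # Three Embedding layers: Token, Position and Segment
        self.token_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(max_len, hidden)
        self.seg_emb = nn.Embedding(2, hidden)
        # Twelve Transformer layers with feature dimension 756 (756 = 12 heads x 63)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Fully-connected layer: 756 inputs, one output per character type in the dictionary
        self.lm_head = nn.Linear(hidden, vocab_size)

    def forward(self, token_ids: torch.Tensor, seg_ids: torch.Tensor) -> torch.Tensor:
        # token_ids, seg_ids: (num_text_segments, 512)
        pos_ids = torch.arange(token_ids.size(1), device=token_ids.device)
        # The three embedding outputs are summed: (num_text_segments, 512, 756)
        x = self.token_emb(token_ids) + self.pos_emb(pos_ids) + self.seg_emb(seg_ids)
        h = self.encoder(x)       # (num_text_segments, 512, 756)
        return self.lm_head(h)    # per-position scores over the character dictionary
```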
Further, the step S2 is specifically:
for the pre-trained RoBERTa model, first segment the training text with the jieba word segmentation tool, and initialize the RoBERTa model parameters to be trained with the pre-trained RoBERTa parameters. Then, in each iteration, based on the jieba segmentation result, randomly replace some of the segmented words with a preset mark, feed the processed result into the RoBERTa model, and have the model predict the words that were replaced by the mark.
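A minimal sketch of this whole-word masking step, assuming jieba and a [MASK]-style preset mark; the function name and the masking probability are illustrative, not from the patent.

```python
import random
import jieba

MASK = "[MASK]"  # the preset mark used in the embodiment

def whole_word_mask(text: str, mask_prob: float = 0.15):
    """Randomly replace whole jieba-segmented words with the preset mark.
    Returns the masked character sequence and the original one as the target."""
    target = list(text)
    masked = []
    for word in jieba.cut(text):
        if random.random() < mask_prob:
            masked.extend([MASK] * len(word))   # mask every character of the word
        else:
            masked.extend(list(word))
    return masked, target  # model input, prediction target

inp, tgt = whole_word_mask("卷积神经网络广泛应用于图像识别")
```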
Further, the constructing of the information extraction model specifically includes:
based on the RoBERTa model, a subject prediction module is added after the 10th Transformer layer of the RoBERTa model, a feature fusion module is added after the subject prediction module, and a predicate-object prediction module is added after the feature fusion module;
the subject prediction module specifically comprises a fully-connected layer with 756 input channels and 2 output channels, and a ReLU layer, a Dropout layer and a Sigmoid activation layer connected to it;
the feature fusion module specifically comprises a fully-connected layer with 1512 input channels and 756 output channels, a ReLU layer and a Dropout layer connected to it, and the last two Transformer layers of RoBERTa;
the predicate-object prediction module specifically comprises a fully-connected layer with 756 input channels and 2 × (total number of predicate categories) output channels, and a ReLU layer, a Dropout layer and a Sigmoid activation layer connected to it, as sketched below.
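The two prediction heads might be sketched as follows. The ordering of the ReLU/Dropout/Sigmoid layers after the fully-connected layer is one reading of the text, the dropout rate is an assumption, and the total number of predicate categories is taken as 9 here (5 general plus 4 pseudo-relation triple types from S12).

```python
import torch.nn as nn

HIDDEN, N_PRED = 756, 9  # feature dim from the text; 9 predicate categories assumed

class SubjectHead(nn.Module):
    """FC(756 -> 2) + ReLU + Dropout + Sigmoid: start/end probability per character."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(HIDDEN, 2), nn.ReLU(),
                                 nn.Dropout(0.1), nn.Sigmoid())
    def forward(self, h):            # h: (batch, 512, 756)
        return self.net(h)           # (batch, 512, 2): subject start/end probabilities

class PredicateObjectHead(nn.Module):
    """FC(756 -> 2 * n_predicates) + ReLU + Dropout + Sigmoid."""
    def __init__(self, n_pred: int = N_PRED):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(HIDDEN, 2 * n_pred), nn.ReLU(),
                                 nn.Dropout(0.1), nn.Sigmoid())
    def forward(self, h):            # h: (batch, 512, 756) fused with subject features
        return self.net(h)           # (batch, 512, 2 * n_pred): per-predicate start/end
```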
Further, the input of the subject prediction module is the feature vector of shape (number of input text segments) × 512 × 756 output by the 10th Transformer layer of the information extraction model, and its output, for each of the 512 character positions of the original input, is the probability that the position is the start of a subject and the probability that it is the end of a subject;
the feature fusion module fuses the semantic features of a selected subject into the feature vector that the 10th Transformer layer of the RoBERTa model outputs for the input text segment, yielding a feature vector fused with subject features. Its inputs are the output of the 10th Transformer layer, of shape (number of input text segments) × 512 × 756, and the label values of the selected subject's start and end positions, each of shape (number of input text segments) × 1; the selected subject is obtained at each iteration by dynamically and randomly picking one subject from all labeled subjects of each sample in the training text data input to the model;
during training, the feature fusion module first selects, according to the input start and end positions of the subject, the vectors at the corresponding positions of the feature vector output by the 10th Transformer layer of RoBERTa, obtaining two vectors of shape (number of input text segments) × 756; it copies each of them 512 times to obtain two vectors of shape (number of input text segments) × 512 × 756, concatenates them along the feature dimension into one vector of shape (number of input text segments) × 512 × 1512, and feeds the result into the fully-connected layer of the feature fusion module to obtain an output vector of shape (number of input text segments) × 512 × 756; this output is added to the feature vector output by the 10th Transformer layer and passed through the two Transformer layers of the feature fusion module to produce the output of the feature fusion module (see the sketch after this paragraph);
the input of the predicate-object prediction module is the subject-fused feature vector output by the feature fusion module, and its output is, for the selected subject and each predicate category, the probabilities that each character position of the text segment input to the information extraction model is the start character or the end character of the corresponding object.
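The gather-copy-concatenate flow of the feature fusion module might look as follows in PyTorch. Shapes follow the text; passing in the last two RoBERTa Transformer layers as a module, and all names, are assumptions.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Fuses the selected subject's start/end vectors into the 10th-layer features."""
    def __init__(self, last_two_layers: nn.Module, hidden: int = 756):
        super().__init__()
        # Fully-connected layer with 1512 input and 756 output channels, plus ReLU/Dropout
        self.fc = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Dropout(0.1))
        self.last_two = last_two_layers  # the last two Transformer layers of RoBERTa

    def forward(self, h10, start_idx, end_idx):
        # h10: (b, 512, 756); start_idx, end_idx: (b,) subject positions
        b, L, d = h10.shape
        rows = torch.arange(b, device=h10.device)
        start_vec = h10[rows, start_idx]                     # (b, 756)
        end_vec = h10[rows, end_idx]                         # (b, 756)
        # Copy each vector 512 times, then concatenate on the feature dimension
        tiled = torch.cat([start_vec.unsqueeze(1).expand(b, L, d),
                           end_vec.unsqueeze(1).expand(b, L, d)], dim=-1)  # (b, 512, 1512)
        fused = self.fc(tiled) + h10                         # residual add, (b, 512, 756)
        return self.last_two(fused)                          # through the last two Transformers
```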
Further, the initializing the backbone network parameters in step S4 specifically includes:
initializing each Embedding layer of the information extraction model with the corresponding Embedding layer of the further pre-trained RoBERTa model, and initializing each Transformer layer of the information extraction model with the corresponding Transformer layer of the further pre-trained RoBERTa model;
the initial parameters of the fully-connected layers in the subject prediction module, the feature fusion module and the predicate-object prediction module are randomly sampled from a normal distribution with mean 0 and variance 2 / (number of input channels of the layer).
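A normal distribution with mean 0 and variance 2 / n_in is the standard He (Kaiming) normal initialization; a one-function sketch:

```python
import math
import torch.nn as nn

def init_fc(layer: nn.Linear) -> None:
    """Sample weights from N(0, 2 / fan_in), i.e. He normal initialization."""
    std = math.sqrt(2.0 / layer.in_features)   # variance 2 / n_in  ->  std = sqrt(2 / n_in)
    nn.init.normal_(layer.weight, mean=0.0, std=std)
    nn.init.zeros_(layer.bias)
```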
Further, the step S5 specifically includes:
S51, before the labeled data is input into the information extraction model, applying random replacement and data augmentation to it, to improve the generalization performance of the model and reduce overfitting;
S52, training with the squared binary cross-entropy loss, specifically:
the subject prediction probabilities and the predicate-object prediction probabilities output by the Sigmoid activation layers are each squared before the binary cross-entropy loss is computed;
the squared binary cross-entropy loss simultaneously yields the error Ls for the subject prediction and the error Lpo for the predicate-object prediction, and the final back-propagated error is:
Loss = k1 × Ls + k2 × Lpo
where k1 and k2 are weighting coefficients chosen according to the actual situation, as in the sketch below;
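A sketch of the squared binary cross-entropy described above: the Sigmoid outputs are squared before the standard binary cross-entropy, and the two errors are combined with the weights k1 and k2 (the defaults of 1 here follow the embodiment).

```python
import torch
import torch.nn.functional as F

def squared_bce(prob: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Square the predicted probabilities before binary cross-entropy.
    prob and target must have the same shape, with values in [0, 1]."""
    return F.binary_cross_entropy(prob ** 2, target)

def total_loss(subj_prob, subj_target, po_prob, po_target, k1=1.0, k2=1.0):
    Ls = squared_bce(subj_prob, subj_target)    # subject prediction error
    Lpo = squared_bce(po_prob, po_target)       # predicate-object prediction error
    return k1 * Ls + k2 * Lpo                   # Loss = k1 x Ls + k2 x Lpo
```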
S53, performing fine-tuning training, during which the learning rate starts at 1e-6, is gradually increased to 5e-5, and is finally decayed gradually.
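One way to realize the warm-up-then-decay schedule (1e-6 rising to 5e-5, then decaying); the linear shape and the step counts are assumptions, since the text only fixes the two learning-rate values.

```python
import torch
import torch.nn as nn

def lr_at(step: int, warmup: int = 1000, total: int = 20000,
          lr_start: float = 1e-6, lr_peak: float = 5e-5) -> float:
    """Linear warm-up from 1e-6 to 5e-5, then linear decay (step counts assumed)."""
    if step < warmup:
        return lr_start + (lr_peak - lr_start) * step / warmup
    return lr_peak * max(0.0, (total - step) / (total - warmup))

model = nn.Linear(756, 2)  # placeholder for the information extraction model
opt = torch.optim.AdamW(model.parameters(), lr=1.0)  # base lr 1.0: the lambda is absolute
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lr_at)
```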
Further, in step S51, the random replacement specifically includes:
in each iteration of the training process, before the data is input into the information extraction model, an entity is randomly replaced with another entity with a certain probability; likewise, an entity attribute is replaced with another entity attribute, an application content with another application content, an inclusion content with another inclusion content, and a proposer with another proposer, each with a certain probability;
the data enhancement specifically comprises:
in each iteration of the training process, before the data is input into the information extraction model, a word in the description content is randomly replaced, inserted, or deleted, as sketched below.
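These two augmentations might be sketched as follows; the data structures (span dicts, class pools, a word vocabulary for replacements) are assumptions, since the patent does not specify them.

```python
import random

def random_inclass_swap(spans, pools, p=0.1):
    """With probability p, swap each annotated span for another of the same class.
    `spans` is a list of {"cls": ..., "text": ...} dicts; `pools` maps each class
    (entity, entity attribute, application content, ...) to its annotated strings."""
    for span in spans:
        if span["cls"] in pools and random.random() < p:
            span["text"] = random.choice(pools[span["cls"]])
    return spans

def perturb_description(words, vocab, p=0.1):
    """Randomly replace, insert or delete one word of a description content."""
    if words and random.random() < p:
        i = random.randrange(len(words))
        op = random.choice(["replace", "insert", "delete"])
        if op == "replace":
            words[i] = random.choice(vocab)
        elif op == "insert":
            words.insert(i, random.choice(vocab))
        elif len(words) > 1:
            del words[i]
    return words
```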
Further, the step S6 specifically includes:
S61, first, the text is input into the information extraction model to obtain, for each position in the text sequence, the predicted probabilities of being a subject start point or end point;
all positions whose predicted start probability exceeds 0.5 are taken as subject start positions, and all positions whose predicted end probability exceeds 0.5 are taken as predicted subject end positions;
for each predicted subject start position, the nearest predicted subject end position at or after it in the text is found and paired with it; for each start-end pair, the text content spanning those positions is taken as a predicted subject (a sketch of this pairing follows);
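The threshold-and-pair decoding of S61 can be written compactly; a sketch assuming per-position probability lists:

```python
def pair_spans(start_prob, end_prob, thr=0.5):
    """Pair each start position (prob > 0.5) with the nearest end position
    (prob > 0.5) at or after it, as described above."""
    starts = [i for i, p in enumerate(start_prob) if p > thr]
    ends = [i for i, p in enumerate(end_prob) if p > thr]
    spans = []
    for s in starts:
        later = [e for e in ends if e >= s]
        if later:
            spans.append((s, min(later)))   # (start, end) character positions
    return spans

# e.g. pair_spans([.9, .1, .8, .1], [.1, .7, .1, .9]) -> [(0, 1), (2, 3)]
```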
S62, the start and end positions of the n paired subjects are combined into one batch, giving an n × 2 vector;
meanwhile, for each paired subject, the 10th-layer Transformer output feature vector of the corresponding text is extracted, giving an expanded 10th-layer Transformer output feature vector of shape n × 512 × 756;
according to each subject's start and end positions, the contents at the corresponding positions of the expanded 10th-layer Transformer output feature vector are taken out, giving an n × 756 start vector and an n × 756 end vector; each is copied 512 times, giving two n × 512 × 756 vectors, which are concatenated along the feature dimension into an n × 512 × 1512 feature vector; this is passed through the fully-connected layer of the feature fusion module to obtain an n × 512 × 756 feature vector, which is added to the expanded 10th-layer Transformer feature vector and passed through the two Transformer layers of the feature fusion module to obtain the subject-fused feature vector;
S63, the subject-fused feature vector obtained in step S62 is input into the predicate-object prediction module to obtain its prediction results;
for every predicate category, the positions whose start-point prediction probability exceeds 0.5 are taken as that category's start positions, and the positions whose end-point prediction probability exceeds 0.5 are taken as that category's end positions;
for each start position of each predicate category, the nearest end position of the same predicate category at or after it in the text is found and paired with it; for each such start-end pair, the text content at the corresponding positions is taken as the object result;
S64, for each extracted subject-predicate-object triple whose subject is an entity attribute: first find the triple in which that entity attribute is the subject and an entity is the object, i.e. entity attribute - pseudo relation 1 - entity; then find all triples of that entity attribute whose object is not an entity, i.e. entity attribute - pseudo relation 2 - description content, entity attribute - pseudo relation 3 - application content, and entity attribute - pseudo relation 4 - inclusion content; after removing the pseudo relations, merge them, using the shared entity attribute as the predicate, into new triples of the form entity - entity attribute - content, thereby realizing open information extraction, as sketched below;
if an extracted triple does not have an entity attribute as its subject, it is taken directly as a result.
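The pseudo-relation consolidation of S64 reduces to a dictionary lookup and a rewrite; a sketch with illustrative relation-name strings:

```python
def merge_pseudo(triples):
    """Rewrite pseudo-relation triples (entity attribute as subject) into
    entity - entity attribute - content triples; relation names are illustrative."""
    # entity attribute -> its owning entity, from the "pseudo relation 1" triples
    owner = {s: o for s, p, o in triples if p == "pseudo_relation_1"}
    merged = []
    for s, p, o in triples:
        if p == "pseudo_relation_1":
            continue                                  # consumed by the merge
        if p.startswith("pseudo_relation"):
            if s in owner:                            # attribute with a known owner entity
                merged.append((owner[s], s, o))       # entity - attribute - content
        else:
            merged.append((s, p, o))                  # ordinary triples pass through
    return merged
```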
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The method treats information extraction as a machine reading comprehension task and predicts the start and end positions of each piece of key information in the text, overcoming the sharp performance drop that sequence labeling models suffer on long-span knowledge text. With this method, an unbounded number of relation types can be extracted, and closed and open information extraction are unified in the same framework, improving extraction accuracy.
2. The method exploits the strengths of the pre-trained model: it retains strong generalization even when labeled samples are scarce, and it can handle whole paragraphs and highly variable forms of knowledge expression.
3. The method improves on the HBT model by placing the subject prediction module and the feature fusion module at appropriate positions, so the model keeps an appropriate degree of feature sharing while improving performance, and at the same time counteracts the negative influence of span differences between subjects and objects, improving the overall performance of the information extraction model.
4. The method optimizes the model with the squared binary cross-entropy loss, which has an online hard-example-mining effect: it makes the model focus on choosing start and end points correctly, relieves the class imbalance caused by the large number of negative samples, and enlarges the classification margin of positive samples, further improving overall performance. The generalization ability of the model is improved by random in-class swapping of entities, entity attributes, application contents, inclusion contents, aliases and proposers, and by random edits to description contents.
Drawings
FIG. 1 is an overall flow diagram of the present invention;
FIG. 2 is a flow chart of the information extraction model training steps of the present invention;
FIG. 3 is a diagram of a subject prediction module of the information extraction model of the present invention;
FIG. 4 is a diagram of a feature fusion module of the information extraction model of the present invention;
FIG. 5 is a diagram of the predicate-object prediction module of the information extraction model of the present invention;
FIG. 6 is a flow diagram of information extraction model inference of the present invention;
FIG. 7 is a diagram of the pseudo-relationship knowledge consolidation method of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Examples
As shown in FIG. 1, the invention provides a method for extracting document key information in the field of artificial intelligence, which comprises the following steps:
S1, collecting document data in the field of artificial intelligence, and then annotating the collected data for key information extraction, which specifically comprises:
S11, collecting unstructured text paragraphs from scientific publications, papers, and popular-science web content related to the field of artificial intelligence, limiting each text paragraph to at most 510 characters;
s12, defining the type of the key information triple to be extracted, specifically:
the general relationship definition method is adopted to define 5 triple types:
entity - description content, entity - proposer name, entity - inclusion content, entity - application content, and entity - alias name;
defining 4 triple types by adopting a pseudo relation definition method:
entity attribute-pseudo relationship 1-entity, entity attribute-pseudo relationship 2-description content, entity attribute-pseudo relationship 3-application content, and entity attribute-pseudo relationship 4-inclusion content.
S13, labeling the defined triplet type, specifically:
open the text to be annotated in the open-source text annotation tool brat; with the mouse cursor, select a span of characters in the text as the starting entity (subject) of a triple and click to choose the subject's entity category in the pop-up selection window; select the ending entity (object) of the triple and its category in the same way; finally, generate a relation link by selecting the subject of the triple with the mouse and dragging it onto the object, and choose the category of the relation link in the pop-up selection window to complete the annotation of that triple. Repeat these steps until all triples in all texts to be annotated have been labeled.
S2, further pre-training the pre-trained RoBERTa model on unstructured text in the field of artificial intelligence in a self-supervised manner, specifically:
for the pre-trained RoBERTa model, first segment the training text with the jieba word segmentation tool, and initialize the RoBERTa model parameters to be trained with the pre-trained RoBERTa parameters. Then, in each iteration, based on the jieba segmentation result, randomly replace some of the segmented words with a preset mark, feed the processed result into the RoBERTa model, and have the model predict the words that were replaced by the mark. In this embodiment, the preset mark used is [MASK];
the RoBERTa model specifically comprises three Embedding layers with feature dimension 756, twelve Transformer layers with feature dimension 756, and a fully-connected layer whose number of input channels is 756 and whose number of output channels is the total number of character types in all training text data;
the three Embedding layers are a Token Embedding layer, a Position Embedding layer and a Segment Embedding layer respectively;
each of the three Embedding layers maps the text data input to the model into a feature vector of shape (number of input text segments) × 512 × 756; the sum of the three output feature vectors, again of shape (number of input text segments) × 512 × 756, serves as the overall output of the Embedding layers and as the input of the twelve Transformer layers of the RoBERTa model. The twelve Transformer layers output a feature vector of shape (number of input text segments) × 512 × 756, which is fed to the fully-connected layer; the output of the fully-connected layer is the model's probability prediction, for each character of each word replaced by the preset mark in the input text segment, over each character in the dictionary, where the dictionary is the set of all characters in all input training text data. In this embodiment, the preset mark used is [MASK].
S3, constructing an information extraction model, specifically:
based on the RoBERTa model, a subject prediction module is added after the 10th Transformer layer of the RoBERTa model, a feature fusion module is added after the subject prediction module, and a predicate-object prediction module is added after the feature fusion module;
as shown in fig. 3, the subject prediction module specifically comprises a fully-connected layer with 756 input channels and 2 output channels, and a ReLU layer, a Dropout layer and a Sigmoid activation layer connected to it;
as shown in fig. 4, the feature fusion module specifically comprises a fully-connected layer with 1512 input channels and 756 output channels, a ReLU layer and a Dropout layer connected to it, and the last two Transformer layers of RoBERTa;
as shown in fig. 5, the predicate-object prediction module specifically comprises a fully-connected layer with 756 input channels and 2 × (total number of predicate categories) output channels, and a ReLU layer, a Dropout layer and a Sigmoid activation layer connected to it.
In this embodiment, the subject prediction module uses the feature vector output by the 10th Transformer layer to predict, in parallel, the start-point and end-point probabilities of all subjects in the input text paragraph. Its input is the feature vector of shape (number of input text segments) × 512 × 756 output by the 10th Transformer layer of the information extraction model, and its output, for each of the 512 character positions of the original input, is the probability that the position is the start of a subject and the probability that it is the end of a subject;
the feature fusion module fuses the semantic features of a selected subject into the feature vector that the 10th Transformer layer of the RoBERTa model outputs for the input text segment, yielding a feature vector fused with subject features. Its inputs are the output of the 10th Transformer layer, of shape (number of input text segments) × 512 × 756, and the label values of the selected subject's start and end positions, each of shape (number of input text segments) × 1; the selected subject is obtained at each iteration by dynamically and randomly picking one subject from all labeled subjects of each sample in a batch;
during training, as shown in fig. 2, the feature fusion module first selects, according to the input start and end positions of the subject, the vectors at the corresponding positions of the feature vector output by the 10th Transformer layer of RoBERTa, obtaining two vectors of shape (number of input text segments) × 756; it copies each of them 512 times to obtain two vectors of shape (number of input text segments) × 512 × 756, concatenates them along the feature dimension into one vector of shape (number of input text segments) × 512 × 1512, and feeds the result into the fully-connected layer of the feature fusion module to obtain an output vector of shape (number of input text segments) × 512 × 756; this output is added to the feature vector output by the 10th Transformer layer and passed through the two Transformer layers of the feature fusion module to produce the output of the feature fusion module;
the input of the predicate-object prediction module is the subject-fused feature vector output by the feature fusion module, and its output is, for the selected subject and each predicate category, the probabilities that each character position of the text segment input to the information extraction model is the start character or the end character of the corresponding object.
S4, initializing the backbone network parameters with the further pre-trained RoBERTa model, specifically:
initializing each Embedding layer of the information extraction model with the corresponding Embedding layer of the further pre-trained RoBERTa model, and initializing each Transformer layer of the information extraction model with the corresponding Transformer layer of the further pre-trained RoBERTa model;
the initial parameters of the fully-connected layers in the subject prediction module, the feature fusion module and the predicate-object prediction module are randomly sampled from a normal distribution with mean 0 and variance 2 / (number of input channels of the layer).
S5, training with the labeled data, applying random replacement and data augmentation to the labeled data during training to improve generalization and reduce overfitting, and computing the back-propagated error with the squared binary cross-entropy loss, specifically:
S51, before the labeled data is input into the information extraction model, applying random replacement and data augmentation to it, to improve the generalization performance of the model and reduce overfitting;
S52, training with the squared binary cross-entropy loss, specifically:
the subject prediction probabilities and the predicate-object prediction probabilities output by the Sigmoid activation layers are each squared before the binary cross-entropy loss is computed;
the squared binary cross-entropy loss simultaneously yields the error Ls for the subject prediction and the error Lpo for the predicate-object prediction, and the final back-propagated error is:
Loss = k1 × Ls + k2 × Lpo
where k1 and k2 are chosen according to the actual situation; in this embodiment, both k1 and k2 are 1;
S53, performing fine-tuning training, during which the learning rate starts at 1e-6, is gradually increased to 5e-5, and is finally decayed gradually.
In this embodiment, the random replacement specifically includes:
in each iteration of the training process, before the data is input into the information extraction model, an entity is randomly replaced with another entity with a certain probability; likewise, an entity attribute is replaced with another entity attribute, an application content with another application content, an inclusion content with another inclusion content, and a proposer with another proposer, each with a certain probability;
the data enhancement specifically comprises:
in each iteration of the training process, before the data is input into the model, a word in the description content is randomly replaced, inserted, or deleted.
S6, extracting information from unstructured text in the field of artificial intelligence with the trained information extraction model to obtain result triples, and integrating the result triples, as shown in FIG. 6, specifically:
S61, first, the text is input into the information extraction model to obtain, for each position in the text sequence, the predicted probabilities of being a subject start point or end point;
all positions whose predicted start probability exceeds 0.5 are taken as subject start positions, and all positions whose predicted end probability exceeds 0.5 are taken as predicted subject end positions;
for each predicted subject start position, the nearest predicted subject end position at or after it in the text is found and paired with it; for each start-end pair, the text content spanning those positions is taken as a predicted subject;
S62, the start and end positions of the n paired subjects are combined into one batch, giving an n × 2 vector;
meanwhile, for each paired subject, the 10th-layer Transformer output feature vector of the corresponding text is extracted, giving an expanded 10th-layer Transformer output feature vector of shape n × 512 × 756;
according to each subject's start and end positions, the contents at the corresponding positions of the expanded 10th-layer Transformer output feature vector are taken out, giving an n × 756 start vector and an n × 756 end vector; each is copied 512 times, giving two n × 512 × 756 vectors, which are concatenated along the feature dimension into an n × 512 × 1512 feature vector; this is passed through the fully-connected layer of the feature fusion module to obtain an n × 512 × 756 feature vector, which is added to the expanded 10th-layer Transformer feature vector and passed through the two Transformer layers of the feature fusion module to obtain the subject-fused feature vector;
S63, the subject-fused feature vector obtained in step S62 is input into the predicate-object prediction module to obtain its prediction results;
for every predicate category, the positions whose start-point prediction probability exceeds 0.5 are taken as that category's start positions, and the positions whose end-point prediction probability exceeds 0.5 are taken as that category's end positions;
for each start position of each predicate category, the nearest end position of the same predicate category at or after it in the text is found and paired with it; for each such start-end pair, the text content at the corresponding positions is taken as the object result;
S64, as shown in fig. 7, for each extracted subject-predicate-object triple whose subject is an entity attribute: first find the triple in which that entity attribute is the subject and an entity is the object, i.e. entity attribute - pseudo relation 1 - entity; then find all triples of that entity attribute whose object is not an entity, i.e. entity attribute - pseudo relation 2 - description content, entity attribute - pseudo relation 3 - application content, and entity attribute - pseudo relation 4 - inclusion content; after removing the pseudo relations, merge them, using the shared entity attribute as the predicate, into new triples of the form entity - entity attribute - content, achieving the purpose of open information extraction;
if an extracted triple does not have an entity attribute as its subject, it is taken directly as a result.
It should also be noted that, in this specification, terms such as "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for extracting key information from documents in the field of artificial intelligence, characterized by comprising the following steps:
S1, collecting document data in the artificial intelligence field, and then annotating the collected data for key information extraction;
S2, further pre-training the pre-trained RoBERTa model on unstructured texts in the field of artificial intelligence;
S3, constructing an information extraction model;
S4, initializing the backbone network parameters of the information extraction model with the further pre-trained RoBERTa model;
S5, training with the labeled data, applying random replacement and data augmentation to the labeled data during training, and computing the back-propagated error with a squared binary cross-entropy loss;
S6, extracting information from unstructured text in the artificial intelligence field with the trained information extraction model to obtain result triples, and integrating the result triples.
2. The method for extracting key information of documents in the field of artificial intelligence according to claim 1, wherein said step S1 specifically comprises:
S11, collecting unstructured text paragraphs from scientific publications, papers, and popular-science web content related to the field of artificial intelligence, limiting each text paragraph to at most 510 characters;
s12, defining the type of the key information triple to be extracted, specifically:
the general relationship definition method is adopted to define 5 triple types:
entity - description content, entity - proposer name, entity - inclusion content, entity - application content, and entity - alias name;
defining 4 triple types by adopting a pseudo relation definition method:
entity attribute-pseudo relationship 1-entity, entity attribute-pseudo relationship 2-description content, entity attribute-pseudo relationship 3-application content, and entity attribute-pseudo relationship 4-inclusion content;
s13, labeling the defined triplet type, specifically:
open the text to be annotated in the open-source text annotation tool brat; with the mouse cursor, select a span of characters in the text as the starting entity (subject) of a triple and click to choose the subject's entity category in the pop-up selection window; select the ending entity (object) of the triple and its category in the same way; finally, generate a relation link by selecting the subject of the triple with the mouse and dragging it onto the object, and choose the category of the relation link in the pop-up selection window to complete the annotation of that triple. Repeat these steps until all triples in all texts to be annotated have been labeled.
3. The method for extracting key information from documents in the field of artificial intelligence according to claim 1, wherein the RoBERTa model specifically comprises three Embedding layers with feature dimension 756, twelve Transformer layers with feature dimension 756, and a fully-connected layer whose number of input channels is 756 and whose number of output channels is the total number of character types in all training text data;
the three Embedding layers are a Token Embedding layer, a Position Embedding layer and a Segment Embedding layer respectively;
each of the three Embedding layers maps the text data input to the model into a feature vector of shape (number of input text segments) × 512 × 756; the sum of the three output feature vectors, again of shape (number of input text segments) × 512 × 756, serves as the overall output of the Embedding layers and as the input of the twelve Transformer layers of the RoBERTa model. The twelve Transformer layers output a feature vector of shape (number of input text segments) × 512 × 756, which is fed to the fully-connected layer. The output of the fully-connected layer is the model's probability prediction, for each character of each word replaced by the preset mark in the input text segment, over each character in the dictionary, where the dictionary is the set of all characters in all input training text data.
4. The method for extracting key information of documents in the field of artificial intelligence according to claim 1, wherein said step S2 specifically comprises:
for the pre-trained RoBERTa model, first segment the training text with the jieba word segmentation tool, and initialize the RoBERTa model parameters to be trained with the pre-trained RoBERTa parameters. Then, in each iteration, based on the jieba segmentation result, randomly replace some of the segmented words with a preset mark, feed the processed result into the RoBERTa model, and have the model predict the words that were replaced by the mark.
5. The method for extracting the key information of the document in the field of artificial intelligence according to claim 3, wherein the constructing of the information extraction model specifically comprises:
based on the RoBERTa model, a subject prediction module is added after the 10th Transformer layer of the RoBERTa model, a feature fusion module is added after the subject prediction module, and a predicate-object prediction module is added after the feature fusion module;
the subject prediction module specifically comprises a fully-connected layer with 756 input channels and 2 output channels, and a ReLU layer, a Dropout layer and a Sigmoid activation layer connected to it;
the feature fusion module specifically comprises a fully-connected layer with 1512 input channels and 756 output channels, a ReLU layer and a Dropout layer connected to it, and the last two Transformer layers of RoBERTa;
the predicate-object prediction module specifically comprises a fully-connected layer with 756 input channels and 2 × (total number of predicate categories) output channels, and a ReLU layer, a Dropout layer and a Sigmoid activation layer connected to it.
6. The method for extracting key information from documents in the field of artificial intelligence according to claim 5, wherein the input of the subject prediction module is the feature vector of shape (number of input text segments) × 512 × 756 output by the 10th Transformer layer of the information extraction model, and its output, for each of the 512 character positions of the original input, is the probability that the position is the start of a subject and the probability that it is the end of a subject;
the feature fusion module fuses the semantic features of a selected subject into the feature vector that the 10th Transformer layer of the RoBERTa model outputs for the input text segment, yielding a feature vector fused with subject features. Its inputs are the output of the 10th Transformer layer, of shape (number of input text segments) × 512 × 756, and the label values of the selected subject's start and end positions, each of shape (number of input text segments) × 1; the selected subject is obtained at each iteration by dynamically and randomly picking one subject from all labeled subjects of each sample in the training text data input to the model;
during training, the feature fusion module first selects, according to the input start and end positions of the subject, the vectors at the corresponding positions of the feature vector output by the 10th Transformer layer of RoBERTa, obtaining two vectors of shape (number of input text segments) × 756; it copies each of them 512 times to obtain two vectors of shape (number of input text segments) × 512 × 756, concatenates them along the feature dimension into one vector of shape (number of input text segments) × 512 × 1512, and feeds the result into the fully-connected layer of the feature fusion module to obtain an output vector of shape (number of input text segments) × 512 × 756; this output is added to the feature vector output by the 10th Transformer layer and passed through the two Transformer layers of the feature fusion module to produce the output of the feature fusion module;
the input of the predicate-object prediction module is the subject-fused feature vector output by the feature fusion module, and its output is, for the selected subject and each predicate category, the probabilities that each character position of the text segment input to the information extraction model is the start character or the end character of the corresponding object.
7. The method for extracting key information from documents in the field of artificial intelligence according to claim 5, wherein the initializing of backbone network parameters in step S4 specifically comprises:
initializing each Embedding layer of the information extraction model with the corresponding Embedding layer of the further pre-trained RoBERTa model, and initializing each Transformer layer of the information extraction model with the corresponding Transformer layer of the further pre-trained RoBERTa model;
the initial parameters of the fully-connected layers in the subject prediction module, the feature fusion module and the predicate-object prediction module are randomly sampled from a normal distribution with mean 0 and variance 2 / (number of input channels of the layer).
8. The method for extracting key information of documents in the field of artificial intelligence according to claim 6, wherein said step S5 specifically comprises:
S51, before the labeled data is input into the information extraction model, applying random replacement and data augmentation to it, to improve the generalization performance of the model and reduce overfitting;
S52, training with the squared binary cross-entropy loss, specifically:
the subject prediction probabilities and the predicate-object prediction probabilities output by the Sigmoid activation layers are each squared before the binary cross-entropy loss is computed;
the squared binary cross-entropy loss simultaneously yields the error Ls for the subject prediction and the error Lpo for the predicate-object prediction, and the final back-propagated error is:
Loss = k1 × Ls + k2 × Lpo
where k1 and k2 are chosen according to the actual situation;
S53, performing fine-tuning training, during which the learning rate starts at 1e-6, is gradually increased to 5e-5, and is finally decayed gradually.
9. The method for extracting key information of documents in the field of artificial intelligence according to claim 8, wherein in said step S51, the random replacement specifically comprises:
in each iteration of the training process, before the data is input into the information extraction model, an entity is randomly replaced with another entity with a certain probability; likewise, an entity attribute is replaced with another entity attribute, an application content with another application content, an inclusion content with another inclusion content, and a proposer with another proposer, each with a certain probability;
the data enhancement specifically comprises:
in each iteration of the training process, before the data is input into the information extraction model, a word in the description content is randomly replaced, inserted, or deleted.
10. The method for extracting key information from a document in the field of artificial intelligence according to claim 2 or 6, wherein the step S6 specifically includes:
S61, first, the text is input into the information extraction model to obtain, for each position in the text sequence, the predicted probabilities of being a subject start point or end point;
all positions whose predicted start probability exceeds 0.5 are taken as subject start positions, and all positions whose predicted end probability exceeds 0.5 are taken as predicted subject end positions;
for each predicted subject start position, the nearest predicted subject end position at or after it in the text is found and paired with it; for each start-end pair, the text content spanning those positions is taken as a predicted subject;
S62, the start and end positions of the n paired subjects are combined into one batch, giving an n × 2 vector;
meanwhile, for each paired subject, the 10th-layer Transformer output feature vector of the corresponding text is extracted, giving an expanded 10th-layer Transformer output feature vector of shape n × 512 × 756;
according to each subject's start and end positions, the contents at the corresponding positions of the expanded 10th-layer Transformer output feature vector are taken out, giving an n × 756 start vector and an n × 756 end vector; each is copied 512 times, giving two n × 512 × 756 vectors, which are concatenated along the feature dimension into an n × 512 × 1512 feature vector; this is passed through the fully-connected layer of the feature fusion module to obtain an n × 512 × 756 feature vector, which is added to the expanded 10th-layer Transformer feature vector and passed through the two Transformer layers of the feature fusion module to obtain the subject-fused feature vector;
S63, inputting the subject-fused feature vector obtained in step S62 into the predicate-object prediction module to obtain the prediction results;
taking all positions whose predicted category starting-point probability exceeds 0.5 as category starting positions, and all positions whose predicted category end-point probability exceeds 0.5 as category end positions;
for each predicted starting position of a predicate category, finding the nearest predicted end position of the same predicate category at or after it in the text and pairing the two; for each paired starting and end position of a predicate category, taking the content at the corresponding positions in the text as the object result;
S64, for each extracted subject-predicate-object triple: if an entity attribute serves as the subject, first finding the triple whose object is the entity corresponding to that entity attribute, namely entity attribute - pseudo-relation 1 - entity; then finding all triples of that entity attribute whose object is not an entity, namely entity attribute - pseudo-relation 2 - description content, entity attribute - pseudo-relation 3 - application content, and entity attribute - pseudo-relation 4 - contained content; then, after removing the pseudo-relations, combining the triples sharing the entity attribute into new triples with the entity attribute as the predicate, namely entity - entity attribute - content, finally realizing open information extraction;
and if an extracted triple does not take an entity attribute as the subject, taking the extracted triple directly as a result.
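For illustration, the threshold-and-pair span decoding used in steps S61 and S63 can be sketched as follows (variable names are assumptions, not from the patent):

    def decode_spans(start_prob, end_prob, threshold=0.5):
        # Positions whose start/end probability exceeds 0.5 are taken
        # as candidate boundaries; each start point is paired with the
        # nearest end point at or after it, as in steps S61 and S63.
        starts = [i for i, p in enumerate(start_prob) if p > threshold]
        ends = [i for i, p in enumerate(end_prob) if p > threshold]
        spans = []
        for s in starts:
            later = [e for e in ends if e >= s]
            if later:
                spans.append((s, min(later)))
        return spans

    # Each (s, e) pair maps back to text[s:e + 1] as a predicted span.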
CN202110353610.XA 2021-04-01 2021-04-01 Method for extracting key information of documents in artificial intelligence field Active CN113158674B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110353610.XA CN113158674B (en) 2021-04-01 2021-04-01 Method for extracting key information of documents in artificial intelligence field

Publications (2)

Publication Number Publication Date
CN113158674A true CN113158674A (en) 2021-07-23
CN113158674B (en) 2023-07-25

Family

ID=76886346

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110353610.XA Active CN113158674B (en) 2021-04-01 2021-04-01 Method for extracting key information of documents in artificial intelligence field

Country Status (1)

Country Link
CN (1) CN113158674B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200380211A1 (en) * 2019-05-31 2020-12-03 Baidu Online Network Technology (Beijing) Co., Ltd. Method, apparatus, computer device and readable medium for knowledge hierarchical extraction of a text
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHONG Z. et al.: "DeepText: A Unified Framework for Text Proposal Generation and Text Detection in Natural Images", IEEE International Conference on Acoustics, pages 1208-1212 *
ZHAO Wanpeng et al.: "Handwritten Digit Recognition Based on Adaboost", Computer Applications, vol. 25, no. 10, pages 576-589 *
JIN Lianwen et al.: "A Survey of Deep Learning Applications in Handwritten Chinese Character Recognition", Acta Automatica Sinica, vol. 42, no. 8, pages 1125-1141 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114297987A (en) * 2022-03-09 2022-04-08 杭州实在智能科技有限公司 Document information extraction method and system based on text classification and reading understanding
CN114297987B (en) * 2022-03-09 2022-07-19 杭州实在智能科技有限公司 Document information extraction method and system based on text classification and reading understanding
CN114861629A (en) * 2022-04-29 2022-08-05 电子科技大学 Automatic judgment method for text style
CN114861629B (en) * 2022-04-29 2023-04-04 电子科技大学 Automatic judgment method for text style
CN116720502A (en) * 2023-06-20 2023-09-08 中国航空综合技术研究所 Aviation document information extraction method based on machine reading understanding and template rules
CN116720502B (en) * 2023-06-20 2024-04-05 中国航空综合技术研究所 Aviation document information extraction method based on machine reading understanding and template rules

Also Published As

Publication number Publication date
CN113158674B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
CN109918671B (en) Electronic medical record entity relation extraction method based on convolution cyclic neural network
CN108090070B (en) Chinese entity attribute extraction method
CN113158674A (en) Method for extracting key information of document in field of artificial intelligence
CN110750635B (en) French recommendation method based on joint deep learning model
CN112541337B (en) Document template automatic generation method and system based on recurrent neural network language model
CN113168499A (en) Method for searching patent document
CN108073576A (en) Intelligent search method, searcher and search engine system
CN113196277A (en) System for retrieving natural language documents
CN113515632B (en) Text classification method based on graph path knowledge extraction
CN114419642A (en) Method, device and system for extracting key value pair information in document image
Rahman Understanding the logical and semantic structure of large documents
CN111651569B (en) Knowledge base question-answering method and system in electric power field
CN113868380A (en) Few-sample intention identification method and device
CN113312922A (en) Improved chapter-level triple information extraction method
CN113361252B (en) Text depression tendency detection system based on multi-modal features and emotion dictionary
CN110110137A (en) A kind of method, apparatus, electronic equipment and the storage medium of determining musical features
CN110020024B (en) Method, system and equipment for classifying link resources in scientific and technological literature
Wang Research on the art value and application of art creation based on the emotion analysis of art
CN114117069A (en) Semantic understanding method and system for intelligent knowledge graph question answering
Zhao et al. POS-ATAEPE-BiLSTM: an aspect-based sentiment analysis algorithm considering part-of-speech embedding
CN113869049A (en) Fact extraction method and device with legal attribute based on legal consultation problem
CN113516202A (en) Webpage accurate classification method for CBL feature extraction and denoising
Hicham et al. Enhancing Arabic E-Commerce Review Sentiment Analysis Using a hybrid Deep Learning Model and FastText word embedding
Phan et al. Sentence-level sentiment analysis using gcn on contextualized word representations
CN112948544B (en) Book retrieval method based on deep learning and quality influence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant