CN113158674B - Method for extracting key information of documents in artificial intelligence field - Google Patents

Method for extracting key information of documents in artificial intelligence field

Info

Publication number
CN113158674B
CN113158674B (application CN202110353610.XA)
Authority
CN
China
Prior art keywords
model
layer
subject
prediction
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110353610.XA
Other languages
Chinese (zh)
Other versions
CN113158674A
Inventor
曲晨帆
金连文
林上港
马骏
刘振鑫
谭濯
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology (SCUT)
Priority to CN202110353610.XA
Publication of CN113158674A
Application granted
Publication of CN113158674B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for extracting key information from documents in the field of artificial intelligence, comprising the following steps: S1, collecting document data in the artificial intelligence field and annotating it for key information extraction; S2, further pre-training the pre-trained RoBERTa model; S3, constructing an information extraction model; S4, initializing the backbone network parameters with the further pre-trained RoBERTa model; S5, training with the annotated data, applying random-replacement data augmentation during training, and computing the back-propagated error with a squared binary cross-entropy loss; S6, performing information extraction on unstructured text in the artificial intelligence field with the trained information extraction model to obtain result triplets. The method casts information extraction as a machine reading comprehension task and predicts the start point and end point of each piece of key information in the text, which solves the severe performance degradation that sequence labeling models suffer on knowledge text with long spans.

Description

Method for extracting key information of documents in artificial intelligence field
Technical Field
The invention belongs to the technical field of artificial intelligence natural language processing, and particularly relates to a method for extracting document key information in the artificial intelligence field.
Background
The massive unstructured text documents in the field of artificial intelligence contain rich knowledge. Structuring them would greatly broaden the ways people can acquire related knowledge and lower the difficulty of doing so. However, the traditional manually driven structuring approach consumes significant human resources and is inefficient, so it is not the best option. In contrast, using machines for key information extraction and knowledge structuring is an efficient and economical method.
At present, more and more key information extraction methods based on deep learning have been proposed, but they still have shortcomings. Methods based on sequence labeling suit short text spans but struggle to produce complete results for subjects and objects with long text spans. The machine-reading-comprehension-based information extraction model HBT can alleviate this problem, but performs poorly when applied directly. In addition, knowledge text in natural-science fields such as artificial intelligence contains many knowledge types, and it is unrealistic to cover all relation types by enumeration. Open information extraction can address this, but existing research focuses on open extraction within a single sentence, and most methods rely on syntactic analysis with rules predefined by human experts. In practice, knowledge is expressed in highly varied ways and must be extracted from whole paragraphs, so defining rules with wide coverage and good extensibility is very difficult; extracting with machine learning instead faces insufficient model generalization caused by high learning difficulty and scarce labeled data.
Disclosure of Invention
The invention mainly aims to overcome the defects and shortcomings of the prior art and provides a method for extracting key information from documents in the artificial intelligence field. It casts information extraction as a machine reading comprehension task and predicts the start point and end point of each piece of key information in the text, thereby solving the severe performance degradation that sequence labeling models suffer on knowledge text with long spans.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a method for extracting key information of a document in the artificial intelligence field comprises the following steps:
s1, collecting document data in the artificial intelligence field, and then carrying out key information extraction data annotation by utilizing the collected data;
s2, performing further pre-training on a pre-training model RoBERTa in unstructured text in the field of artificial intelligence;
s3, constructing an information extraction model;
s4, initializing backbone network parameters of the information extraction model by using the RoBERTa model obtained by further pre-training;
S5, training with the annotated data, applying random replacement and data augmentation to the annotated data during training, and computing the back-propagated error with a squared binary cross-entropy loss;
S6, extracting information from unstructured text in the artificial intelligence field with the trained information extraction model to obtain result triplets, and integrating the result triplets.
Further, the step S1 specifically includes:
s11, collecting unstructured text paragraphs derived from scientific publications, documents and network science popularization knowledge related to the artificial intelligence field, and limiting the length of the text paragraphs to be within 510 characters;
s12, defining the type of the key information triplet to be extracted, specifically:
defining 5 triplet types by adopting a common relation definition method:
entity-description-descriptive content, entity-proposer-proposer name, entity-contains-contained content, entity-application-application content, and entity-alias-alias name;
defining 4 triplet types by adopting a pseudo relation definition method:
entity attribute-pseudo relationship 1-entity, entity attribute-pseudo relationship 2-descriptive content, entity attribute-pseudo relationship 3-application content, and entity attribute-pseudo relationship 4-inclusion content;
s13, marking the defined triplet types, specifically:
Open the text to be annotated in the open-source text annotation tool brat; with the mouse cursor, select a span of characters in the text as the starting entity (subject) of a triplet, and click the subject's entity category in the pop-up selection window; select the ending entity (object) of the triplet and its category in the same way; finally, generate a relation link between the two by selecting the subject with the mouse and dragging it onto the object, and choose the category of the relation link in the pop-up selection window to complete the annotation of that triplet. Repeat these steps until all triplets in all texts to be annotated are labeled.
Further, the RoBERTa model specifically comprises three Embedding layers with feature dimension 756, twelve Transformer layers with feature dimension 756, and one fully connected layer whose input channel count is 756 and whose output channel count equals the total number of character types in all the training text data;
the three Embedding layers are respectively a Token Embedding layer, a Position Embedding layer and a Segment Embedding layer;
The three Embedding layers each map the text data input to the model into a feature vector of shape (number of input text segments) × 512 × 756; the three output feature vectors are summed to obtain a feature vector of the same shape, which serves as the overall output of the Embedding layers and as the input to the twelve Transformer layers of the RoBERTa model. The twelve Transformer layers output a feature vector of shape (number of input text segments) × 512 × 756, which serves as the input to the fully connected layer. The fully connected layer outputs, for each character of each word replaced by the preset marker symbol in the input text segment, the predicted probability of its being each character in the dictionary, where the dictionary is the set of all characters in all input training text segment data.
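The embedding stage described above can be sketched in a few lines. This is an illustrative NumPy mock of the shapes only, not the patented implementation: the vocabulary size and random values are made up, and the twelve Transformer layers are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, SEQ_LEN, DIM = 1000, 512, 756  # VOCAB is an assumed toy value

# The three embedding tables: token, position, and segment.
token_emb = rng.normal(size=(VOCAB, DIM))
position_emb = rng.normal(size=(SEQ_LEN, DIM))
segment_emb = rng.normal(size=(2, DIM))

def embed(token_ids, segment_ids):
    """Sum the three embeddings; the result feeds the Transformer stack."""
    return (token_emb[token_ids]      # (batch, 512, 756)
            + position_emb            # broadcast over the batch dimension
            + segment_emb[segment_ids])

tokens = rng.integers(0, VOCAB, size=(3, SEQ_LEN))  # 3 input text segments
segments = np.zeros((3, SEQ_LEN), dtype=int)
features = embed(tokens, segments)
print(features.shape)  # (3, 512, 756)
```

The summed tensor has exactly the (number of text segments) × 512 × 756 shape the text specifies.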
Further, the step S2 specifically includes:
First, the training texts are segmented into words with the jieba word segmentation tool, and the RoBERTa model to be trained is initialized with the published pre-trained RoBERTa parameters. Then, based on the jieba segmentation result, part of the words are randomly replaced with a preset marker in each iteration; the processed result is input into the RoBERTa model, which is trained to predict the words that were replaced by the marker.
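The masking step above amounts to whole-word masking over the segmentation output. A minimal sketch, assuming the word list has already been produced (e.g. by jieba); the function name and masking probability are illustrative, not taken from the patent:

```python
import random

MASK = "[MASK]"

def whole_word_mask(words, mask_prob=0.15, rng=None):
    """Replace whole words with the preset marker, one mask token per
    character, and record the words the model must predict."""
    rng = rng or random.Random()
    tokens, targets = [], []
    for word in words:
        if rng.random() < mask_prob:
            tokens.extend([MASK] * len(word))  # mask every character of the word
            targets.append(word)               # the model must recover the word
        else:
            tokens.extend(list(word))          # keep the word, character by character
    return tokens, targets

# With this fixed seed and mask_prob=0.5, only the third word is masked.
tokens, targets = whole_word_mask(["神经", "网络", "模型"], mask_prob=0.5,
                                  rng=random.Random(0))
print(tokens, targets)
```

Masking all characters of a word at once forces the model to use cross-word context, which is the point of segmenting first.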
Further, the construction information extraction model specifically includes:
based on the RoBERTa model, a subject prediction module is added after the 10th Transformer layer of the RoBERTa model, a feature fusion module is added after the subject prediction module, and a predicate-object prediction module is added after the feature fusion module;
the subject prediction module specifically comprises a fully connected layer with 756 input channels and 2 output channels, together with a ReLU layer, a Dropout layer and a Sigmoid activation layer connected to the fully connected layer;
the feature fusion module specifically comprises a fully connected layer with 1512 input channels and 756 output channels, together with a ReLU layer, a Dropout layer and the last two Transformer layers of RoBERTa connected to the fully connected layer;
the predicate-object prediction module specifically comprises a fully connected layer with 756 input channels and 2 × (total number of predicate categories) output channels, together with a ReLU layer, a Dropout layer and a Sigmoid activation layer connected to the fully connected layer.
Further, the input of the subject prediction module is the feature vector of shape (number of input text segments) × 512 × 756 output by the 10th Transformer layer of the information extraction model, and its output is, for each of the 512 character positions of the original input, the predicted probability that the position is the start point of a subject and the predicted probability that it is the end point of a subject;
the feature fusion module semantically fuses the features of a subject into the feature vector output by the 10th Transformer layer of the RoBERTa model for the text segment input to the information extraction model, yielding a feature vector with fused subject features. Its inputs are the feature vector of shape (number of input text segments) × 512 × 756 output by the 10th Transformer layer, and the annotated start-point position and end-point position of the selected subject, each of shape (number of input text segments) × 1. The selected subject is obtained, at each iteration, by dynamically and randomly choosing one of all annotated subjects of each sample in the training text data input to the model;
during training, the feature fusion module first selects, according to the input subject start and end points, the vectors at the corresponding positions of the feature vector output by the 10th Transformer layer, obtaining two vectors of shape (number of text segments) × 756. Each is replicated 512 times to give two vectors of shape (number of input text segments) × 512 × 756, which are concatenated along the feature dimension into a vector of shape (number of input text segments) × 512 × 1512. This result is fed into the fully connected layer of the feature fusion module, producing an output of shape (number of input text segments) × 512 × 756, which is added to the feature vector output by the 10th Transformer layer; after passing through the two Transformer layers of the feature fusion module, this sum becomes the output of the feature fusion module;
the input of the predicate-object prediction module is the subject-fused feature vector output by the feature fusion module, and its output is, for each predicate category, the predicted probability that each character position of the input text segment is the start point of the object corresponding to the selected subject and that predicate, and the predicted probability that it is the end point.
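The subject prediction head is small enough to sketch end to end. Below is a NumPy mock with toy, untrained parameters; only the shapes and the layer order stated above (FC, then ReLU, Dropout, Sigmoid) are taken from the text, everything else is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 756

# Toy FC parameters: 756 input channels -> 2 output channels (start, end).
W = rng.normal(scale=0.02, size=(DIM, 2))
b = np.zeros(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def subject_head(features, train=False, drop_prob=0.1):
    """Per-position start/end probabilities for subjects."""
    x = np.maximum(features @ W + b, 0.0)   # ReLU after the FC layer
    if train:                               # inverted dropout, training only
        keep = rng.random(x.shape) >= drop_prob
        x = x * keep / (1.0 - drop_prob)
    return sigmoid(x)                       # shape (batch, 512, 2)

feats = rng.normal(size=(2, 512, DIM))      # stand-in 10th-layer output
probs = subject_head(feats)
print(probs.shape)
```

Note that with the ReLU placed before the Sigmoid, as the text specifies, every output probability is at least 0.5.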
Further, in the step S4, the initializing of the backbone network parameter is specifically:
the Embedding layers of the information extraction model are initialized with the parameters of the corresponding Embedding layers of the RoBERTa model obtained by further pre-training, and its Transformer layers are initialized with the parameters of the corresponding Transformer layers of that RoBERTa model;
the initial parameters of the fully connected layers in the subject prediction module, the feature fusion module and the predicate-object prediction module are randomly sampled from a normal distribution with mean 0 and variance 2 divided by the layer's input channel count.
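This is the He-style normal initialization; a one-function sketch (the function name is ours, and reading the text as variance 2 / fan_in is our interpretation):

```python
import numpy as np

def init_fc(fan_in, fan_out, rng):
    """Sample FC weights from N(0, 2 / fan_in), i.e. mean 0 and variance
    equal to 2 divided by the layer's input channel count."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

rng = np.random.default_rng(0)
w = init_fc(756, 2, rng)   # the subject-head FC layer: 756 -> 2
print(w.shape)
```

The sample variance of the drawn weights lands close to 2/756 ≈ 0.00265.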
Further, the step S5 specifically includes:
S51, applying random replacement and data augmentation to the annotated data before it is input into the information extraction model, to improve the generalization performance of the model and reduce overfitting;
S52, training with a squared binary cross-entropy loss, specifically:
before computing the binary cross-entropy loss, the subject prediction probabilities and the predicate-object prediction probabilities output by the Sigmoid activation layers are each squared;
the error Ls of the subject prediction and the error Lpo of the predicate-object prediction are computed simultaneously with the squared binary cross-entropy loss, and the final back-propagated error is:
Loss=k1×Ls+k2×Lpo
Wherein k1 and k2 are selected according to actual conditions;
S53, performing fine-tuning training: the learning rate starts at 1e-6, is gradually increased to 5e-5, and is finally gradually decayed.
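The squared loss of step S52 can be written compactly. A sketch under our reading of the text (square the Sigmoid output, then apply ordinary binary cross-entropy); the k values and probabilities below are illustrative, not from the patent:

```python
import numpy as np

def squared_bce(p, y, eps=1e-7):
    """Binary cross-entropy computed on the squared probabilities."""
    q = np.clip(p ** 2, eps, 1.0 - eps)   # square before the usual BCE
    return float(-np.mean(y * np.log(q) + (1.0 - y) * np.log(1.0 - q)))

# Combined back-propagated error: Loss = k1*Ls + k2*Lpo.
k1, k2 = 1.0, 1.0
Ls = squared_bce(np.array([0.9, 0.2]), np.array([1.0, 0.0]))
Lpo = squared_bce(np.array([0.7, 0.1]), np.array([1.0, 0.0]))
loss = k1 * Ls + k2 * Lpo
print(loss > 0.0)
```

Squaring shrinks every probability below 1, so even a confident positive prediction (p = 0.9 gives q = 0.81) still incurs noticeable loss, which is what enlarges the positive-class margin and emphasizes hard examples.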
Further, in the step S51, the random substitution is specifically:
before the data is input into the information extraction model in each iteration of training, each entity is replaced with another entity with a certain probability; likewise, each entity attribute is replaced with another entity attribute, each application content with another application content, each contained content with another contained content, and each proposer with another proposer, all with certain probabilities;
the data enhancement is specifically as follows:
in each iteration of training, words in the descriptive content are randomly replaced, inserted or deleted before the data is input into the information extraction model.
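A minimal sketch of the same-type replacement augmentation: each labeled span is swapped, with some probability, for another annotated span of the same type. The function name, data layout and probability are illustrative, not the patented implementation:

```python
import random

def swap_augment(text, spans, pools, swap_prob, rng):
    """Replace each (start, end, type) span with a random same-type span
    from `pools` with probability swap_prob. Spans are processed right to
    left so earlier offsets stay valid when lengths change."""
    for start, end, typ in sorted(spans, key=lambda s: -s[0]):
        if typ in pools and rng.random() < swap_prob:
            text = text[:start] + rng.choice(pools[typ]) + text[end:]
    return text

pools = {"entity": ["convolutional neural network"]}
out = swap_augment("RNN is a sequence model", [(0, 3, "entity")],
                   pools, swap_prob=1.0, rng=random.Random(0))
print(out)
```

The right-to-left ordering is the one design choice worth noting: replacing from the end of the text keeps the start offsets of untouched spans correct.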
Further, the step S6 specifically includes:
S61, first input the text into the information extraction model to obtain, for each position of the text sequence, the predicted probabilities that it is the start point or the end point of a subject;
take all positions whose predicted start-point probability exceeds 0.5 as subject start points, and all positions whose predicted end-point probability exceeds 0.5 as subject end points;
for each predicted subject start point, find the nearest predicted subject end point that is not earlier in the text and pair them; for each start/end pair, the content at the corresponding positions of the text is taken as a predicted subject;
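The decoding rule of step S61 pairs each start with the nearest end at or after it. A plain-Python sketch (function name ours):

```python
def pair_spans(start_probs, end_probs, thresh=0.5):
    """Positions with start (end) probability above the threshold are
    candidate start (end) points; each start is paired with the nearest
    end point not earlier in the text."""
    starts = [i for i, p in enumerate(start_probs) if p > thresh]
    ends = [i for i, p in enumerate(end_probs) if p > thresh]
    spans = []
    for s in starts:
        later = [e for e in ends if e >= s]   # ends at or after this start
        if later:
            spans.append((s, min(later)))     # nearest such end
    return spans

# Two subjects in a toy 5-character text: positions 0-1 and 2-4.
spans = pair_spans([0.9, 0.1, 0.8, 0.1, 0.1],
                   [0.1, 0.9, 0.1, 0.1, 0.7])
print(spans)  # [(0, 1), (2, 4)]
```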
S62, collect the n matched subject start points and the n matched subject end points into one batch, obtaining an n × 2 vector;
meanwhile, the feature vector output by the 10th Transformer layer for the corresponding text is replicated for each subject pair, giving an expanded 10th-layer output feature vector of shape n × 512 × 756;
according to each subject's start and end point, the contents at the corresponding positions of the expanded 10th-layer output feature vector are taken out, giving an n × 756 start-point vector and an n × 756 end-point vector; each is replicated 512 times to give two vectors of shape n × 512 × 756, which are concatenated along the feature dimension into an n × 512 × 1512 feature vector. After the fully connected layer of the feature fusion module this becomes an n × 512 × 756 feature vector, which is added to the expanded 10th-layer feature vector; after the two Transformer layers of the feature fusion module, the subject-fused feature vector is obtained;
S63, input the subject-fused feature vector obtained in step S62 into the predicate-object prediction module to obtain its prediction results;
for every predicate category, take the positions whose start-point probability exceeds 0.5 as object start points of that category, and the positions whose end-point probability exceeds 0.5 as object end points of that category;
for each predicted object start point of a predicate category, find the nearest predicted end point of the same category that is not earlier in the text and pair them; for each start/end pair, the content at the corresponding positions of the text is taken as an object result;
S64, for the extracted subject-predicate-object triplets: if a triplet takes an entity attribute as its subject, first find the triplet in which that entity attribute points to its owning entity, namely entity attribute-pseudo relation 1-entity; then find all triplets in which the entity attribute is the subject and the object is content, namely entity attribute-pseudo relation 2-descriptive content, entity attribute-pseudo relation 3-application content and entity attribute-pseudo relation 4-contained content; after removing the pseudo relations, merge each such pair into a new triplet with the shared entity attribute as the predicate, namely entity-entity attribute-content, thereby finally realizing open information extraction;
if an extracted triplet does not take an entity attribute as its subject, it is used directly as a result.
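The pseudo-relation merge of step S64 is a simple join on the entity attribute. A sketch; the relation labels "pseudo1" through "pseudo4" and the sample triplets are illustrative stand-ins:

```python
def merge_pseudo(triplets):
    """For an entity attribute A, (A, 'pseudo1', E) names its owning
    entity E, and (A, 'pseudoK', C) for K in 2..4 carries content C;
    each such pair merges to (E, A, C). Other triplets pass through."""
    owner = {s: o for s, r, o in triplets if r == "pseudo1"}
    merged = []
    for s, r, o in triplets:
        if r == "pseudo1":
            continue                            # consumed by the merge
        if r in ("pseudo2", "pseudo3", "pseudo4") and s in owner:
            merged.append((owner[s], s, o))     # entity - attribute - content
        else:
            merged.append((s, r, o))            # ordinary closed-type triplet
    return merged

triplets = [("parameter count", "pseudo1", "BERT"),
            ("parameter count", "pseudo2", "110 million"),
            ("BERT", "proposer", "Devlin et al.")]
print(merge_pseudo(triplets))
```

Because the attribute itself becomes the predicate of the merged triplet, the predicate vocabulary is open-ended, which is how the pseudo relations realize open information extraction.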
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The method casts information extraction as a machine reading comprehension task and predicts the start point and end point of each piece of key information in the text, solving the severe performance degradation of sequence labeling models on long-span knowledge text. With this method, an unbounded set of relation types can be extracted, and the two paradigms of closed and open information extraction are unified in the same framework, improving extraction accuracy.
2. The method enjoys the advantages of a pre-trained model: it retains strong generalization performance even when labeled samples are scarce, and can cope with paragraph-level text and highly varied expressions of knowledge.
3. The invention improves on the HBT model by placing the subject prediction module and the feature fusion module at appropriate depths, so that the model keeps a suitable degree of feature sharing to improve performance while avoiding the negative influence of the span difference between subjects and objects, improving the overall performance of the information extraction model.
4. The method optimizes the model with a squared binary cross-entropy loss, which acts as online hard-example mining: it makes the model attend more to choosing start and end points correctly, alleviates the class imbalance caused by the large number of negative samples, and enlarges the classification margin of positive samples, improving the overall performance of the model. Generalization is further improved by randomly exchanging entities, entity attributes, application contents, contained contents, aliases and proposers within the same category, and by randomly editing descriptive contents.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a flow chart of the information extraction model training step of the present invention;
FIG. 3 is a block diagram of a subject prediction module of the information extraction model of the present invention;
FIG. 4 is a block diagram of the feature fusion module of the information extraction model of the present invention;
FIG. 5 is a block diagram of the predicate-object prediction module of the information extraction model of the present invention;
FIG. 6 is a flow chart of the information extraction model reasoning of the present invention;
FIG. 7 is a diagram of a pseudo-relational knowledge integration and merging method of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
Examples
As shown in FIG. 1, the invention discloses a method for extracting key information of a document in the artificial intelligence field, which comprises the following steps:
s1, collecting document data in the artificial intelligence field, and then utilizing the collected data to extract key information and annotate the data, wherein the method specifically comprises the following steps of:
s11, collecting unstructured text paragraphs derived from scientific publications, documents and network science popularization knowledge related to the artificial intelligence field, and limiting the length of the text paragraphs to be within 510 characters;
s12, defining the type of the key information triplet to be extracted, specifically:
defining 5 triplet types by adopting a common relation definition method:
entity-description-descriptive content, entity-proposer-proposer name, entity-contains-contained content, entity-application-application content, and entity-alias-alias name;
defining 4 triplet types by adopting a pseudo relation definition method:
entity attribute-pseudo relationship 1-entity, entity attribute-pseudo relationship 2-descriptive content, entity attribute-pseudo relationship 3-application content, and entity attribute-pseudo relationship 4-containing content.
S13, marking the defined triplet types, specifically:
Open the text to be annotated in the open-source text annotation tool brat; with the mouse cursor, select a span of characters in the text as the starting entity (subject) of a triplet, and click the subject's entity category in the pop-up selection window; select the ending entity (object) of the triplet and its category in the same way; finally, generate a relation link between the two by selecting the subject with the mouse and dragging it onto the object, and choose the category of the relation link in the pop-up selection window to complete the annotation of that triplet. Repeat these steps until all triplets in all texts to be annotated are labeled.
S2, further pretraining a pretraining model RoBERTa in unstructured text in the artificial intelligence field in a self-supervision mode, wherein the method specifically comprises the following steps:
First, the training texts are segmented into words with the jieba word segmentation tool, and the RoBERTa model to be trained is initialized with the published pre-trained RoBERTa parameters. Then, based on the jieba segmentation result, part of the words are randomly replaced with a preset marker in each iteration; the processed result is input into the RoBERTa model, which is trained to predict the words replaced by the marker. In this embodiment, the preset marker used is [MASK];
the RoBERTa model specifically comprises three Embedding layers with feature dimension 756, twelve Transformer layers with feature dimension 756, and one fully connected layer whose input channel count is 756 and whose output channel count equals the total number of character types in all the training text data;
the three Embedding layers are respectively a Token Embedding layer, a Position Embedding layer and a Segment Embedding layer;
The three Embedding layers each map the text data input to the model into a feature vector of shape (number of input text segments) × 512 × 756; the three output feature vectors are summed to obtain a feature vector of the same shape, which serves as the overall output of the Embedding layers and as the input to the twelve Transformer layers of the RoBERTa model. The twelve Transformer layers output a feature vector of shape (number of input text segments) × 512 × 756, which serves as the input to the fully connected layer. The fully connected layer outputs, for each character of each word replaced by the preset marker in the input text segment, the predicted probability of its being each character in the dictionary, where the dictionary is the set of all characters in all input training text segment data. In this embodiment, the preset marker used is [MASK].
S3, constructing an information extraction model, which specifically comprises the following steps:
based on the RoBERTa model, a subject prediction module is added after the 10th Transformer layer of the RoBERTa model, a feature fusion module is added after the subject prediction module, and a predicate-object prediction module is added after the feature fusion module;
as shown in fig. 3, the subject prediction module specifically includes a fully-connected layer with an input channel number of 756 and an output channel number of 2, and a ReLU layer, a Dropout layer, and a Sigmoid activation function layer connected to the fully-connected layer;
as shown in fig. 4, the feature fusion module specifically includes a fully connected layer with 1512 input channels and 756 output channels, and a ReLU layer, a Dropout layer, and the last two Transformer layers of RoBERTa connected to the fully connected layer;
as shown in fig. 5, the predicate-object prediction module specifically includes a fully connected layer with 756 input channels and 2 × (total number of predicate categories) output channels, and a ReLU layer, a Dropout layer, and a Sigmoid activation function layer connected to the fully connected layer.
In this embodiment, the subject prediction module uses the feature vector output by the layer-10 Transformer layer to predict, in parallel, the probabilities of the start and end points of all subjects in the input text paragraph; its input is the layer-10 Transformer output of the information extraction model, a feature vector of shape (number of input text segments) × 512 × 756, and its output is, for each of the 512 character positions corresponding to the original input, the predicted probability that the position is the start point of a subject and the predicted probability that it is the end point of a subject;
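The subject prediction head can be sketched at the shape level as below. The 756 → 2 weight matrix is a random stand-in, Dropout is omitted (it is inactive at inference), and the FC → ReLU → Sigmoid ordering follows the module description above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W = rng.standard_normal((756, 2)) * 0.01  # assumed random weights, 756 -> 2
b = np.zeros(2)

# Stand-in for the layer-10 Transformer output: 1 segment x 512 x 756.
features = rng.standard_normal((1, 512, 756))
logits = np.maximum(features @ W + b, 0.0)  # fully connected layer + ReLU
probs = sigmoid(logits)                     # per-position start/end probabilities
```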
The feature fusion module semantically fuses the features of the selected subject into the feature vector output by the layer-10 Transformer layer of the RoBERTa model for the input text segment, yielding a feature vector with fused subject features; its inputs are the layer-10 Transformer output of shape (number of input text segments) × 512 × 756 and the annotated start-point and end-point positions of the selected subject, each of shape (number of input text segments) × 1; the selected subject is obtained at each iteration by dynamically and randomly choosing one from all annotated subjects of each sample in the batch;
during training, as shown in fig. 2, the feature fusion module first selects, according to the input subject start-point and end-point positions, the vectors at the corresponding positions in the feature vector output by the layer-10 Transformer of RoBERTa, obtaining two vectors of shape (number of input text segments) × 756; each is copied 512 times to obtain two vectors of shape (number of input text segments) × 512 × 756, which are concatenated along the feature dimension into a vector of shape (number of input text segments) × 512 × 1512; this result is fed into the fully connected layer of the feature fusion module, producing an output vector of shape (number of input text segments) × 512 × 756, which is added to the feature vector output by the layer-10 Transformer and then passed through the two Transformer layers of the feature fusion module to obtain the output of the feature fusion module;
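The select-copy-concatenate-project-add sequence above can be sketched at the shape level. The FC weights and feature values are random stand-ins, and the two final Transformer layers of the module are omitted, since only the tensor bookkeeping is being illustrated:

```python
import numpy as np

rng = np.random.default_rng(0)
n, L, d = 2, 512, 756
h10 = rng.standard_normal((n, L, d))   # layer-10 Transformer output (stand-in)
starts = np.array([3, 40])             # selected subject start positions
ends = np.array([5, 44])               # selected subject end positions

sv = h10[np.arange(n), starts]         # (n, 756) start-point vectors
ev = h10[np.arange(n), ends]           # (n, 756) end-point vectors
sv = np.repeat(sv[:, None, :], L, axis=1)   # copy 512 times -> (n, 512, 756)
ev = np.repeat(ev[:, None, :], L, axis=1)
cat = np.concatenate([sv, ev], axis=-1)     # (n, 512, 1512)

W = rng.standard_normal((2 * d, d)) * 0.01  # assumed FC weights, 1512 -> 756
fused = cat @ W + h10                       # project, then add residually
```

The two Transformer layers of the feature fusion module would then be applied to `fused`.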
The input of the predicate-object prediction module is the feature vector with fused subject features output by the feature fusion module; its output is, for each predicate category and for each character position of the text segment input to the information extraction model, the probability that the position is the start character position and the probability that it is the end character position of the object paired with the selected subject under that predicate.
S4, initializing backbone network parameters by using a RoBERTa model obtained by further pre-training, wherein the method specifically comprises the following steps:
the Embedding layers of the information extraction model are initialized with the parameters of the corresponding Embedding layers of the RoBERTa model obtained by further pre-training, and the Transformer layers of the information extraction model are initialized with the parameters of the corresponding Transformer layers of that RoBERTa model;
the initial parameters of the fully connected layers in the subject prediction module, the feature fusion module, and the predicate-object prediction module are obtained by random sampling from a normal distribution with mean 0 and variance 2 divided by the number of input channels of the layer.
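Reading the description as variance 2 / fan_in matches He initialization; since the original wording is ambiguous, the sketch below is one interpretation, not a definitive reading of the patent:

```python
import numpy as np

def he_normal(fan_in, fan_out, seed=0):
    # Sample weights from N(0, 2 / fan_in) -- the He-initialization reading
    # of "mean 0, variance 2 / input channel number"; this is an assumption.
    rng = np.random.default_rng(seed)
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

# Example: the subject prediction head's fully connected layer (756 -> 2).
W = he_normal(756, 2)
```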
S5, training with the annotated data, applying random substitution and data enhancement to the annotated data during training to improve model generalization and reduce overfitting, and calculating the back-propagated error with a squared binary cross-entropy loss, which specifically comprises:
S51, randomly substituting and enhancing the annotated data before it is input to the information extraction model, to improve the generalization performance of the model and reduce overfitting;
S52, training with a squared binary cross-entropy loss, specifically:
before the binary cross-entropy loss is calculated, the subject prediction probability result and the predicate-object prediction probability result output by the Sigmoid activation function layers are respectively squared;
the squared binary cross-entropy loss simultaneously yields the error Ls corresponding to the subject prediction result and the error Lpo corresponding to the predicate-object prediction result, and the final back-propagated error is:
Loss=k1×Ls+k2×Lpo
wherein k1 and k2 are selected according to actual conditions; in this embodiment, both k1 and k2 are 1;
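The combined loss can be sketched as follows; the example probabilities and labels are illustrative, and the clipping constant is an implementation detail added for numerical safety, not taken from the patent:

```python
import numpy as np

def squared_bce(p, y, eps=1e-7):
    # Square the Sigmoid probabilities before the binary cross-entropy,
    # as described above; eps-clipping avoids log(0) and is an addition.
    p2 = np.clip(p ** 2, eps, 1.0 - eps)
    return -np.mean(y * np.log(p2) + (1 - y) * np.log(1 - p2))

k1, k2 = 1.0, 1.0  # weights used in this embodiment

# Illustrative subject and predicate-object predictions with 0/1 labels.
Ls = squared_bce(np.array([0.9, 0.2, 0.7]), np.array([1.0, 0.0, 1.0]))
Lpo = squared_bce(np.array([0.8, 0.1]), np.array([1.0, 0.0]))
loss = k1 * Ls + k2 * Lpo
```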
S53, fine-tuning during training: the learning rate starts at 1e-6, is gradually increased to 5e-5, and is finally gradually decreased.
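The warmup-then-decay schedule of S53 can be sketched as below. Only the 1e-6 starting rate and the 5e-5 peak come from the description; the total step count, warmup length, and linear shape are assumptions:

```python
def lr_at(step, total_steps=10000, warmup_steps=1000,
          lr_start=1e-6, lr_peak=5e-5):
    # Linear warmup from lr_start to lr_peak, then linear decay to zero;
    # step counts and the linear shape are illustrative assumptions.
    if step < warmup_steps:
        return lr_start + (lr_peak - lr_start) * step / warmup_steps
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_peak * (1.0 - frac)
```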
In this embodiment, the random substitution is specifically:
before data is input into an information extraction model in each iteration of a training process, randomly replacing one entity with another entity according to a certain probability, randomly replacing one entity attribute with another entity attribute according to a certain probability, randomly replacing application content with another application content according to a certain probability, randomly replacing inclusion content with another inclusion content according to a certain probability, and randomly replacing a proposer with another proposer according to a certain probability;
The data enhancement is specifically as follows:
in each iteration of the training process, before the data is input to the model, words within the descriptive content are randomly replaced, inserted, or deleted.
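The character-level augmentation described above can be sketched as follows; the per-character probability and the substitute character pool are assumptions for illustration, not values from the patent:

```python
import random

def augment(text, p=0.1, pool="的是在一了有和", seed=0):
    # Randomly replace, insert, or delete characters of a descriptive-content
    # string; p and pool are illustrative assumptions.
    rng = random.Random(seed)
    out = []
    for ch in text:
        r = rng.random()
        if r < p:                 # replace the character
            out.append(rng.choice(pool))
        elif r < 2 * p:           # insert a character after it
            out.extend([ch, rng.choice(pool)])
        elif r < 3 * p:           # delete the character
            continue
        else:
            out.append(ch)
    return "".join(out)

aug = augment("卷积神经网络是一种深度学习模型")
```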
S6, performing information extraction in unstructured text in the artificial intelligence field by using the information extraction model obtained through training to obtain a result triplet, and integrating the result triplet, as shown in FIG. 6, specifically comprising:
S61, first, a text is input to the information extraction model to obtain the predicted probabilities of a subject start point and end point at each position of the text sequence;
all positions whose predicted start-point probability exceeds 0.5 are taken as subject start positions, and all positions whose predicted end-point probability exceeds 0.5 are taken as subject end positions;
for each predicted subject start position, the nearest predicted subject end position at or after it in the text is found and paired with it; for each pair of start and end positions, the content at the corresponding positions in the text is taken as a subject prediction result;
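The threshold-and-pair decoding of S61 can be sketched in plain Python:

```python
def decode_spans(start_probs, end_probs, threshold=0.5):
    # Keep positions whose probability exceeds the threshold, then pair
    # each start with the nearest end at or after it, as described above.
    starts = [i for i, p in enumerate(start_probs) if p > threshold]
    ends = [i for i, p in enumerate(end_probs) if p > threshold]
    spans = []
    for s in starts:
        later = [e for e in ends if e >= s]
        if later:
            spans.append((s, min(later)))
    return spans

start_p = [0.1, 0.9, 0.2, 0.8, 0.1]
end_p   = [0.1, 0.2, 0.9, 0.1, 0.7]
spans = decode_spans(start_p, end_p)  # pairs (1, 2) and (3, 4)
```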
S62, the n matched subject start positions and n matched subject end positions are combined into one batch, giving an n × 2 vector;
Meanwhile, for each subject pair, the layer-10 Transformer output feature vector of the corresponding text is extracted and expanded, yielding an expanded layer-10 Transformer output feature vector of shape n × 512 × 756;
according to the start and end point of each subject, the contents at the corresponding positions of the expanded layer-10 Transformer output feature vector are taken out, giving an n × 756 start-point vector and an n × 756 end-point vector; each is copied 512 times, giving two n × 512 × 756 vectors, which are concatenated along the feature dimension into an n × 512 × 1512 feature vector; after passing through the fully connected layer of the feature fusion module, this becomes an n × 512 × 756 feature vector, which is added to the expanded layer-10 Transformer feature vector and passed through the two Transformer layers of the feature fusion module to obtain the feature vector with fused subject features;
S63, the feature vector with fused subject features obtained in step S62 is input to the predicate-object prediction module to obtain its prediction result;
for each predicate category, positions whose predicted start-point probability exceeds 0.5 are taken as object start positions of that category, and positions whose predicted end-point probability exceeds 0.5 are taken as object end positions of that category;
for each predicted object start position of a predicate category, the nearest predicted object end position of the same category at or after it in the text is found and paired with it; for each such pair of start and end positions, the content at the corresponding positions in the text is taken as an object result;
S64, for the extracted subject-predicate-object triplets, as shown in FIG. 7, if an entity attribute serves as the subject, the triplet whose object is the entity corresponding to the entity attribute, namely entity attribute - pseudo relation 1 - entity, is found first; then all triplets of the same entity attribute whose object is not that entity, namely entity attribute - pseudo relation 2 - descriptive content, entity attribute - pseudo relation 3 - application content, and entity attribute - pseudo relation 4 - inclusion content, are found; after the pseudo relations are removed, these are merged into new triplets with the common entity attribute as the predicate: entity - entity attribute - content, achieving the purpose of open information extraction;
if the extracted triplet does not take the entity attribute as the subject, the extracted triplet is directly taken as a result.
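The pseudo-relation merging of S64 can be sketched as follows; the relation-name strings are illustrative stand-ins for the patent's "pseudo relation 1..4" labels:

```python
def merge_pseudo_triplets(triplets):
    # First map each entity attribute to its entity via pseudo-relation 1,
    # then rewrite the remaining pseudo-relation triplets as
    # (entity, entity attribute, content); other triplets pass through.
    attr_to_entity = {}
    for s, r, o in triplets:
        if r == "pseudo-relation-1":
            attr_to_entity[s] = o
    merged = []
    for s, r, o in triplets:
        if r.startswith("pseudo") and r != "pseudo-relation-1":
            merged.append((attr_to_entity.get(s), s, o))
        elif not r.startswith("pseudo"):
            merged.append((s, r, o))
    return merged

# Hypothetical extraction results for illustration.
triplets = [
    ("learning rate", "pseudo-relation-1", "gradient descent"),
    ("learning rate", "pseudo-relation-2", "controls the step size"),
    ("CNN", "application", "image recognition"),
]
merged = merge_pseudo_triplets(triplets)
# -> [("gradient descent", "learning rate", "controls the step size"),
#     ("CNN", "application", "image recognition")]
```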
It should also be noted that in this specification, terms such as "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. The method for extracting the key information of the document in the artificial intelligence field is characterized by comprising the following steps of:
s1, collecting document data in the artificial intelligence field, and then carrying out key information extraction data annotation by utilizing the collected data;
s2, performing further pre-training on a pre-training model RoBERTa in unstructured text in the field of artificial intelligence;
s3, constructing an information extraction model;
s4, initializing backbone network parameters of the information extraction model by using the RoBERTa model obtained by further pre-training;
s5, training by using the annotated data, randomly substituting the annotated data and enhancing the data in the training process, and calculating a back-propagated error by using a squared binary cross-entropy loss;
S6, information extraction is carried out in unstructured text in the artificial intelligence field by utilizing the information extraction model obtained through training to obtain a result triplet, and the result triplet is integrated;
the RoBERTa model specifically comprises three Embedding layers with a feature dimension of 756, twelve Transformer layers with a feature dimension of 756, and one fully connected layer with 756 input channels, wherein the number of output channels of the fully connected layer is the total number of character types in all training text data;
the construction information extraction model specifically comprises the following steps:
based on the RoBERTa model, a subject prediction module is added after the layer-10 Transformer layer of the RoBERTa model, a feature fusion module is added after the subject prediction module, and a predicate-object prediction module is added after the feature fusion module;
the subject prediction module specifically comprises a full-connection layer with the input channel number of 756 and the output channel number of 2, and a ReLU layer, a Dropout layer and a Sigmoid activation function layer which are connected with the full-connection layer;
the feature fusion module specifically comprises a fully connected layer with 1512 input channels and 756 output channels, and a ReLU layer, a Dropout layer, and the last two Transformer layers of RoBERTa connected to the fully connected layer;
the predicate-object prediction module specifically comprises a fully connected layer with 756 input channels and 2 × (total number of predicate categories) output channels, and a ReLU layer, a Dropout layer, and a Sigmoid activation function layer connected to the fully connected layer;
the input of the subject prediction module is the feature vector of shape (number of input text segments) × 512 × 756 output by the layer-10 Transformer layer of the information extraction model, and the output is, for each of the 512 character positions corresponding to the original input, the predicted probability that the position is the start point of a subject and the predicted probability that it is the end point of a subject;
the feature fusion module semantically fuses the features of the selected subject into the feature vector output by the layer-10 Transformer layer of the RoBERTa model for the input text segment, yielding a feature vector with fused subject features; its inputs are the layer-10 Transformer output of shape (number of input text segments) × 512 × 756 and the annotated start-point and end-point positions of the selected subject, each of shape (number of input text segments) × 1; the selected subject is obtained at each iteration by dynamically and randomly choosing one from all annotated subjects of each sample in the training text data input to the model;
during training, the feature fusion module first selects, according to the input subject start-point and end-point positions, the vectors at the corresponding positions in the feature vector output by the layer-10 Transformer of RoBERTa, obtaining two vectors of shape (number of input text segments) × 756; each is copied 512 times to obtain two vectors of shape (number of input text segments) × 512 × 756, which are concatenated along the feature dimension into a vector of shape (number of input text segments) × 512 × 1512; this result is fed into the fully connected layer of the feature fusion module, producing an output vector of shape (number of input text segments) × 512 × 756, which is added to the feature vector output by the layer-10 Transformer and then passed through the two Transformer layers of the feature fusion module to obtain the output of the feature fusion module;
the input of the predicate-object prediction module is the feature vector with fused subject features output by the feature fusion module; its output is, for each predicate category and for each character position of the text segment input to the information extraction model, the probability that the position is the start character position and the probability that it is the end character position of the object paired with the selected subject under that predicate.
2. The method for extracting key information from documents in the artificial intelligence field according to claim 1, wherein the step S1 specifically includes:
s11, collecting unstructured text paragraphs derived from scientific publications, documents and network science popularization knowledge related to the artificial intelligence field, and limiting the length of the text paragraphs to be within 510 characters;
s12, defining the type of the key information triplet to be extracted, specifically:
defining 5 triplet types by adopting a common relation definition method:
entity - description - descriptive content, entity - proposer - proposer name, entity - inclusion - inclusion content, entity - application - application content, and entity - alias - alias name;
defining 4 triplet types by adopting a pseudo relation definition method:
entity attribute-pseudo relationship 1-entity, entity attribute-pseudo relationship 2-descriptive content, entity attribute-pseudo relationship 3-application content, and entity attribute-pseudo relationship 4-inclusion content;
S13, marking the defined triplet types, specifically:
the text to be annotated is opened in the open-source text annotation tool brat; a span of characters in the text is selected with the mouse cursor as the starting entity (subject) of a triplet, and the entity category of the subject is selected in the pop-up window; the ending entity (object) of the triplet and its category are selected in the same way; finally, a relation link between the two is created by selecting the subject with the mouse and dragging it onto the object, and the category of the relation link is chosen in the pop-up window, completing the annotation of the triplet; these steps are repeated until all triplets in all texts to be annotated have been annotated.
3. The method for extracting key information from documents in the artificial intelligence field according to claim 1, wherein the three Embedding layers are respectively a Token Embedding layer, a Position Embedding layer, and a Segment Embedding layer;
the three Embedding layers each map the text data input to the model into a feature vector of shape (number of input text segments) × 512 × 756; the three output feature vectors are summed to obtain a feature vector of the same shape, which serves as the joint output of the three Embedding layers and as the input to the twelve Transformer layers of the RoBERTa model; the twelve Transformer layers of the RoBERTa model output a feature vector of shape (number of input text segments) × 512 × 756 as the input to the fully connected layer, whose output is, for each character in each word replaced by a preset mark in the input text segment, the predicted probability of that character being each character in a dictionary, the dictionary being the set of all characters of all input training text segment data.
4. The method for extracting key information from documents in the artificial intelligence field according to claim 1, wherein the step S2 is specifically:
first, the training text is segmented into words using the jieba word segmentation tool, and the parameters of the RoBERTa model to be trained are initialized with the pre-trained RoBERTa model parameters; then, based on the jieba word segmentation result, a preset mark is used in each iteration to randomly replace some of the words, the processed result is input to the pre-training model RoBERTa, and the pre-training model RoBERTa is used to predict the words replaced by the mark.
5. The method for extracting key information from documents in the artificial intelligence field according to claim 1, wherein the initializing of backbone network parameters in step S4 is specifically:
the Embedding layers of the information extraction model are initialized with the parameters of the corresponding Embedding layers of the RoBERTa model obtained by further pre-training, and the Transformer layers of the information extraction model are initialized with the parameters of the corresponding Transformer layers of that RoBERTa model;
the initial parameters of the fully connected layers in the subject prediction module, the feature fusion module, and the predicate-object prediction module are obtained by random sampling from a normal distribution with mean 0 and variance 2 divided by the number of input channels of the layer.
6. The method for extracting key information from documents in the artificial intelligence field according to claim 1, wherein the step S5 specifically includes:
s51, randomly replacing and enhancing the labeling data before the model is extracted from the input information to improve the generalization performance of the model and reduce the overfitting;
s52, training with a squared binary cross-entropy loss, specifically:
before the binary cross-entropy loss is calculated, the subject prediction probability result and the predicate-object prediction probability result output by the Sigmoid activation function layers are respectively squared;
the squared binary cross-entropy loss simultaneously yields the error Ls corresponding to the subject prediction result and the error Lpo corresponding to the predicate-object prediction result, and the final back-propagated error is:
Loss=k1×Ls+k2×Lpo
wherein k1 and k2 are selected according to actual conditions;
s53, fine-tuning during training: the learning rate starts at 1e-6, is gradually increased to 5e-5, and is finally gradually decreased.
7. The method for extracting key information from documents in the artificial intelligence field according to claim 6, wherein in the step S51, the random substitution is specifically:
before data is input into the information extraction model in each iteration of the training process, one entity is randomly replaced with another entity according to a certain probability, one entity attribute is randomly replaced with another entity attribute according to a certain probability, application content is randomly replaced with other application content according to a certain probability, inclusion content is randomly replaced with other inclusion content according to a certain probability, and a proposer is randomly replaced with another proposer according to a certain probability;
the data enhancement is specifically as follows:
in each iteration of the training process, before the data is input to the information extraction model, words within the descriptive content are randomly replaced, inserted, or deleted.
8. The method for extracting key information from documents in the artificial intelligence field according to claim 2, wherein the step S6 specifically includes:
s61, firstly inputting a text into an information extraction model to obtain a prediction probability result of a start point and an end point of a subject at each position in a text sequence;
all positions whose predicted start-point probability exceeds 0.5 are taken as subject start positions, and all positions whose predicted end-point probability exceeds 0.5 are taken as subject end positions;
for each predicted subject start position, the nearest predicted subject end position at or after it in the text is found and paired with it; for each pair of start and end positions, the content at the corresponding positions in the text is taken as a subject prediction result;
S62, combining predicted n matched subject starting point positions and predicted n matched subject end point positions into a batch to obtain an n multiplied by 2 vector;
meanwhile, for each subject pair, the layer-10 Transformer output feature vector of the corresponding text is extracted and expanded, yielding an expanded layer-10 Transformer output feature vector of shape n × 512 × 756;
according to the start and end point of each subject, the contents at the corresponding positions of the expanded layer-10 Transformer output feature vector are taken out, giving an n × 756 start-point vector and an n × 756 end-point vector; each is copied 512 times, giving two n × 512 × 756 vectors, which are concatenated along the feature dimension into an n × 512 × 1512 feature vector; after passing through the fully connected layer of the feature fusion module, this becomes an n × 512 × 756 feature vector, which is added to the expanded layer-10 Transformer feature vector and passed through the two Transformer layers of the feature fusion module to obtain the feature vector with fused subject features;
s63, the feature vector with fused subject features obtained in step S62 is input to the predicate-object prediction module to obtain its prediction result;
for each predicate category, positions whose predicted start-point probability exceeds 0.5 are taken as object start positions of that category, and positions whose predicted end-point probability exceeds 0.5 are taken as object end positions of that category;
for each predicted object start position of a predicate category, the nearest predicted object end position of the same category at or after it in the text is found and paired with it; for each such pair of start and end positions, the content at the corresponding positions in the text is taken as an object result;
s64, for the extracted subject-predicate-object triplets, if an entity attribute serves as the subject, the triplet whose object is the entity corresponding to the entity attribute, namely entity attribute - pseudo relation 1 - entity, is found first; then all triplets of the same entity attribute whose object is not that entity, namely entity attribute - pseudo relation 2 - descriptive content, entity attribute - pseudo relation 3 - application content, and entity attribute - pseudo relation 4 - inclusion content, are found; after the pseudo relations are removed, these are merged into new triplets with the common entity attribute as the predicate, namely entity - entity attribute - content, finally realizing open information extraction;
if the extracted triplet does not take the entity attribute as the subject, the extracted triplet is directly taken as a result.
CN202110353610.XA 2021-04-01 2021-04-01 Method for extracting key information of documents in artificial intelligence field Active CN113158674B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110353610.XA CN113158674B (en) 2021-04-01 2021-04-01 Method for extracting key information of documents in artificial intelligence field


Publications (2)

Publication Number Publication Date
CN113158674A CN113158674A (en) 2021-07-23
CN113158674B true CN113158674B (en) 2023-07-25

Family

ID=76886346

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110353610.XA Active CN113158674B (en) 2021-04-01 2021-04-01 Method for extracting key information of documents in artificial intelligence field

Country Status (1)

Country Link
CN (1) CN113158674B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114297987B (en) * 2022-03-09 2022-07-19 杭州实在智能科技有限公司 Document information extraction method and system based on text classification and reading understanding
CN114861629B (en) * 2022-04-29 2023-04-04 电子科技大学 Automatic judgment method for text style
CN116720502B (en) * 2023-06-20 2024-04-05 中国航空综合技术研究所 Aviation document information extraction method based on machine reading understanding and template rules

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112015859B (en) * 2019-05-31 2023-08-18 百度在线网络技术(北京)有限公司 Knowledge hierarchy extraction method and device for text, computer equipment and readable medium


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DeepText: A Unified Framework for Text Proposal Generation and Text Detection in Natural Images; Zhong Z et al.; IEEE International Conference on Acoustics; pp. 1208-1212 *
Handwritten Digit Recognition Based on Adaboost; 赵万鹏 et al.; Computer Applications; Vol. 25, No. 10; pp. 576-589 *
A Survey of the Application of Deep Learning in Handwritten Chinese Character Recognition; 金连文 et al.; Acta Automatica Sinica; Vol. 42, No. 8; pp. 1125-1141 *

Also Published As

Publication number Publication date
CN113158674A (en) 2021-07-23

Similar Documents

Publication Publication Date Title
CN113158674B (en) Method for extracting key information of documents in artificial intelligence field
CN109271529B (en) Method for constructing bilingual knowledge graph of Xilier Mongolian and traditional Mongolian
CN111209401A (en) System and method for classifying and processing sentiment polarity of online public opinion text information
CN109960728B (en) Method and system for identifying named entities of open domain conference information
CN110609983B (en) Structured decomposition method for policy file
CN111462752B (en) Attention mechanism, feature embedding and BI-LSTM (business-to-business) based customer intention recognition method
CN112541337B (en) Document template automatic generation method and system based on recurrent neural network language model
CN115080694A (en) Power industry information analysis method and equipment based on knowledge graph
CN113190656A (en) Chinese named entity extraction method based on multi-label framework and fusion features
CN116542817B (en) Intelligent digital lawyer consultation method and system
CN112257442B (en) Policy document information extraction method based on corpus expansion neural network
CN113312922A (en) Improved chapter-level triple information extraction method
CN111651569B (en) Knowledge base question-answering method and system in electric power field
CN113868380A (en) Few-sample intention identification method and device
CN114004231A (en) Chinese special word extraction method, system, electronic equipment and storage medium
CN116383395A (en) Method for constructing knowledge graph in hydrologic model field
CN115600605A (en) Method, system, equipment and storage medium for jointly extracting Chinese entity relationship
CN110110137A (en) Method and device for determining music characteristics, electronic equipment and storage medium
CN117828024A (en) Plug-in retrieval method, device, storage medium and equipment
CN113051886A (en) Test question duplicate checking method and device, storage medium and equipment
CN112749566B (en) Semantic matching method and device for English writing assistance
CN116628151A (en) Question-answering system and method based on Ling nan building knowledge graph
CN115203429B (en) Automatic knowledge graph expansion method for constructing ontology framework in auditing field
CN114943235A (en) Named entity recognition method based on multi-class language model
CN109885827B (en) Deep learning-based named entity identification method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant