CN114817564A - Attribute extraction method and device and storage medium - Google Patents

Attribute extraction method and device and storage medium

Info

Publication number
CN114817564A
CN114817564A
Authority
CN
China
Prior art keywords
text
vector representation
word
attribute
global vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210458635.0A
Other languages
Chinese (zh)
Inventor
陈文亮
张世奇
周夏冰
张民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN202210458635.0A priority Critical patent/CN114817564A/en
Publication of CN114817564A publication Critical patent/CN114817564A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention converts the attribute extraction task into a span-extraction machine reading comprehension task and adopts a multi-task model that jointly trains attribute extraction and text attribute judgment. The model uses BERT-Bi-LSTM as its encoding module, encodes the input text and question separately, and uses structured information as the question to enhance the generalization capability of the model. A word boundary feature enhancement method then helps the model capture the boundary features of attribute values, and word features are merged into the global vector features through a multi-head attention mechanism. Meanwhile, a text feature interaction method is designed to judge whether the attribute value corresponding to a question exists in the text; this serves as an auxiliary task that is jointly trained with the attribute value boundary prediction task.

Description

Attribute extraction method and device and storage medium
Technical Field
The present invention relates to the field of natural language processing, and in particular to an attribute extraction method, apparatus, device, and computer storage medium.
Background
At present, domains such as e-commerce, film and television, and healthcare all seek to construct high-quality domain knowledge graphs, and the attribute extraction task is one of the key links in knowledge graph construction. The task is oriented toward unstructured text in a vertical domain and aims to extract the attributes and attribute values related to entities. Taking e-commerce data as an example, given the commodity category "jacket" and the description text "foreign trade man autumn hood-linked excellent splicing lappet hip-hop jacket big code jacket", the goal is to extract from the description text the attributes and attribute values related to the "jacket", such as "material-lappet" and "style-hip-hop", where "material" and "style" are attributes of the "jacket" and "lappet" and "hip-hop" are the corresponding attribute values. The attribute extraction task improves the completeness of the knowledge graph's entity nodes and enhances the user's interactive experience with the knowledge graph.
Existing attribute extraction methods fall mainly into rule-based methods, traditional machine learning methods, and deep learning methods. Rule-based methods require manually constructing rule templates for the target domain and using them to match the attributes and attribute values corresponding to an entity in natural language text. Because the rules are formulated for a single domain, such methods transfer poorly to other domains. When the rule set grows large, it becomes difficult to maintain, and whenever text appears that existing rules cannot cover, additional rules must be designed, a process that is time-consuming and labor-intensive. Methods based on traditional machine learning generally adopt a supervised learning strategy and require a large amount of annotated corpus to train the model so that it can fully learn the attribute characteristics contained in the data.
In recent years, deep learning methods have been widely applied to information extraction tasks in natural language processing and have achieved good results in named entity recognition, event extraction, relation extraction, and joint entity-relation extraction. For example, the Recurrent Neural Network (RNN), the Long Short-Term Memory (LSTM) network, and the Gated Recurrent Unit (GRU) network stand out for their ability to extract information from natural text. In addition, some researchers combine the attention mechanism with BiLSTM-CRF to capture the inherent semantic relations of commodity titles, so that the model can better extract the attributes and attribute values corresponding to a commodity title. Currently, pre-trained language models such as BERT, ALBERT, RoBERTa, ELECTRA, and XLNet have become the mainstream encoders for information extraction tasks such as attribute extraction by virtue of their excellent encoding capability.
The prior art related to attribute extraction has the following disadvantages:
1. The feature extraction method based on manual templates requires time-consuming and labor-intensive manual data filtering, and extraction quality is difficult to guarantee with fuzzy matching. Templates built on expert knowledge are costly, their coverage is limited, and they cannot be applied flexibly. This approach also cannot extract new attribute values that do not appear in the data.
2. The feature extraction method based on the bidirectional long short-term memory network struggles with long-distance dependencies and easily loses information.
3. Methods based on pre-trained language models do not fully consider lexical information, so the model has difficulty judging entity boundaries and its generalization capability is insufficient.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is to overcome the long-distance dependency and insufficient generalization capability problems of the prior art.
In order to solve the above technical problem, the present invention provides an attribute extraction method, apparatus, device, and computer storage medium, the method comprising:
inputting a preprocessed question and text into a pre-trained attribute extraction model, wherein the question is the triple with its head and tail entities replaced by MASK tokens, namely the structured information;
computing a question global vector representation and a first text global vector representation with a BERT model, and encoding the first text global vector representation through a bidirectional long short-term memory layer (Bi-LSTM) to obtain a second text global vector representation;
interacting the second text global vector representation with the question global vector representation through a multi-head attention mechanism to obtain a text global vector representation with the question structured-information generalization features;
inputting the text into an automatic word segmentation tool to obtain a word segmentation result and a word segmentation vector representation of the text;
adding the word segmentation vector representation at the corresponding positions of the text global vector representation with the question structured-information generalization features, according to the absolute position indexes of each word's head and tail labels in the word segmentation result, to obtain a final text vector representation;
and predicting the attribute value boundary to be extracted in the final text vector representation to obtain a target attribute value.
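For orientation, the following is a minimal PyTorch-style sketch of the forward pass implied by these steps. All module names, dimensions, and the index-addition scheme are illustrative assumptions, not the patent's reference implementation; the sketch assumes head and tail indexes shared across the batch for simplicity.

```python
import torch
import torch.nn as nn

class AttributeExtractor(nn.Module):
    """Sketch: BERT -> Bi-LSTM -> question/text attention
    -> word-boundary enhancement -> start/end prediction."""

    def __init__(self, bert, hidden=768, heads=8):
        super().__init__()
        self.bert = bert                                  # pre-trained encoder
        self.bilstm = nn.LSTM(hidden, hidden // 2, batch_first=True,
                              bidirectional=True)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.start_fc = nn.Linear(hidden, 1)              # boundary heads with
        self.end_fc = nn.Linear(hidden, 1)                # separate parameters

    def forward(self, question_ids, text_ids, word_vecs, head_idx, tail_idx):
        x_q = self.bert(question_ids).last_hidden_state   # question vectors X^q
        x_s = self.bert(text_ids).last_hidden_state       # first text vectors X^s
        o_s, _ = self.bilstm(x_s)                         # second text vectors O
        h, _ = self.attn(o_s, x_q, x_q)                   # question-aware text
        h_v = h.clone()                                   # word-boundary enhancement:
        h_v[:, head_idx] += word_vecs                     # add each word vector at its
        h_v[:, tail_idx] += word_vecs                     # head and tail positions
        s = torch.sigmoid(self.start_fc(h_v)).squeeze(-1) # start probabilities s_i
        e = torch.sigmoid(self.end_fc(h_v)).squeeze(-1)   # end probabilities e_i
        return s, e
```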
Preferably, computing the question global vector representation and the first text global vector representation with the BERT model, and encoding the first text global vector representation through the bidirectional long short-term memory layer Bi-LSTM to obtain the second text global vector representation, comprises:

segmenting the question Q and the text S into words, where each word is represented by a token word vector $TE(w_i)$, a segment word vector $SE(w_i)$ distinguishing the two different sentences, and a position word vector $PE(w_i)$, to obtain the vector representations of the question and the text;

inputting the vector representations of the question and the text into the BERT model to obtain the encoded question global vector representation $X^q = \{x_1^q, x_2^q, \dots, x_n^q\}$ and the first text global vector representation $X^s = \{x_1^s, x_2^s, \dots, x_m^s\}$, where $x_i^q$ is the vector representation of each character in the question after BERT encoding and $x_i^s$ is the vector representation of each character in the text after BERT encoding;

encoding the first text global vector representation $X^s$ with the bidirectional long short-term memory layer Bi-LSTM to obtain the second text global vector representation $O = \{o_1, o_2, \dots, o_m\}$, where $o_i$ is the vector representation of each character in the text after Bi-LSTM encoding.
Preferably, encoding the first text global vector representation $X^s$ with the bidirectional long short-term memory layer Bi-LSTM to obtain the second text global vector representation $O = \{o_1, o_2, \dots, o_m\}$ comprises:

encoding the first text global vector $X^s$ to obtain the encoded second text global vector representation $O$, where the hidden state $o_i$ at each time step $i$ is obtained by concatenating the hidden state $\overrightarrow{h_i}$ of the forward LSTM and the hidden state $\overleftarrow{h_i}$ of the backward LSTM, with the following calculation formulas:

$$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(x_i^s, \overrightarrow{h_{i-1}})$$

$$\overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(x_i^s, \overleftarrow{h_{i+1}})$$

$$o_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$$
preferably, the adding the word segmentation vector representation to the corresponding position of the text global vector representation having the problem structured information generalization feature according to the absolute position index of the word head-tail label in the word segmentation result to obtain the final text vector representation includes:
the word position in the text word segmentation result is represented as:
P[a i ,t i ]={p 1 [a 1 ,t 1 ],p 2 [a 2 ,t 2 ]…p n [a n ,t n ]in which a is i 、t i An absolute position index, p, representing the head and tail labels of each word in the text, respectively n Represents the nth word;
in the text global vector representation with problem structured information generalization characteristics
Figure BDA0003599299780000041
Figure BDA0003599299780000042
Adding the word segmentation vector representation V containing word time sequence characteristics after Bi-LSTM and normalization into the corresponding position to obtain the final text vector representation H v
Preferably, predicting the attribute value boundary to be extracted in the final text vector representation to obtain the target attribute value comprises:

using two linear layers to respectively predict the probability of each character in the final text vector being the start position $s$ and the end position $e$:

$$s_i = \mathrm{sigmoid}(\mathrm{FFN}(H_v))$$

$$e_i = \mathrm{sigmoid}(\mathrm{FFN}(H_v))$$

where $s_i$ denotes the probability of the $i$-th character of the text being the start position of the attribute value, and $e_i$ denotes the probability of the $i$-th character of the text being the end position of the attribute value;

and taking the start position and the corresponding end position as the coordinates of the target attribute value.
Preferably, the training process of the attribute extraction model includes an attribute value boundary prediction task, whose specific steps are:

constructing a corresponding training set;

training the model with the training set until the loss function converges, the loss function including the loss $loss_s$ of each character being a start position and the loss $loss_e$ of each character being an end position:

$$loss_s = -\sum_i \left[\hat{y}_i^s \log s_i + (1 - \hat{y}_i^s)\log(1 - s_i)\right]$$

$$loss_e = -\sum_i \left[\hat{y}_i^e \log e_i + (1 - \hat{y}_i^e)\log(1 - e_i)\right]$$

where $\hat{y}_i^s$ and $\hat{y}_i^e$ are the boundary representations of the true attribute values.
Preferably, the training process of the attribute extraction model includes a text attribute type classification task, whose specific steps are:

taking the CLS token outputs $h_{cls}^s$ and $h_{cls}^q$ of the BERT model for the texts and questions in the training set as the text feature representation and the attribute type feature representation;

interacting $h_{cls}^s$ and $h_{cls}^q$ through a multi-head attention mechanism to obtain the comprehensive classification feature $h_{Att}$;

and using a classifier, trained on the comprehensive classification feature, to judge whether the attribute value to be extracted related to the attribute type to be extracted exists in the text, so that the model pays more attention to the attribute values in the text related to the attribute to be extracted, with the loss function:

$$loss_{class} = -\sum_j y_j \log P_j$$

where $y_j$ denotes the true class value and $P_j$ denotes the predicted value for the $j$-th attribute type.
The invention also provides an attribute extraction apparatus, comprising:

an input module, configured to input a preprocessed question and text into a pre-trained attribute extraction model, wherein the question is the triple with its head and tail entities replaced by MASK tokens, namely the structured information;

an encoding module, configured to compute a question global vector representation and a first text global vector representation with a BERT model, and encode the first text global vector representation through a bidirectional long short-term memory layer Bi-LSTM to obtain a second text global vector representation;

an interaction module, configured to interact the second text global vector representation with the question global vector representation through a multi-head attention mechanism to obtain a text global vector representation with the question structured-information generalization features;

a word segmentation module, configured to input the text into an automatic word segmentation tool to obtain a word segmentation result and a word segmentation vector representation of the text;

a word boundary enhancement module, configured to add the word segmentation vector representation at the corresponding positions of the text global vector representation with the question structured-information generalization features, according to the absolute position indexes of each word's head and tail labels in the word segmentation result, to obtain a final text vector representation;

and an extraction module, configured to predict the attribute value boundary to be extracted in the final text vector representation to obtain a target attribute value.
The invention also provides an attribute extraction device, comprising:

a memory for storing a computer program; and a processor for implementing the steps of the above attribute extraction method when executing the computer program.

The invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the attribute extraction method described above.
Compared with the prior art, the technical scheme of the invention has the following advantages:

The attribute extraction task is converted into a span-extraction machine reading comprehension task. The model uses BERT-Bi-LSTM as its encoding module and encodes the input text and question separately; the head and tail entities in the triple are replaced with mask labels so that the labels are not exposed when context information is used, and the structured information serves as the question, enhancing the generalization capability of the model. The bidirectional long short-term memory layer integrates forward and backward information of the text, so that the model fully captures the text's temporal and semantic features. A word boundary feature enhancement method helps the model capture the boundary features of attribute values: complete lexical information is added at the start and end positions of each word based on the word segmentation result, and a multi-head attention mechanism then lets the word segmentation vectors interact with the global features of the text to fuse in the lexical information. The word boundary features strengthen the model's judgment of the head and tail positions of attribute values, deepen its grasp of attribute value boundaries, help it understand sentence structure, and help it recognize more out-of-vocabulary words. The invention uses word boundary information and text attribute features to alleviate the difficulty of exploiting no-answer data in machine reading comprehension models and the difficulty of extracting out-of-vocabulary words in the attribute extraction task, thereby effectively improving entity attribute extraction.
Drawings
In order that the present disclosure may be more readily and clearly understood, reference is now made to the following detailed description of the present disclosure taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart of an implementation of the attribute extraction of the present invention;
FIG. 2 is a detailed block diagram of the vocabulary enhancement module of the present invention;
FIG. 3 is an overall block diagram of the attribute extraction model of the present invention;
FIG. 4 is a block diagram of an attribute extraction apparatus according to an embodiment of the present invention.
Detailed Description
The core of the invention is to provide an attribute extraction method, apparatus, device, and computer storage medium that fully acquire context information and improve the generalization capability of the model.
In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 1, FIG. 1 is a flowchart of an implementation of attribute extraction according to the present invention; the specific operation steps are as follows:
S101: inputting the preprocessed question and text into a pre-trained attribute extraction model, wherein the question is the triple with its head and tail entities replaced by MASK tokens, and the triple is the structured information;

before being input, the question and the text are preprocessed into the forms "[CLS] + text + [SEP]" and "[CLS] + question + [SEP]";

a triple takes the form head entity - attribute - tail entity (attribute value);

the question, with MASK tokens replacing the head and tail entities, is "[CLS] + [MASK] attribute [MASK] + [SEP]".
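As a concrete illustration, the question construction above might look like the following sketch. The tokenizer call follows the Hugging Face transformers API, which adds [CLS] and [SEP] automatically; the checkpoint name and the example attribute string are placeholder assumptions.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint

def build_question(attribute: str) -> str:
    # Triple "head entity - attribute - tail entity" with both entities
    # masked out, so only the structured attribute information remains.
    return f"[MASK]{attribute}[MASK]"

text = "米色毛衣"                      # description text ("beige sweater")
question = build_question("颜色")      # attribute type, e.g. "color" (assumed)

# BERT-style packing: [CLS] + question + [SEP] and [CLS] + text + [SEP]
q_enc = tokenizer(question, return_tensors="pt")
s_enc = tokenizer(text, return_tensors="pt")
```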
S102: computing the question global vector representation and the first text global vector representation with the BERT model, and encoding the first text global vector representation through the bidirectional long short-term memory layer Bi-LSTM to obtain the second text global vector representation;
BERT adopts a stack of 12 identical Transformer encoder layers and can capture global features of the text from different angles. Each layer consists of two substructures, a multi-head attention layer and a feed-forward neural network, and the output of each substructure passes through a residual connection and layer normalization. Multi-head attention obtains self-attention vectors for several subspaces and concatenates the vectors of all subspaces to form its output; $U$ denotes the multi-head attention vector and $u_i$ denotes the attention vector of one subspace. The feed-forward neural network projects the output of the multi-head attention layer:

$$U = \mathrm{Concat}(u_1, u_2, \dots, u_h)W^u$$

$$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

The question Q and the text S are segmented, and the resulting characters are converted into their corresponding IDs as the model input. Each word vector is composed of a token word vector $TE(w_i)$, a segment word vector $SE(w_i)$ distinguishing the two different sentences, and a position word vector $PE(w_i)$, i.e. $E(w_i) = TE(w_i) + SE(w_i) + PE(w_i)$, giving the vector representations of the question and the text;
inputting the vector representations of the question and the text into the BERT model gives the encoded question global vector representation $X^q = \{x_1^q, x_2^q, \dots, x_n^q\}$ and the first text global vector representation $X^s = \{x_1^s, x_2^s, \dots, x_m^s\}$, where $x_i^q$ is the vector representation of each character in the question after BERT encoding and $x_i^s$ is the vector representation of each character in the text after BERT encoding.

Although the recurrent neural network is suitable for modeling serialized data, it suffers from vanishing or exploding gradients during training and has difficulty handling long-distance dependencies. Therefore, Bi-LSTM is adopted to further integrate the text's temporal features. The first text global vector representation $X^s$ is encoded with the bidirectional long short-term memory layer Bi-LSTM to obtain the second text global vector representation $O = \{o_1, o_2, \dots, o_m\}$, where $o_i$ is the vector representation of each character in the text after Bi-LSTM encoding.

The hidden state $o_i$ at each time step $i$ is obtained by concatenating the hidden state $\overrightarrow{h_i}$ of the forward LSTM and the hidden state $\overleftarrow{h_i}$ of the backward LSTM:

$$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(x_i^s, \overrightarrow{h_{i-1}})$$

$$\overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(x_i^s, \overleftarrow{h_{i+1}})$$

$$o_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$$
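A minimal PyTorch sketch of this encoding step, assuming a Hugging Face BERT model; the hidden size and checkpoint are placeholders. The forward and backward LSTM halves concatenate back to the full hidden size, matching $o_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$.

```python
import torch.nn as nn
from transformers import BertModel

class Encoder(nn.Module):
    def __init__(self, hidden: int = 768):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")  # assumed
        self.bilstm = nn.LSTM(hidden, hidden // 2, batch_first=True,
                              bidirectional=True)

    def forward(self, q_enc, s_enc):
        x_q = self.bert(**q_enc).last_hidden_state  # question vectors X^q
        x_s = self.bert(**s_enc).last_hidden_state  # first text vectors X^s
        o_s, _ = self.bilstm(x_s)                   # second text vectors O
        return x_q, o_s
```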
S103: interacting the second text global vector representation with the question global vector representation through a multi-head attention mechanism to obtain the text global vector representation with the question structured-information generalization features;
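One plausible way to realize this interaction is to use the text vectors as queries and the question vectors as keys and values, as in the following sketch; the head count and the residual connection are assumptions, since the patent only states that the two representations interact.

```python
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)

def interact(o_s, x_q):
    # The text (query) attends to the question (key/value), so each character
    # representation absorbs the generalization features of the question's
    # structured information.
    h, _ = attn(query=o_s, key=x_q, value=x_q)
    return o_s + h  # assumed residual connection
```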
S104: inputting the text into an automatic word segmentation tool to obtain the word segmentation result and the word segmentation vector representation of the text;
S105: adding the word segmentation vector representation at the corresponding positions of the text global vector representation with the question structured-information generalization features, according to the absolute position indexes of each word's head and tail labels in the word segmentation result, to obtain the final text vector representation;

the word positions in the text word segmentation result are represented as $P[a_i, t_i] = \{p_1[a_1, t_1], p_2[a_2, t_2], \dots, p_n[a_n, t_n]\}$, where $a_i$ and $t_i$ respectively denote the absolute position indexes of the head and tail labels of each word in the text, and $p_n$ denotes the $n$-th word;

the word segmentation vector representation $V$, which contains word temporal features after Bi-LSTM encoding and normalization, is added at the corresponding positions of the text global vector representation $H$ with the question structured-information generalization features, to obtain the final text vector representation $H_v$.

The decoding operation of the model relies on locating the start and end positions of attribute values to complete their extraction. The word boundary feature enhancement method therefore strengthens the features at the head and tail boundaries of attribute values to help the model judge attribute value boundaries. The vocabulary enhancement module aims to integrate lexical information on top of the BERT-Bi-LSTM encoding and improve the model's grasp of attribute value boundaries. It consists of two parts: word boundary feature enhancement and text-word-segmentation information interaction. Word boundary feature enhancement adds complete word information at the start and end positions of each word based on the word segmentation result; text-word-segmentation information interaction uses a multi-head attention mechanism to let the word segmentation vectors interact with the global features of the text to fuse in the lexical information.
S106: predicting the attribute value boundary to be extracted in the final text vector representation to obtain the target attribute value.

Two linear layers respectively predict the probability of each character in the final text vector being the start position $s$ and the end position $e$; the closer the probability value is to 1, the more likely the character is the start or end position:

$$s_i = \mathrm{sigmoid}(\mathrm{FFN}(H_v))$$

$$e_i = \mathrm{sigmoid}(\mathrm{FFN}(H_v))$$

where $s_i$ denotes the probability of the $i$-th character of the text being the start position of the attribute value, and $e_i$ denotes the probability of the $i$-th character being the end position. The two probabilities are computed by fully connected layers with the same structure but different parameters. Each character is classified independently to judge whether it is a start position, so several start positions can be obtained, and likewise several end positions.

The start position and the corresponding end position are taken as the coordinates of the target attribute value.
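A sketch of the two prediction heads and a simple decoding rule follows; the 0.5 threshold and the nearest-end pairing are assumptions, since the patent only states that several start and end positions may be obtained and that starts are paired with corresponding ends.

```python
import torch
import torch.nn as nn

start_fc = nn.Linear(768, 1)   # start-position head
end_fc = nn.Linear(768, 1)     # end-position head: same structure, own parameters

def predict_spans(h_v, threshold=0.5):
    # h_v: (seq_len, hidden) final text vector representation H_v
    s = torch.sigmoid(start_fc(h_v)).squeeze(-1)  # s_i per character
    e = torch.sigmoid(end_fc(h_v)).squeeze(-1)    # e_i per character
    starts = (s > threshold).nonzero(as_tuple=True)[0].tolist()
    ends = (e > threshold).nonzero(as_tuple=True)[0].tolist()
    spans = []
    for i in starts:
        later = [j for j in ends if j >= i]
        if later:                      # pair each start with the nearest
            spans.append((i, min(later)))  # end at or after it (assumed rule)
    return spans
```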
The attribute extraction task is converted into a span-extraction machine reading comprehension task. The model uses BERT-Bi-LSTM as its encoding module and encodes the input text and question separately; the head and tail entities in the triple are replaced with mask labels so that the labels are not exposed when context information is used, and the structured information serves as the question, enhancing the generalization capability of the model. The bidirectional long short-term memory layer integrates forward and backward information of the text, so that the model fully captures the text's temporal and semantic features. The word boundary feature enhancement method helps the model capture the boundary features of attribute values: complete lexical information is added at the start and end positions of each word based on the word segmentation result, and a multi-head attention mechanism then lets the word segmentation vectors interact with the global features of the text to fuse in the lexical information. The word boundary features strengthen the model's judgment of the head and tail positions of attribute values, deepen its grasp of attribute value boundaries, help it understand sentence structure, and help it recognize more out-of-vocabulary words. The invention uses word boundary information and text attribute features to alleviate the difficulty of exploiting no-answer data in machine reading comprehension models and the difficulty of extracting out-of-vocabulary words in the attribute extraction task, thereby effectively improving entity attribute extraction.
Based on the above embodiments, this embodiment further details the word enhancement module of S104-S105, specifically as follows:

The word segmentation vectors are input into Bi-LSTM and normalized to obtain the word segmentation vector representation $V$ containing word temporal features. Then, based on the absolute position indexes of each word's head and tail labels, the word-information vector representation for word boundary feature enhancement is added at the corresponding positions of the global feature vector.

Taking FIG. 2 as an example, the input text is "beige sweater". The text is first input and encoded by the BERT-Bi-LSTM module. Meanwhile, the text is segmented with LTP, giving the segmentation result {"beige", "sweater"} and the corresponding word segmentation vector representation $V = \{v_1, v_2\}$. Then, to facilitate determining the actual positions of the lexical boundaries in the text, the absolute position representation of each word is derived from the segmentation result as $P[a_i, t_i] = \{p_1[1, 2], p_2[3, 5]\}$. Finally, complete word information is added to the corresponding text feature vectors based on the absolute positions of the word boundaries, giving the final text vector representation $H_v$.
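The enhancement step for this example might be sketched as follows; the 1-based index convention follows the example above, and the function name is an illustrative assumption.

```python
import torch

def enhance_boundaries(h, word_vecs, positions):
    """Add each word's vector at its head and tail character positions.

    h:         (seq_len, hidden) question-aware text representation H
    word_vecs: (num_words, hidden) normalized Bi-LSTM word vectors V
    positions: list of (head, tail) 1-based absolute indexes, e.g. [(1, 2), (3, 5)]
    """
    h_v = h.clone()
    for v, (a, t) in zip(word_vecs, positions):
        h_v[a - 1] += v   # head boundary of the word
        h_v[t - 1] += v   # tail boundary of the word
    return h_v
```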
Based on the above embodiments, this embodiment further details the training process of the model of the invention, specifically as follows:

Training the attribute extraction model includes an attribute value boundary prediction task:

cleaning the corpus and constructing a corresponding training set;

training the model with the training set until the loss function converges, the loss function including the loss $loss_s$ of each character being a start position and the loss $loss_e$ of each character being an end position:

$$loss_s = -\sum_i \left[\hat{y}_i^s \log s_i + (1 - \hat{y}_i^s)\log(1 - s_i)\right]$$

$$loss_e = -\sum_i \left[\hat{y}_i^e \log e_i + (1 - \hat{y}_i^e)\log(1 - e_i)\right]$$

where $\hat{y}_i^s$ and $\hat{y}_i^e$ are the boundary representations of the true attribute values.
Since the prior art fails to combine external knowledge to enhance the model's understanding of attribute types, training the attribute extraction model also includes a text attribute type classification task:

taking the CLS token outputs $h_{cls}^s$ and $h_{cls}^q$ of the BERT model for the texts and questions in the training set as the text feature representation and the attribute type feature representation;

interacting $h_{cls}^s$ and $h_{cls}^q$ through a multi-head attention mechanism to obtain the comprehensive classification feature $h_{Att}$;

using a classifier, trained on the comprehensive classification feature, to judge whether the attribute value to be extracted related to the attribute type to be extracted exists in the text, so that the model pays more attention to the attribute values in the text related to the attribute to be extracted, with the loss function:

$$loss_{class} = -\sum_j y_j \log P_j$$

where $y_j$ denotes the true class value and $P_j$ denotes the predicted value for the $j$-th attribute type.

The attribute value boundary prediction task and the text attribute type classification task are jointly trained, with total loss $Loss = loss_{class} + loss_s + loss_e$.
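A sketch of this joint objective, assuming binary cross-entropy for the two boundary heads (consistent with the sigmoid outputs above) and standard cross-entropy for the attribute type classifier; the function name and argument layout are illustrative.

```python
import torch.nn.functional as F

def joint_loss(s, e, y_s, y_e, class_logits, y_class):
    # Boundary losses: each character is judged independently as start / end.
    loss_s = F.binary_cross_entropy(s, y_s)
    loss_e = F.binary_cross_entropy(e, y_e)
    # Auxiliary task: does the text contain a value for this attribute type?
    loss_class = F.cross_entropy(class_logits, y_class)
    return loss_class + loss_s + loss_e   # Loss = loss_class + loss_s + loss_e
```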
The model (shown in FIG. 3) is evaluated on a test set; compared with other baseline models, the word boundary feature enhancement method achieves the best results, and the generalization capability of the model is clearly improved.
The invention converts the attribute extraction task into a span-extraction machine reading comprehension task and adopts a multi-task model that jointly trains attribute extraction and text attribute judgment. The model uses BERT-Bi-LSTM as its encoding module, encodes the input text and question separately, and uses structured information as the question to enhance the generalization capability of the model. A word boundary feature enhancement method then helps the model capture the boundary features of attribute values, and word features are merged into the global vector features through a multi-head attention mechanism. Meanwhile, a text feature interaction method is designed to judge whether the attribute value corresponding to a question exists in the text, serving as an auxiliary task jointly trained with the attribute value boundary prediction task. On one hand, the word boundary features strengthen the model's judgment of the head and tail positions of attribute values, deepen its grasp of attribute value boundaries, and help it recognize more out-of-vocabulary words; on the other hand, the auxiliary text-attribute-feature perception task further improves the model's sensitivity to attribute types, alleviates its insufficient understanding of attribute types, and lets it pay more attention to the attribute values in the text related to the attribute to be extracted. Together, these measures improve the overall performance of the attribute extraction system.
Referring to FIG. 4, FIG. 4 is a block diagram of an attribute extraction apparatus according to an embodiment of the present invention; the apparatus may specifically include:

an input module 100, configured to input the preprocessed question and text into a pre-trained attribute extraction model, wherein the question is the triple with its head and tail entities replaced by MASK tokens, and the triple is the structured information;

an encoding module 200, configured to compute the question global vector representation and the first text global vector representation with a BERT model, and encode the first text global vector representation through the bidirectional long short-term memory layer Bi-LSTM to obtain the second text global vector representation;

an interaction module 300, configured to interact the second text global vector representation with the question global vector representation through a multi-head attention mechanism to obtain the text global vector representation with the question structured-information generalization features;

a word segmentation module 400, configured to input the text into an automatic word segmentation tool to obtain the word segmentation result and the word segmentation vector representation of the text;

a word boundary enhancement module 500, configured to add the word segmentation vector representation at the corresponding positions of the text global vector representation with the question structured-information generalization features, according to the absolute position indexes of each word's head and tail labels in the word segmentation result, to obtain the final text vector representation;

and an extraction module 600, configured to predict the attribute value boundary to be extracted in the final text vector representation to obtain the target attribute value.
The attribute extraction apparatus of this embodiment is used to implement the foregoing attribute extraction method; therefore, specific implementations of the attribute extraction apparatus can be found in the embodiments of the attribute extraction method above. For example, the input module 100, the encoding module 200, the interaction module 300, the word segmentation module 400, the word boundary enhancement module 500, and the extraction module 600 respectively implement steps S101, S102, S103, S104, S105, and S106 of the attribute extraction method, so their specific implementations refer to the descriptions of the corresponding embodiments and are not repeated here.
The specific embodiment of the present invention further provides an attribute extraction device, including: a memory for storing a computer program; and the processor is used for realizing the steps of the attribute extraction method when the computer program is executed.
The specific embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps of the above-mentioned attribute extraction method.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaust all embodiments here, and obvious variations or modifications derived therefrom remain within the protection scope of the invention.

Claims (10)

1. An attribute extraction method, comprising:

inputting a preprocessed question and text into a pre-trained attribute extraction model, wherein the question is the triple with its head and tail entities replaced by MASK tokens, namely the structured information;

computing a question global vector representation and a first text global vector representation with a BERT model, and encoding the first text global vector representation through a bidirectional long short-term memory layer Bi-LSTM to obtain a second text global vector representation;

interacting the second text global vector representation with the question global vector representation through a multi-head attention mechanism to obtain a text global vector representation with the question structured-information generalization features;

inputting the text into an automatic word segmentation tool to obtain a word segmentation result and a word segmentation vector representation of the text;

adding the word segmentation vector representation at the corresponding positions of the text global vector representation with the question structured-information generalization features, according to the absolute position indexes of each word's head and tail labels in the word segmentation result, to obtain a final text vector representation;

and predicting the attribute value boundary to be extracted in the final text vector representation to obtain a target attribute value.
2. The attribute extraction method according to claim 1, wherein computing the question global vector representation and the first text global vector representation with the BERT model, and encoding the first text global vector representation through the bidirectional long short-term memory layer Bi-LSTM to obtain the second text global vector representation, comprises:

segmenting the question Q and the text S into words, where each word is represented by a token word vector $TE(w_i)$, a segment word vector $SE(w_i)$ distinguishing the two different sentences, and a position word vector $PE(w_i)$, to obtain the vector representations of the question and the text;

inputting the vector representations of the question and the text into the BERT model to obtain the encoded question global vector representation $X^q = \{x_1^q, x_2^q, \dots, x_n^q\}$ and the first text global vector representation $X^s = \{x_1^s, x_2^s, \dots, x_m^s\}$, where $x_i^q$ is the vector representation of each character in the question after BERT encoding and $x_i^s$ is the vector representation of each character in the text after BERT encoding;

encoding the first text global vector representation $X^s$ with the bidirectional long short-term memory layer Bi-LSTM to obtain the second text global vector representation $O = \{o_1, o_2, \dots, o_m\}$, where $o_i$ is the vector representation of each character in the text after Bi-LSTM encoding.
3. The attribute extraction method according to claim 2, wherein encoding the first text global vector representation $X^s$ with the bidirectional long short-term memory layer Bi-LSTM to obtain the second text global vector representation $O = \{o_1, o_2, \dots, o_m\}$ comprises:

encoding the first text global vector $X^s$ to obtain the encoded second text global vector representation $O$, where the hidden state $o_i$ at each time step $i$ is obtained by concatenating the hidden state $\overrightarrow{h_i}$ of the forward LSTM and the hidden state $\overleftarrow{h_i}$ of the backward LSTM, with the following calculation formulas:

$$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(x_i^s, \overrightarrow{h_{i-1}})$$

$$\overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(x_i^s, \overleftarrow{h_{i+1}})$$

$$o_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$$
4. The attribute extraction method according to claim 1, wherein adding the word segmentation vector representation at the corresponding positions of the text global vector representation with the question structured-information generalization features, according to the absolute position indexes of each word's head and tail labels in the word segmentation result, to obtain the final text vector representation, comprises:

representing the word positions in the text word segmentation result as $P[a_i, t_i] = \{p_1[a_1, t_1], p_2[a_2, t_2], \dots, p_n[a_n, t_n]\}$, where $a_i$ and $t_i$ respectively denote the absolute position indexes of the head and tail labels of each word in the text, and $p_n$ denotes the $n$-th word;

adding the word segmentation vector representation $V$, which contains word temporal features after Bi-LSTM encoding and normalization, at the corresponding positions of the text global vector representation $H$ with the question structured-information generalization features, to obtain the final text vector representation $H_v$.
5. The attribute extraction method according to claim 4, wherein predicting the attribute value boundary to be extracted in the final text vector representation to obtain the target attribute value comprises:

using two linear layers to respectively predict the probability of each character in the final text vector being the start position $s$ and the end position $e$:

$$s_i = \mathrm{sigmoid}(\mathrm{FFN}(H_v))$$

$$e_i = \mathrm{sigmoid}(\mathrm{FFN}(H_v))$$

where $s_i$ denotes the probability of the $i$-th character of the text being the start position of the attribute value, and $e_i$ denotes the probability of the $i$-th character of the text being the end position of the attribute value;

and taking the start position and the corresponding end position as the coordinates of the target attribute value.
6. The attribute extraction method according to claim 5, wherein the training process of the attribute extraction model includes an attribute value boundary prediction task, whose specific steps are:

constructing a corresponding training set;

training the model with the training set until the loss function converges, the loss function including the loss $loss_s$ of each character being a start position and the loss $loss_e$ of each character being an end position:

$$loss_s = -\sum_i \left[\hat{y}_i^s \log s_i + (1 - \hat{y}_i^s)\log(1 - s_i)\right]$$

$$loss_e = -\sum_i \left[\hat{y}_i^e \log e_i + (1 - \hat{y}_i^e)\log(1 - e_i)\right]$$

where $\hat{y}_i^s$ and $\hat{y}_i^e$ are the boundary representations of the true attribute values.
7. The attribute extraction method according to claim 6, wherein the training process of the attribute extraction model includes a text attribute type classification task, whose specific steps are:

taking the CLS token outputs $h_{cls}^s$ and $h_{cls}^q$ of the BERT model for the texts and questions in the training set as the text feature representation and the attribute type feature representation;

interacting $h_{cls}^s$ and $h_{cls}^q$ through a multi-head attention mechanism to obtain the comprehensive classification feature $h_{Att}$;

and using a classifier, trained on the comprehensive classification feature, to judge whether the attribute value to be extracted related to the attribute type to be extracted exists in the text, so that the model pays more attention to the attribute values in the text related to the attribute to be extracted, with the loss function:

$$loss_{class} = -\sum_j y_j \log P_j$$

where $y_j$ denotes the true class value and $P_j$ denotes the predicted value for the $j$-th attribute type.
8. An attribute extraction apparatus, comprising:

an input module, configured to input a preprocessed question and text into a pre-trained attribute extraction model, wherein the question is the triple with its head and tail entities replaced by MASK tokens, namely the structured information;

an encoding module, configured to compute a question global vector representation and a first text global vector representation with a BERT model, and encode the first text global vector representation through a bidirectional long short-term memory layer Bi-LSTM to obtain a second text global vector representation;

an interaction module, configured to interact the second text global vector representation with the question global vector representation through a multi-head attention mechanism to obtain a text global vector representation with the question structured-information generalization features;

a word segmentation module, configured to input the text into an automatic word segmentation tool to obtain a word segmentation result and a word segmentation vector representation of the text;

a word boundary enhancement module, configured to add the word segmentation vector representation at the corresponding positions of the text global vector representation with the question structured-information generalization features, according to the absolute position indexes of each word's head and tail labels in the word segmentation result, to obtain a final text vector representation;

and an extraction module, configured to predict the attribute value boundary to be extracted in the final text vector representation to obtain a target attribute value.
9. An attribute extraction device, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the attribute extraction method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the attribute extraction method according to any one of claims 1 to 7.
CN202210458635.0A 2022-04-15 2022-04-15 Attribute extraction method and device and storage medium Pending CN114817564A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210458635.0A CN114817564A (en) 2022-04-15 2022-04-15 Attribute extraction method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210458635.0A CN114817564A (en) 2022-04-15 2022-04-15 Attribute extraction method and device and storage medium

Publications (1)

Publication Number Publication Date
CN114817564A true CN114817564A (en) 2022-07-29

Family

ID=82509632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210458635.0A Pending CN114817564A (en) 2022-04-15 2022-04-15 Attribute extraction method and device and storage medium

Country Status (1)

Country Link
CN (1) CN114817564A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116245078A (en) * 2022-11-30 2023-06-09 荣耀终端有限公司 Structured information extraction method and electronic equipment
CN116756624A (en) * 2023-08-17 2023-09-15 中国民用航空飞行学院 Text classification method for civil aviation supervision item inspection record processing
CN116756624B (en) * 2023-08-17 2023-12-12 中国民用航空飞行学院 Text classification method for civil aviation supervision item inspection record processing


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination