CN115688752A - Knowledge extraction method based on multi-semantic features - Google Patents

Knowledge extraction method based on multi-semantic features

Info

Publication number
CN115688752A
CN115688752A (application CN202211131763.0A)
Authority
CN
China
Prior art keywords
semantic
entity
sentence
word
lstm
Prior art date
Legal status
Pending
Application number
CN202211131763.0A
Other languages
Chinese (zh)
Inventor
孔纯熠
高发荣
张启忠
Current Assignee
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202211131763.0A
Publication of CN115688752A

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a knowledge extraction method based on multi-semantic features, comprising the following steps: step one, semantic vector representation; step two, feature encoding; step three, entity recognition; and step four, relation classification. Compared with the prior art, the invention provides a new relation-triple extraction method. The method first obtains word vector representations from a pre-trained language model, then uses a Bi-LSTM to encode character-level features, and encodes contextual semantic information with a multi-head self-attention mechanism to capture the internal structure and long-distance dependencies of a sentence. Semantic features from these different levels are then concatenated into an efficient semantic representation, providing more accurate feature vectors for entity recognition and relation classification and effectively improving the performance of relation-triple extraction.

Description

Knowledge extraction method based on multi-semantic features
Technical Field
The invention belongs to the field of knowledge extraction for knowledge graphs, and particularly relates to a knowledge extraction method based on multi-semantic features.
Background
A knowledge graph (KG) is a structured semantic knowledge base that stores information in symbolic form. The knowledge base consists of entity nodes and relationships, represented as triples (h, r, t). Most existing knowledge graphs are built from text data on the internet, and 60% to 70% of internet text data exists as unstructured electronic documents. Knowledge extraction is the task of automatically extracting useful information from unstructured text; it is also a key step in constructing a knowledge graph and directly affects the construction quality and downstream application effect of the graph. In recent years, research on knowledge extraction has intensified and the related techniques have gradually matured, making knowledge extraction an important technical foundation for intelligent applications such as sentiment analysis, intelligent question answering, personalized recommendation, and machine translation. How to automatically and accurately acquire knowledge from heterogeneous, massive data sources has become a hot research problem in both academia and industry.
Knowledge extraction refers to detecting entities in an information source and identifying the semantic relationships between them. Depending on the order in which the two subtasks, entity recognition and relation classification, are completed, knowledge extraction methods can be divided into pipeline methods and joint learning methods. Pipeline methods are easy to implement, and each component remains flexible. Joint learning methods can model the latent dependencies between the two tasks and alleviate problems such as loss of interaction and information redundancy. Early joint extraction methods were mainly based on traditional machine learning; they achieved some success but required manually constructed features.
Recently, with the development of deep neural networks, knowledge extraction methods have achieved state-of-the-art results. The long short-term memory network (LSTM) can protect and control the flow of information and effectively capture long-distance dependencies in sentences, so knowledge extraction models based on LSTM and its variants are widely used and have made notable breakthroughs. Nevertheless, potential context information is still lost, and the resulting low accuracy of feature vectors remains an important research focus.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to address the problem of potential context-information loss, a knowledge extraction method based on multi-semantic features is provided.
In order to achieve the purpose, the invention adopts the technical scheme that:
a knowledge extraction method based on multi-semantic features combines a bidirectional long short-term memory network (Bi-LSTM) with a self-attention mechanism. A rich, multi-level, multi-space semantic vector representation is obtained by concatenating the vocabulary vector and the context vector, and bidirectional semantic dependencies are then better captured by the Bi-LSTM. Entity recognition is then performed with a conditional random field (CRF), and the predicted entity labels are concatenated with the underlying feature vectors and fed into a sigmoid layer to assign one or more relation classes to each entity pair.
The knowledge extraction method based on the multi-semantic features comprises the following steps:
step one, semantic vector representation
Given a sentence w = w_1, ..., w_n, where w_i (i = 1, 2, ..., n) denotes the i-th word in the sentence, word-embedding preprocessing is performed with the GloVe pre-trained language model, converting each word w_i into a vector matrix w^{Glove} so that it can be processed by the neural network model.
Then, a character-level vector representation is built for each word. Each character of the word is taken as input, and a Bi-LSTM network captures the morphological features of the word, yielding a character-level vector matrix w^{char}. The word embedding and the character embedding are concatenated to obtain the vocabulary vector X.
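For illustration, the sketch below builds the vocabulary vector X by concatenating word embeddings with a character-level Bi-LSTM embedding, as just described. It assumes a TensorFlow/Keras setting (the embodiment mentions TensorFlow); only the 100-dimensional GloVe vectors and the 25-dimensional character LSTM come from the embodiment, and the remaining sizes and the random stand-ins for GloVe lookups and character ids are assumptions for demonstration.

```python
import tensorflow as tf

# Hypothetical sizes for demonstration; only glove_dim=100 and char_dim=25
# are taken from the embodiment below, the rest are assumptions.
n_words, max_word_len = 8, 12
char_vocab, char_dim, glove_dim = 80, 25, 100

char_ids = tf.random.uniform((1, n_words, max_word_len),
                             maxval=char_vocab, dtype=tf.int32)
w_glove = tf.random.normal((1, n_words, glove_dim))    # stand-in for GloVe lookups

# Embed characters, then run a Bi-LSTM over each word's characters; the two
# final states are concatenated into w_char (2 * 25 = 50 dimensions per word).
char_emb = tf.keras.layers.Embedding(char_vocab, char_dim)(char_ids)
flat = tf.reshape(char_emb, (n_words, max_word_len, char_dim))
w_char = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(char_dim))(flat)
w_char = tf.reshape(w_char, (1, n_words, 2 * char_dim))

X = tf.concat([w_glove, w_char], axis=-1)              # vocabulary vector X
print(X.shape)                                         # (1, 8, 150)
```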
Next, the present invention obtains context embeddings by mapping the context information into multiple semantic spaces through a multi-head self-attention mechanism. The vocabulary-vector matrix X is transformed to obtain a query vector Q ∈ R^{n×d} and a pair of key-value vectors K ∈ R^{n×d}, V ∈ R^{n×d}; the scaled dot-product attention is then calculated as follows:
Attention(Q, K, V) = softmax(QK^T / √d_k) V   (1)

where d_k denotes the dimension of the keys.
Multi-head attention linearly projects the queries, keys, and values h times with differently initialized matrices, and the attention mechanism is executed in parallel for each head. For the i-th head, the projection matrices for the query, key, and value are W_i^Q ∈ R^{d×(d/h)}, W_i^K ∈ R^{d×(d/h)}, and W_i^V ∈ R^{d×(d/h)}. The scaled dot-product attention is then calculated:
Head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)   (2)
The output vectors of all h parallel heads are concatenated together. Finally, the mixed semantic representation is output:
X_context = Concat(Head_1, Head_2, ..., Head_h) · W^A   (3)
where X_context ∈ R^{n×d}, and W^A ∈ R^{d×d} is a weight matrix for the linear transformation.
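The attention computation of equations (1)-(3) can be sketched in plain NumPy as follows; the per-head scaling by √(d/h) follows the standard scaled dot-product convention, and the randomly initialized matrices are stand-ins for the learned parameters W_i^Q, W_i^K, W_i^V, and W^A.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, h, rng):
    """Sketch of Eqs. (1)-(3): X is (n, d); h heads of width d/h each."""
    n, d = X.shape
    dh = d // h
    heads = []
    for _ in range(h):
        # Random stand-ins for the learned projections W_i^Q, W_i^K, W_i^V.
        WQ, WK, WV = (rng.standard_normal((d, dh)) for _ in range(3))
        Q, K, V = X @ WQ, X @ WK, X @ WV
        A = softmax(Q @ K.T / np.sqrt(dh))      # Eq. (1), scaled dot product
        heads.append(A @ V)                     # Eq. (2): Head_i
    WA = rng.standard_normal((d, d))            # stand-in for W^A
    return np.concatenate(heads, axis=-1) @ WA  # Eq. (3): X_context, (n, d)

rng = np.random.default_rng(0)
X_context = multi_head_self_attention(rng.standard_normal((8, 64)), h=4, rng=rng)
print(X_context.shape)                          # (8, 64)
```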
Finally, in order to mine the rich semantic information of the sentence, the vocabulary vector and the semantic vector containing the context information are concatenated to obtain the final sentence-vector representation, which serves as the input to the next step.
Step two, feature coding
An LSTM unit consists of three multiplicative gates: the forget gate decides which information to discard, the input gate determines which information to update, and the output gate outputs the updated state. The LSTM protects and controls the flow of information mainly through these three gates. The present invention obtains both past and future information through the LSTM's bidirectional extension, the Bi-LSTM.
h_i^f and h_i^b denote the outputs of the forward and backward LSTMs at time i, respectively, and the overall output of the Bi-LSTM at time i is their concatenation:

h_i = [h_i^f ; h_i^b]   (4)
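A minimal sketch of this encoding step with a Keras bidirectional LSTM follows; the hidden size and input shapes are placeholders rather than values prescribed by the invention.

```python
import tensorflow as tf

# Stand-in for the concatenated sentence vectors [X; X_context]; the hidden
# size (64) and shapes are illustrative assumptions.
sentence_vec = tf.random.normal((1, 8, 150))
H = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64, return_sequences=True))(sentence_vec)
print(H.shape)   # (1, 8, 128): each h_i = [h_i^f ; h_i^b]
```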
step three, entity recognition
The present invention formulates the entity recognition task as a sequence tagging problem. An entity generally consists of several consecutive words in a sentence, so the BIO tagging scheme is adopted, i.e., B, I, and O represent the beginning, inside, and outside of an entity, respectively, and a label is assigned to each word. The most likely entity label for each word is first computed by the CRF. The label score is calculated as:
s^{(e)}(h_i) = V^{(e)} f(W^{(e)} h_i + b^{(e)})   (5)
where f(·) is an element-wise activation function, V^{(e)} ∈ R^{p×l}, W^{(e)} ∈ R^{l×2d}, and b^{(e)} ∈ R^l; d is the hidden size of the LSTM, p is the number of entity types, and l is the layer width.
The invention then applies a linear-chain CRF to take the order between tags into account. For the input sentence w, the score of a candidate tag sequence y = (y_1, ..., y_n) is expressed as:

s(w, y) = Σ_{i=1}^{n} s^{(e)}(h_i)[y_i] + Σ_{i=1}^{n-1} T_{y_i, y_{i+1}}   (6)

where s^{(e)}(h_i)[y_i] denotes the tag score when w_i is tagged y_i, and T_{y_i, y_{i+1}} denotes the transition score from y_i to y_{i+1}. Finally, the probability that the sentence w is tagged with the sequence y is computed as:

Pr(y | w) = exp(s(w, y)) / Σ_{y'} exp(s(w, y'))   (7)
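To make equations (5)-(7) concrete, the sketch below scores a candidate tag sequence and normalizes it over all possible sequences by brute-force enumeration; a practical CRF implementation would use the forward algorithm instead, and the emission and transition scores here are random stand-ins.

```python
import numpy as np
from itertools import product

def crf_sequence_score(S, T, y):
    """Eq. (6): per-word tag scores S[i, y_i] (from Eq. 5) plus transition
    scores T[y_i, y_{i+1}] for a candidate tag sequence y."""
    return sum(S[i, t] for i, t in enumerate(y)) + \
           sum(T[y[i], y[i + 1]] for i in range(len(y) - 1))

def crf_probability(S, T, y):
    """Eq. (7): normalize the sequence score over all tag sequences.
    Brute-force enumeration; a real CRF uses the forward algorithm."""
    n, k = S.shape
    Z = sum(np.exp(crf_sequence_score(S, T, yp))
            for yp in product(range(k), repeat=n))
    return np.exp(crf_sequence_score(S, T, y)) / Z

rng = np.random.default_rng(0)
S = rng.standard_normal((4, 3))     # toy emission scores: 4 words, 3 tags
T = rng.standard_normal((3, 3))     # toy transition scores between tags
print(crf_probability(S, T, (0, 1, 1, 2)))
```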
since the prediction of the relationship classification depends to some extent on the result of the entity identification, the obtained entity information is taken as an embedded vector g i Input to the next step, i.e. z i =[h i ;g i ]。
Step four, relation classification
The invention regards the relation extraction task as a multi-head selection problem, which can effectively identify all relation triples in a sentence and thus extract overlapping relations. In this formulation, any relation type may hold between two entities, and the semantic relation between each entity pair is predicted independently.
A sentence w and a set of relation labels R are used as input in order to identify the relation triples in the sentence. Given a relation label r_k, the score between two entities w_i and w_j is calculated:
s^{(r)}(w_j, w_i, r_k) = V^{(r)} f(U^{(r)} z_j + W^{(r)} z_i + b^{(r)})   (8)
where l, V^{(r)}, and d denote, respectively, the number of hidden units used for the relation classification task, the weight matrix, and the hidden size of the LSTM; f(·) denotes the activation function. The probability that relation type r_k holds between words w_j and w_i is defined as:
Pr(head = w_j, label = r_k | w_i) = σ(s^{(r)}(w_j, w_i, r_k))   (9)
where σ is the sigmoid function. The sigmoid assumes that all relations are independent of each other and does not constrain the probabilities of all relations to sum to 1. When the probability is greater than 0.5, a relation between the two entities is considered to exist.
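Equations (8) and (9) can be sketched as follows; tanh stands in for the unspecified activation f(·), and all weights and dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def relation_probs(z_j, z_i, U, W, V, b):
    """Eqs. (8)-(9): score every relation r_k between head candidate w_j and
    word w_i, then squash each score independently with a sigmoid."""
    return sigmoid(V @ np.tanh(U @ z_j + W @ z_i + b))  # tanh stands in for f

rng = np.random.default_rng(0)
d, l, n_rel = 160, 64, 5       # |z_i| = |[h_i; g_i]|, layer width, |R|
z_i, z_j = rng.standard_normal(d), rng.standard_normal(d)
U, W = rng.standard_normal((l, d)), rng.standard_normal((l, d))
V, b = rng.standard_normal((n_rel, l)), rng.standard_normal(l)

probs = relation_probs(z_j, z_i, U, W, V, b)
print(np.nonzero(probs > 0.5)[0])  # relations predicted between w_i and w_j
```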
Compared with the prior art, the method has the following advantages:
most of the existing knowledge extraction methods consider the characteristics of a vocabulary level and ignore potential contextual semantic information, but the performance of a model directly depends on the accuracy of the obtained characteristics. Aiming at the problem, the invention provides a new relation triple extraction method. The method comprises the steps of firstly obtaining word vector representation through a pre-training language model, then utilizing Bi-LSTM to carry out feature coding on character-level features, and coding context semantic information through a multi-head self-attention mechanism to obtain the internal structure and long-distance dependency relationship of a sentence. And then, the semantic features of different levels are spliced to obtain efficient semantic representation, more accurate feature vectors are provided for entity identification and relationship classification, and the performance of extracting relationship triples is effectively improved.
Drawings
FIG. 1 is a flow chart of extracting relational triples based on multi-semantic feature vectors according to the present invention.
FIG. 2 is a structural diagram of the character-vector embedding of the present invention.
Fig. 3 is a diagram of the self-attention mechanism of the present invention.
FIG. 4 is a diagram of a sentence semantic representation structure according to the present invention.
Detailed Description
The embodiments of the present invention are described in detail below with reference to the accompanying drawings. The embodiment is implemented on the premise of the technical scheme of the invention and provides a detailed implementation and a specific operating process.
As shown in fig. 1, the present invention provides a knowledge extraction method based on multi-semantic features, which specifically includes the following steps:
the method comprises the following steps: and (4) preprocessing data. The CoNLL04 dataset consists of 1,441 sentences from the Newswer article, annotated as four entity types (Location, organization, peoples, other) and five relationship types (Work-for, kill, organization-Based-In, lives-In, located-In). These contents are randomly divided into a training set, a validation set, and a test set.
Step two: semantic vector representation. First, a sentence is input and the original corpus is preprocessed with GloVe word embeddings, converting the sentence into a 100-dimensional vector matrix w^{Glove}.
Next, each character of a word is fed as input to a Bi-LSTM network, as shown in FIG. 2. The hidden dimension of the LSTM is set to 25, and the two final states are concatenated to obtain the character-level vector representation w^{char}. The word vector and the character vector are then concatenated to obtain the vocabulary vector representation.
FIG. 3 shows the structure of the self-attention mechanism. The vocabulary vector X is input into the multi-head attention mechanism, which maps the context information into multiple semantic spaces to obtain the context embedding X_context.
To mine the rich semantic information of sentences, the output vectors of the different modules are concatenated to obtain the final sentence-vector representation; the specific structure is shown in FIG. 4. The context embedding, obtained by feeding the output of the vocabulary-embedding module into the multi-head self-attention mechanism, is concatenated with the vocabulary embedding at a connection layer to obtain a vector representation carrying multi-level semantic information.
Step three: the sentence vectors obtained in step two are input into the Bi-LSTM for feature encoding, yielding a sentence representation h_i that captures long-distance dependencies.
Step four: the most likely entity label for each word is computed by the linear-chain CRF. For example, "Grande Isle" is labeled B-Location and I-Location, respectively. If a word does not belong to any entity, it is labeled O.
Step five: the relation type between two entities is computed by equations (8) and (9) above.
To verify the effectiveness of the above method, the inventive method was evaluated on the CoNLL04 dataset and compared with baseline results. The model was implemented with Python and the TensorFlow machine-learning library. To avoid overfitting, different dropout rates are used at the input layer and the hidden layer, and the model is optimized with the Adam optimizer. Training is stopped when the results on the validation set do not improve for 30 consecutive epochs. Detailed hyper-parameter settings are shown in Table 1 below.
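A hedged sketch of this training configuration follows; the model body, dropout rates, data, and loss are placeholders — only the Adam optimizer and the 30-epoch early-stopping patience come from the description above.

```python
import tensorflow as tf

# Placeholder model body; dropout rates and sizes are assumptions.
model = tf.keras.Sequential([
    tf.keras.layers.Dropout(0.1),        # illustrative input-layer dropout rate
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64, return_sequences=True, dropout=0.3)),
    tf.keras.layers.Dense(10),
])
model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")  # toy loss

# Stop when the validation result fails to improve for 30 consecutive epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=30)

x = tf.random.normal((32, 8, 150)).numpy()
y = tf.random.normal((32, 8, 10)).numpy()
model.fit(x, y, validation_split=0.2, epochs=50, callbacks=[early_stop], verbose=0)
```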
TABLE 1 Hyper-parameter settings (table reproduced as an image in the original publication)
Precision (P), recall (R), and F1 score are used to evaluate the predicted results on the dataset. An entity is judged correct if both its boundary and its type are correct; a relation is correct when both the relation type and the argument entities are correct.
P = TP / (TP + FP),   R = TP / (TP + FN),   F1 = 2PR / (P + R)
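For reference, here is a small helper computing these metrics from triple counts; the counts in the example are invented.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions of the three metrics used above."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Invented example: 90 correct triples extracted, 15 spurious, 20 missed.
print(precision_recall_f1(90, 15, 20))   # approximately (0.857, 0.818, 0.837)
```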
The experimental results on the CoNLL04 dataset are shown in Table 2. The results show that the method successfully shares association information between entities and relations and learns complex long-distance correlations. By exploiting multi-feature semantic information, the invention improves the performance of the knowledge-triple extraction task.
Table 2 Experimental results on the CoNLL04 dataset (table reproduced as an image in the original publication)
The above embodiments merely describe preferred embodiments of the present invention and do not limit its scope. Various modifications and improvements made by those skilled in the art to the technical solution of the present invention without departing from the spirit of the invention shall fall within the protection scope determined by the claims.

Claims (1)

1. A knowledge extraction method based on multi-semantic features, characterized in that the method comprises the following steps:
step one, semantic vector representation
Given a sentence w = w_1, ..., w_n, where w_i (i = 1, 2, ..., n) denotes the i-th word in the sentence, word-embedding preprocessing is performed with the GloVe pre-trained language model, and each word w_i is converted into a vector matrix w^{Glove} so that it can be processed by the neural network model;
then, performing character level vector representation on each word, taking each character of the word as input, capturing morphological characteristics of the word by utilizing a Bi-LSTM neural network to obtain a character level vector matrix w char Word embedding and character embedding are spliced to obtain a vocabulary vector X;
then, mapping the context information to a plurality of semantic spaces through a multi-head self-attention mechanism to obtain context embedding, and performing matrix transformation on the vocabulary vector X to obtain a query vector Q e R n*d And a pair of key-value vectors K ∈ R n*d ,V∈R n*d Then, the proportional dot product attention is calculated as follows:
Attention(Q, K, V) = softmax(QK^T / √d_k) V   (1)

where d_k denotes the dimension of the keys;
the multi-head attention uses different initialization matrixes to linearly calculate key values and hidden layers h times, the attention mechanism is executed in parallel at each moment, and for the ith head, inquiry, key values and encoder hidden layer values are related to a coefficient matrixIs described as W i Q ∈R n*d/h ,W i K ∈R n*d/h And W i V ∈R n*d/h . The scaled dot product attention is then calculated:
Head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)   (2)
the output vectors of all h parallel heads are concatenated together, and finally the mixed semantic representation is output:
X_context = Concat(Head_1, Head_2, ..., Head_h) · W^A   (3)
where X_context ∈ R^{n×d}, and W^A ∈ R^{d×d} is a weight matrix for the linear transformation;
finally, in order to mine the rich semantic information of the sentence, the vocabulary vector and the semantic vector containing the context information are concatenated to obtain the final sentence-vector representation, i.e., the input of the next step;
step two, feature coding
the LSTM unit consists of three multiplicative gates: a forget gate decides which information to discard, an input gate determines which information to update, and an output gate outputs the updated state; the LSTM protects and controls the flow of information mainly through these three gates, and past and future information is obtained through its bidirectional extension, the Bi-LSTM; h_i^f and h_i^b denote the outputs of the forward and backward LSTMs at time i, respectively, and the overall output of the Bi-LSTM at time i is their concatenation:

h_i = [h_i^f ; h_i^b]   (4)
step three, entity recognition
the entity recognition task is expressed as a sequence tagging problem; an entity generally consists of several consecutive words in a sentence, so the BIO tagging scheme is adopted, i.e., B, I, and O represent the beginning, inside, and outside of an entity, respectively, and a label is assigned to each word; the most likely entity label for each word is first computed by the CRF, and the label score is calculated:
s^{(e)}(h_i) = V^{(e)} f(W^{(e)} h_i + b^{(e)})   (5)
where f(·) is an element-wise activation function, V^{(e)} ∈ R^{p×l}, W^{(e)} ∈ R^{l×2d}, and b^{(e)} ∈ R^l; d is the hidden size of the LSTM, p is the number of entity types, and l is the layer width;
the invention then proceeds through a linear chain CRF, considering the order between the tags, for the input word w i The scoring sequence is expressed as:
Figure RE-FDA0004031283120000022
wherein,
Figure RE-FDA0004031283120000023
is shown when w i Mark y i Mark score of time, T represents y i To y i+1 And finally, calculating the mark of the sentence w as a tag sequence
Figure RE-FDA0004031283120000024
Probability of (c):
Figure RE-FDA0004031283120000025
since the prediction of the relationship classification depends to some extent on the result of the entity identification, the obtained entity information is taken as an embedded vector g i Input to the next step, i.e. z i =[h i ;g i ];
Step four, relation classification
the relation extraction task is regarded as a multi-head selection problem, which can effectively identify all relation triples in a sentence and thus extract overlapping relations; any relation type may hold between two entities, and the semantic relation between each entity pair is predicted independently;

a sentence w and a set of relation labels R are used as input in order to identify the relation triples in the sentence; given a relation label r_k, the score between two entities w_i and w_j is calculated:
s^{(r)}(w_j, w_i, r_k) = V^{(r)} f(U^{(r)} z_j + W^{(r)} z_i + b^{(r)})   (8)
where l, V^{(r)}, and d denote, respectively, the number of hidden units used for the relation classification task, the weight matrix, and the hidden size of the LSTM; f(·) denotes the activation function; the probability that relation type r_k holds between words w_j and w_i is defined as
Pr(head = w_j, label = r_k | w_i) = σ(s^{(r)}(w_j, w_i, r_k))   (9)
where σ is the sigmoid function; the sigmoid assumes that all relations are independent of each other and does not constrain the probabilities of all relations to sum to 1; when the probability is greater than 0.5, a relation between the two entities is considered to exist.
CN202211131763.0A 2022-09-16 2022-09-16 Knowledge extraction method based on multi-semantic features Pending CN115688752A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211131763.0A CN115688752A (en) 2022-09-16 2022-09-16 Knowledge extraction method based on multi-semantic features

Publications (1)

Publication Number Publication Date
CN115688752A true CN115688752A (en) 2023-02-03

Family

ID=85063245

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211131763.0A Pending CN115688752A (en) 2022-09-16 2022-09-16 Knowledge extraction method based on multi-semantic features

Country Status (1)

Country Link
CN (1) CN115688752A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116151241A (en) * 2023-04-19 2023-05-23 湖南马栏山视频先进技术研究院有限公司 Entity identification method and device
CN116629264A (en) * 2023-05-24 2023-08-22 成都信息工程大学 Relation extraction method based on multiple word embedding and multi-head self-attention mechanism
CN116629264B (en) * 2023-05-24 2024-01-23 成都信息工程大学 Relation extraction method based on multiple word embedding and multi-head self-attention mechanism
CN116595992A (en) * 2023-07-19 2023-08-15 江西师范大学 Single-step extraction method for terms and types of binary groups and model thereof
CN116595992B (en) * 2023-07-19 2023-09-19 江西师范大学 Single-step extraction method for terms and types of binary groups and model thereof
CN117744787A (en) * 2024-02-20 2024-03-22 中国电子科技集团公司第十研究所 Intelligent measurement method for first-order research rule knowledge rationality
CN117744787B (en) * 2024-02-20 2024-05-07 中国电子科技集团公司第十研究所 Intelligent measurement method for first-order research rule knowledge rationality

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination