CN115688752A - Knowledge extraction method based on multi-semantic features - Google Patents
- Publication number
- CN115688752A (application number CN202211131763.0A)
- Authority
- CN
- China
- Prior art keywords
- semantic
- entity
- sentence
- word
- lstm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a knowledge extraction method based on multi-semantic features, comprising the following steps: step one, semantic vector representation; step two, feature encoding; step three, entity recognition; step four, relation classification. Compared with the prior art, the invention provides a new relation-triple extraction method. The method first obtains word vector representations from a pre-trained language model, then encodes character-level features with a Bi-LSTM, and encodes contextual semantic information with a multi-head self-attention mechanism to capture the internal structure and long-distance dependencies of a sentence. Semantic features of different levels are then concatenated into an efficient semantic representation, providing more accurate feature vectors for entity recognition and relation classification and effectively improving the performance of relation-triple extraction.
Description
Technical Field
The invention belongs to the field of knowledge graph knowledge extraction research, and particularly relates to a knowledge extraction method based on multi-semantic features.
Background
A knowledge graph (KG) is a structured semantic knowledge base that stores information in symbolic form. The knowledge base consists of entity nodes and relations, represented as triples (h, r, t). Most existing knowledge graphs draw on text data from the internet, and 60% to 70% of internet text data exists as unstructured electronic documents. Knowledge extraction is the important task of automatically extracting effective information from unstructured text; it is also a key step in constructing a knowledge graph and directly affects the quality of the constructed graph and its subsequent applications. In recent years researchers have steadily intensified work on knowledge extraction, the related techniques have gradually improved, and knowledge extraction has become an important technical basis for intelligent applications such as sentiment analysis, intelligent question answering, personalized recommendation, and machine translation. How to automatically and accurately acquire knowledge from heterogeneous mass data sources has become a hot research problem in both academia and industry.
Knowledge extraction refers to detecting entities in an information source and identifying the semantic relations between them. Depending on the order in which the two subtasks, entity recognition and relation classification, are completed, knowledge extraction methods can be divided into pipeline methods and joint learning methods. Pipeline methods are easy to implement, and each component is relatively flexible. Joint learning methods can model the latent dependency between the two tasks and alleviate problems such as lost interaction and information redundancy. Early joint extraction methods were mainly based on traditional machine learning; they achieved some success but required manually constructed features.
Recently, with the development of deep neural networks, knowledge extraction methods have achieved a succession of state-of-the-art results. The long short-term memory (LSTM) network can protect and control the flow of information and effectively capture long-term dependencies in sentences, so knowledge extraction models based on LSTM and its variants are widely applied and have made notable breakthroughs. Nevertheless, for the problem of potential context information loss, the low accuracy of the obtained feature vectors remains an important research focus.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the problem of potential context information loss, a knowledge extraction method based on multi-semantic features is provided.
In order to achieve the purpose, the invention adopts the technical scheme that:
A knowledge extraction method based on multi-semantic features, combining a Bi-directional long short-term memory network (Bi-LSTM) with a self-attention mechanism. A rich, multi-level, multi-space semantic vector representation is obtained by concatenating the lexical vector with the context vector, after which the Bi-LSTM better captures bidirectional semantic dependencies. Entity recognition is then realized with a conditional random field (CRF), and the predicted entity labels, concatenated with the underlying feature vectors, are fed into a sigmoid layer to assign one or more relations to each entity.
The knowledge extraction method based on the multi-semantic features comprises the following steps:
step one, semantic vector representation
Given a sentence w = w_1, ..., w_n, where w_i (i = 1, 2, ..., n) denotes a word in the sentence, word-embedding preprocessing is performed with the GloVe pre-trained language model, converting each word w_i into a vector w^Glove so that it can be processed by the neural network model.
Then a character-level vector representation is computed for each word: each character of the word is taken as input, and a Bi-LSTM neural network captures the word's morphological features to obtain the character-level vector w^char. The word embedding and the character embedding are concatenated to obtain the lexical vector X.
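The word-plus-character concatenation above can be sketched as follows. This is a minimal NumPy sketch: the 100-dimensional GloVe size and the 2×25-dimensional character vector follow the embodiment, but the zero vectors merely stand in for real GloVe and char-Bi-LSTM outputs, which are not reproduced here.

```python
import numpy as np

def lexical_vector(w_glove, w_char):
    """Concatenate the GloVe word embedding with the character-level
    vector from the char Bi-LSTM to form the lexical vector X."""
    return np.concatenate([w_glove, w_char], axis=-1)

w_glove = np.zeros(100)  # 100-d GloVe vector (dimension from the embodiment)
w_char = np.zeros(50)    # 2 x 25 hidden units: forward + backward final states
X_w = lexical_vector(w_glove, w_char)  # 150-d lexical vector per word
```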
Next, the invention obtains context embeddings by mapping the context information into multiple semantic spaces through a multi-head self-attention mechanism. The lexical vector matrix X is linearly transformed into a query vector Q ∈ R^{n×d} and a key-value pair K ∈ R^{n×d}, V ∈ R^{n×d}; the scaled dot-product attention is computed as:

Attention(Q, K, V) = softmax(QK^T / √d) V    (1)

Multi-head attention projects the queries, keys and values linearly h times with different initialization matrices, and the attention mechanisms are executed in parallel. For the i-th head, the projection matrices for the query, key and value are W_i^Q ∈ R^{d×(d/h)}, W_i^K ∈ R^{d×(d/h)} and W_i^V ∈ R^{d×(d/h)}. The scaled dot-product attention of each head is then:

Head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (2)
The output vectors of all h parallel heads are concatenated, and the mixed semantic representation is output:

X_context = Concat(Head_1, Head_2, ..., Head_h) · W_A    (3)

where X_context ∈ R^{n×d} and W_A ∈ R^{d×d} is a weight matrix for the linear projection.
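Equations (1)-(3) can be illustrated with a minimal NumPy sketch. The random weights stand in for the learned projection matrices, and all shapes are illustrative assumptions; splitting the projected d-dimensional vectors into h contiguous slices is equivalent to applying per-head matrices W_i^Q, W_i^K, W_i^V of shape d×(d/h).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_Q, W_K, W_V, W_A, h):
    """Equations (1)-(3): per-head scaled dot-product attention,
    concatenation of the h heads, then a linear projection W_A."""
    n, d = X.shape
    dk = d // h
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V           # each (n, d)
    heads = []
    for i in range(h):
        q = Q[:, i * dk:(i + 1) * dk]             # head i's slice, (n, d/h)
        k = K[:, i * dk:(i + 1) * dk]
        v = V[:, i * dk:(i + 1) * dk]
        A = softmax(q @ k.T / np.sqrt(dk))        # scaled dot-product attention
        heads.append(A @ v)
    return np.concatenate(heads, axis=1) @ W_A    # X_context, (n, d)

rng = np.random.default_rng(0)
n, d, h = 5, 8, 2
X = rng.normal(size=(n, d))
Ws = [rng.normal(size=(d, d)) for _ in range(4)]  # W_Q, W_K, W_V, W_A
X_context = multi_head_self_attention(X, *Ws, h=h)
```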
Finally, to mine the rich semantic information of the sentence, the lexical vector and the context-aware semantic vector are concatenated to obtain the final sentence vector representation, which serves as the input to the next step.
Step two, feature coding
The LSTM unit consists of three multiplicative gates: the forget gate decides which information to discard, the input gate decides which information to update, and the output gate emits the updated state. LSTM protects and controls the information flow mainly through these three gates. The invention obtains past and future information through the Bi-directional extension Bi-LSTM. Let →h_i and ←h_i denote the outputs of the forward and backward LSTM at time i; the overall output of Bi-LSTM at time i is:

h_i = [→h_i ; ←h_i]    (4)
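The Bi-directional wiring of equation (4) can be sketched as follows. Note that a toy tanh recurrent cell stands in for the full gated LSTM (an assumption made to keep the sketch short); only the forward/backward concatenation is illustrated.

```python
import numpy as np

def rnn_scan(X, W, U, b):
    """Toy tanh recurrent cell standing in for an LSTM (assumption:
    the forget/input/output gating is omitted here)."""
    h = np.zeros(W.shape[0])
    out = []
    for x in X:
        h = np.tanh(W @ h + U @ x + b)
        out.append(h)
    return np.stack(out)

def bi_rnn(X, params_fwd, params_bwd):
    fwd = rnn_scan(X, *params_fwd)              # left-to-right pass
    bwd = rnn_scan(X[::-1], *params_bwd)[::-1]  # right-to-left pass, re-aligned
    return np.concatenate([fwd, bwd], axis=1)   # h_i = [fwd_i ; bwd_i], eq. (4)

rng = np.random.default_rng(1)
n, d_in, d_h = 4, 6, 3
X = rng.normal(size=(n, d_in))
pf = (rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_in)), np.zeros(d_h))
pb = (rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_in)), np.zeros(d_h))
H = bi_rnn(X, pf, pb)  # (n, 2 * d_h)
```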
step three, entity recognition
The invention formulates the entity recognition task as a sequence tagging problem. An entity generally consists of several consecutive words in a sentence, so the BIO tagging scheme is adopted: B, I and O respectively mark the beginning, inside and outside of an entity, and a tag is assigned to each word. The most likely entity tag for each word is first calculated via a CRF. The tag score is computed as:
s^(e)(h_i) = V^(e) f(W^(e) h_i + b^(e))    (5)

where f(·) is an element-wise activation function, V^(e) ∈ R^{p×l}, W^(e) ∈ R^{l×2d}, b^(e) ∈ R^l, d is the hidden size of the LSTM, p is the number of entity types, and l is the layer width.

The invention then applies a linear-chain CRF, which takes the order between tags into account. For an input sentence w with candidate tag sequence y = y_1, ..., y_n, the sequence score is expressed as:

s(w, y) = Σ_{i=1}^{n} s^(e)(h_i)[y_i] + Σ_{i=1}^{n-1} T[y_i, y_{i+1}]    (6)

where s^(e)(h_i)[y_i] denotes the tag score when w_i is tagged y_i, and T denotes the transition score from y_i to y_{i+1}. Finally, the probability that the sentence w is tagged with the sequence y is calculated as:

Pr(y | w) = exp(s(w, y)) / Σ_{y'} exp(s(w, y'))    (7)
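The scoring of equations (6) and (7) can be illustrated by brute-force enumeration over tag sequences. This is a sketch for tiny inputs with random scores; practical CRFs compute the normalizer with the forward algorithm instead of enumerating.

```python
import numpy as np
from itertools import product

def crf_sequence_score(emissions, T, y):
    """Equation (6): sum of per-word tag scores plus transition scores."""
    s = sum(emissions[i, y[i]] for i in range(len(y)))
    s += sum(T[y[i], y[i + 1]] for i in range(len(y) - 1))
    return s

def crf_log_prob(emissions, T, y):
    """Equation (7): softmax over all candidate tag sequences
    (brute-force; real CRFs use the forward algorithm)."""
    n, l = emissions.shape
    scores = [crf_sequence_score(emissions, T, seq)
              for seq in product(range(l), repeat=n)]
    Z = np.logaddexp.reduce(scores)               # log normalizer
    return crf_sequence_score(emissions, T, y) - Z

rng = np.random.default_rng(2)
emissions = rng.normal(size=(3, 3))  # 3 words, 3 tags (B, I, O)
T = rng.normal(size=(3, 3))          # transition scores between tags
logp = crf_log_prob(emissions, T, (0, 1, 2))
```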
since the prediction of the relationship classification depends to some extent on the result of the entity identification, the obtained entity information is taken as an embedded vector g i Input to the next step, i.e. z i =[h i ;g i ]。
Step four, relation classification
The invention treats the relation extraction task as a multi-head selection problem, which can effectively identify all relation triples in a sentence and thereby extract overlapping relations. In the invention, any relation type may hold between two entities, and each semantic relation is predicted independently.
A sentence w and a set of relation labels R are taken as input in order to identify the relation triples in the sentence. Given a relation label r_k, the score between two entities w_i and w_j is calculated:
s^(r)(w_j, w_i, r_k) = V^(r) f(U^(r) z_j + W^(r) z_i + b^(r))    (8)
where V^(r), U^(r), W^(r) and b^(r) are the weights and bias of the relation classification layer, d is the hidden size of the LSTM, and f(·) denotes the activation function. The probability that words w_j and w_i stand in relation type r_k is defined as
Pr(head = w_j, label = r_k | w_i) = σ(s^(r)(w_j, w_i, r_k))    (9)
where σ is the sigmoid function. The sigmoid assumes that all relations are independent of each other and does not require the probabilities of all relations to sum to 1. When a probability exceeds 0.5, the corresponding relation between the two entities is considered to exist.
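Equations (8) and (9) with the 0.5 decision rule can be sketched as follows. All weights are random placeholders, and tanh is assumed for f(·) (the text does not fix the activation); because each relation gets its own independent sigmoid, a word may take part in several triples, which is how overlapping relations are extracted.

```python
import numpy as np

def relation_probs(Z, U, W, V, b):
    """Equations (8)-(9): score every (head word j, word i, relation k)
    combination and squash each score with an independent sigmoid."""
    n = Z.shape[0]
    probs = np.zeros((n, n, V.shape[0]))
    for i in range(n):
        for j in range(n):
            s = V @ np.tanh(U @ Z[j] + W @ Z[i] + b)  # tanh as f(.) (assumption)
            probs[j, i] = 1.0 / (1.0 + np.exp(-s))    # sigmoid, eq. (9)
    return probs

rng = np.random.default_rng(3)
n, dz, dr, n_rel = 4, 5, 6, 3
Z = rng.normal(size=(n, dz))       # z_i = [h_i ; g_i]: Bi-LSTM state + label embedding
U, W = rng.normal(size=(dr, dz)), rng.normal(size=(dr, dz))
V, b = rng.normal(size=(n_rel, dr)), np.zeros(dr)
P = relation_probs(Z, U, W, V, b)
pred = P > 0.5                     # decision rule from the text
```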
Compared with the prior art, the method has the following advantages:
most of the existing knowledge extraction methods consider the characteristics of a vocabulary level and ignore potential contextual semantic information, but the performance of a model directly depends on the accuracy of the obtained characteristics. Aiming at the problem, the invention provides a new relation triple extraction method. The method comprises the steps of firstly obtaining word vector representation through a pre-training language model, then utilizing Bi-LSTM to carry out feature coding on character-level features, and coding context semantic information through a multi-head self-attention mechanism to obtain the internal structure and long-distance dependency relationship of a sentence. And then, the semantic features of different levels are spliced to obtain efficient semantic representation, more accurate feature vectors are provided for entity identification and relationship classification, and the performance of extracting relationship triples is effectively improved.
Drawings
FIG. 1 is a flow chart of extracting relational triples based on multi-semantic feature vectors according to the present invention.
FIG. 2 is a diagram illustrating a structure of embedding character vectors according to the present invention.
Fig. 3 is a diagram of the self-attention mechanism of the present invention.
FIG. 4 is a diagram of a sentence semantic representation structure according to the present invention.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the accompanying drawings: the embodiment is implemented on the premise of the technical scheme of the invention, and gives a detailed implementation scheme and a specific operation process.
As shown in fig. 1, the present invention provides a knowledge extraction method based on multi-semantic features, which specifically includes the following steps:
the method comprises the following steps: and (4) preprocessing data. The CoNLL04 dataset consists of 1,441 sentences from the Newswer article, annotated as four entity types (Location, organization, peoples, other) and five relationship types (Work-for, kill, organization-Based-In, lives-In, located-In). These contents are randomly divided into a training set, a validation set, and a test set.
Step two: and (4) semantic vector representation. Firstly, inputting a sentence, carrying out word embedding pretreatment on an original corpus by using Glove, and converting the sentence into a 100-dimensional vector matrixNext, each character of the word is used as input through a Bi-LSTM neural network, as shown in FIG. 2. Set the hidden dimension of the LSTM to 25, and then join the two final states together to get the character-level vector representation w char . And splicing the word vector and the character vector to obtain the vocabulary vector representation.
Fig. 3 shows the structure of the self-attention mechanism. The lexical vector X is input into the multi-head attention mechanism, which maps the context information into multiple semantic spaces to obtain the context embedding X_context.
To mine the rich semantic information of sentences, the output vectors of the different modules are concatenated to obtain the final sentence vector representation; the specific structure is shown in FIG. 4. The context embedding, obtained by feeding the lexical embedding into the multi-head self-attention mechanism, is concatenated with the lexical embedding at a connection layer, yielding a vector representation with multiple levels of semantic information.
Step three: the sentence vectors obtained in step two are input into the Bi-LSTM for feature encoding, yielding a sentence representation h_i that captures long-distance dependencies.
Step four: the most likely entity tag for each word is calculated by the linear-chain CRF. For example, "Grande Isle" is tagged B-Location and I-Location, respectively. If a word does not belong to any entity, it is tagged O.
Step five: the type of relationship between the two entities is calculated by the above equation (8) and equation (9).
To verify the effectiveness of the above method, the inventive method was evaluated on the CoNLL04 dataset and compared with baseline results. The model was implemented with Python and the TensorFlow machine learning library. To avoid overfitting, different dropout rates are used at the input and hidden layers, and the model is optimized with the Adam optimizer. Training stops when the results on the validation set fail to improve for 30 consecutive epochs. More detailed hyper-parameter settings are shown in Table 1 below.
TABLE 1 hyper-parameter settings
Precision (P), recall (R), and F1 score are used to evaluate the predicted results on the dataset. An entity is judged correct if both its boundary and its type are correct; a relation is correct when both the relation type and the argument entities are correct.
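The strict criterion above can be sketched as micro-averaged P/R/F1 over exact-match triples. The example triples below are hypothetical and only illustrate the computation; they are not drawn from the dataset.

```python
def prf1(pred, gold):
    """Micro precision/recall/F1 over predicted vs. gold triples:
    a predicted triple counts only if it matches a gold triple exactly."""
    tp = len(pred & gold)                     # exact-match true positives
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Hypothetical predicted and gold relation triples
pred = {("Smith", "Work-For", "ACME"), ("Smith", "Live-In", "Boston")}
gold = {("Smith", "Work-For", "ACME"), ("ACME", "OrgBased-In", "Boston")}
p, r, f1 = prf1(pred, gold)  # one of two predictions is correct
```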
The experimental results on the CoNLL04 dataset are shown in Table 2. The results show that the method successfully shares the associated information of entities and relations and learns complex long-distance correlations. By exploiting the importance of multi-feature semantic information, the invention improves the performance of the knowledge-triple extraction task.
Table 2 Experimental results on the CoNLL04 dataset
The above-described embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements of the technical solutions of the present invention can be made by those skilled in the art without departing from the spirit of the present invention, and the scope of the present invention is defined by the claims.
Claims (1)
1. A knowledge extraction method based on multi-semantic features is characterized in that: the knowledge extraction method based on the multi-semantic features comprises the following steps:
step one, semantic vector representation
given a sentence w = w_1, ..., w_n, where w_i (i = 1, 2, ..., n) denotes a word in the sentence, word-embedding preprocessing is performed with the GloVe pre-trained language model, and each word w_i is converted into a vector w^Glove so that it can be processed by the neural network model;
then, a character-level vector representation is computed for each word: each character of the word is taken as input, and a Bi-LSTM neural network captures the word's morphological features to obtain the character-level vector w^char; the word embedding and the character embedding are concatenated to obtain the lexical vector X;
then, context embeddings are obtained by mapping the context information into multiple semantic spaces through a multi-head self-attention mechanism; the lexical vector X is linearly transformed into a query vector Q ∈ R^{n×d} and a key-value pair K ∈ R^{n×d}, V ∈ R^{n×d}, and the scaled dot-product attention is computed as:

Attention(Q, K, V) = softmax(QK^T / √d) V    (1)

multi-head attention projects the queries, keys and values linearly h times with different initialization matrices, and the attention mechanisms are executed in parallel; for the i-th head, the projection matrices for the query, key and value are W_i^Q ∈ R^{d×(d/h)}, W_i^K ∈ R^{d×(d/h)} and W_i^V ∈ R^{d×(d/h)}; the scaled dot-product attention of each head is then:

Head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (2)
the output vectors of all h parallel heads are concatenated, and finally the mixed semantic representation is output:

X_context = Concat(Head_1, Head_2, ..., Head_h) · W_A    (3)

wherein X_context ∈ R^{n×d} and W_A ∈ R^{d×d} is a weight matrix for the linear projection;
finally, in order to mine the rich semantic information of the sentence, the vocabulary vector and the semantic vector containing the context information are spliced to obtain the final sentence vector representation, namely the input of the next step;
step two, feature coding
the LSTM unit consists of three multiplicative gates: the forget gate decides which information to discard, the input gate decides which information to update, and the output gate emits the updated state; LSTM protects and controls the information flow mainly through these three gates, and past and future information is obtained through the Bi-directional extension Bi-LSTM; →h_i and ←h_i denote the outputs of the forward and backward LSTM at time i, and the overall output of Bi-LSTM at time i is:

h_i = [→h_i ; ←h_i]    (4)
step three, entity recognition
the entity recognition task is formulated as a sequence tagging problem; an entity generally consists of several consecutive words in a sentence, so the BIO tagging scheme is adopted, i.e. B, I and O respectively mark the beginning, inside and outside of an entity, and a tag is assigned to each word; the most likely entity tag for each word is first calculated via a CRF, and the tag score is computed as:
s^(e)(h_i) = V^(e) f(W^(e) h_i + b^(e))    (5)

wherein f(·) is an element-wise activation function, V^(e) ∈ R^{p×l}, W^(e) ∈ R^{l×2d}, b^(e) ∈ R^l, d is the hidden size of the LSTM, p is the number of entity types, and l is the layer width;

a linear-chain CRF is then applied, taking the order between tags into account; for an input sentence w with candidate tag sequence y = y_1, ..., y_n, the sequence score is expressed as:

s(w, y) = Σ_{i=1}^{n} s^(e)(h_i)[y_i] + Σ_{i=1}^{n-1} T[y_i, y_{i+1}]    (6)

wherein s^(e)(h_i)[y_i] denotes the tag score when w_i is tagged y_i, and T denotes the transition score from y_i to y_{i+1}; finally, the probability that the sentence w is tagged with the sequence y is calculated as:

Pr(y | w) = exp(s(w, y)) / Σ_{y'} exp(s(w, y'))    (7)
since the prediction of the relationship classification depends to some extent on the result of the entity identification, the obtained entity information is taken as an embedded vector g i Input to the next step, i.e. z i =[h i ;g i ];
Step four, relation classification
the relation extraction task is treated as a multi-head selection problem, which can effectively identify all relation triples in a sentence and thereby extract overlapping relations; any relation type may hold between two entities, and each semantic relation is predicted independently;
a sentence w and a set of relation labels R are taken as input in order to identify the relation triples in the sentence; given a relation label r_k, the score between two entities w_i and w_j is calculated:
s^(r)(w_j, w_i, r_k) = V^(r) f(U^(r) z_j + W^(r) z_i + b^(r))    (8)
wherein V^(r), U^(r), W^(r) and b^(r) are the weights and bias of the relation classification layer, d is the hidden size of the LSTM, and f(·) denotes the activation function; the probability that words w_j and w_i stand in relation type r_k is defined as
Pr(head = w_j, label = r_k | w_i) = σ(s^(r)(w_j, w_i, r_k))    (9)
where σ is the sigmoid function; the sigmoid assumes that all relations are independent of each other and does not require the probabilities of all relations to sum to 1; when a probability exceeds 0.5, the corresponding relation between the two entities is considered to exist.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211131763.0A CN115688752A (en) | 2022-09-16 | 2022-09-16 | Knowledge extraction method based on multi-semantic features |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115688752A true CN115688752A (en) | 2023-02-03 |
Family
ID=85063245
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211131763.0A Pending CN115688752A (en) | 2022-09-16 | 2022-09-16 | Knowledge extraction method based on multi-semantic features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115688752A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116151241A (en) * | 2023-04-19 | 2023-05-23 | 湖南马栏山视频先进技术研究院有限公司 | Entity identification method and device |
CN116629264A (en) * | 2023-05-24 | 2023-08-22 | 成都信息工程大学 | Relation extraction method based on multiple word embedding and multi-head self-attention mechanism |
CN116629264B (en) * | 2023-05-24 | 2024-01-23 | 成都信息工程大学 | Relation extraction method based on multiple word embedding and multi-head self-attention mechanism |
CN116595992A (en) * | 2023-07-19 | 2023-08-15 | 江西师范大学 | Single-step extraction method for terms and types of binary groups and model thereof |
CN116595992B (en) * | 2023-07-19 | 2023-09-19 | 江西师范大学 | Single-step extraction method for terms and types of binary groups and model thereof |
CN117744787A (en) * | 2024-02-20 | 2024-03-22 | 中国电子科技集团公司第十研究所 | Intelligent measurement method for first-order research rule knowledge rationality |
CN117744787B (en) * | 2024-02-20 | 2024-05-07 | 中国电子科技集团公司第十研究所 | Intelligent measurement method for first-order research rule knowledge rationality |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |