CN110807084A - Attention mechanism-based patent term relationship extraction method for Bi-LSTM and keyword strategy - Google Patents

Attention mechanism-based patent term relationship extraction method for Bi-LSTM and keyword strategy Download PDF

Info

Publication number
CN110807084A
Authority
CN
China
Prior art keywords
lstm
keyword
representing
attention mechanism
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910404547.0A
Other languages
Chinese (zh)
Inventor
董志安
吕学强
孙少奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Information Science and Technology University
Original Assignee
Beijing Information Science and Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Information Science and Technology University filed Critical Beijing Information Science and Technology University
Priority to CN201910404547.0A priority Critical patent/CN110807084A/en
Publication of CN110807084A publication Critical patent/CN110807084A/en
Withdrawn legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to an attention mechanism-based patent term relationship extraction method using Bi-LSTM and a keyword strategy, comprising the following steps: step 1): preprocessing the patent text, identifying term features, adding position information, obtaining category keyword features through an improved TextRank algorithm, and forming a vector matrix; step 2): importing the vector matrix into a Bi-LSTM model and acquiring the global features of the text with an attention mechanism; step 3): selecting the key features of each sentence as local features with a max-pooling layer; step 4): fusing the global and local features; step 5): outputting the classification result with a softmax classifier. The method addresses the long-distance dependency problem of traditional deep learning approaches to patent term relationship extraction. Experimental comparisons show that it outperforms existing methods and meets the requirements of practical application.

Description

Attention mechanism-based patent term relationship extraction method for Bi-LSTM and keyword strategy
Technical Field
The invention belongs to the technical field of patent term relation extraction, and particularly relates to a Bi-LSTM and keyword strategy patent term relation extraction method based on an attention mechanism.
Background
With social development and scientific and technological progress, awareness of protecting scientific research achievements has grown and the number of patent applications rises year by year. To analyze the relationships between patents more effectively and to optimize patent retrieval, automatic extraction of patent term relationships has attracted more and more scholars: manual collection and earlier unsupervised learning algorithms can no longer meet the demand, so patent term relationships must be extracted automatically by computer. Automatic extraction of patent term relationships plays an important role in patent information retrieval, patent similarity detection, patent domain ontology construction, patent knowledge graph construction, latent semantic analysis and related work.
At present, the main approaches to relation extraction are pattern matching, dictionary-driven methods, statistical machine learning, and hybrids of several methods. All of them require manually engineered features such as part of speech, dependency relations and semantic roles, or depend to some extent on natural language processing tools such as part-of-speech taggers and syntactic parsers; since different tools produce somewhat different results, the final extraction result is affected.
In recent years, entity relation extraction with deep learning has become mainstream: effective text features can be learned automatically, and even without basic natural language processing tools these methods outperform traditional ones on many natural language processing tasks. They nevertheless remain limited in how they represent the local and global features of sentences.
Disclosure of Invention
In view of the above problems in the prior art, the present invention is directed to a method for extracting patent term relationships based on attention-based Bi-LSTM and keyword strategy, which can avoid the above technical disadvantages.
In order to achieve the purpose of the invention, the technical scheme adopted by the invention is as follows:
a patent term relationship extraction method based on Bi-LSTM and keyword strategy of attention mechanism comprises the following steps:
step 1): preprocessing a patent text, identifying term characteristics, adding position information, obtaining category keyword characteristics through an improved TextRank algorithm, and forming a vector matrix;
step 2): importing the vector matrix into a Bi-LSTM model, and acquiring the overall characteristics of the text information by adopting an attention mechanism;
step 3): selecting key features of each sentence as local features by utilizing the maximum pooling layer;
step 4): fusing the global features and the local features;
step 5): and outputting a classification result by using a softmax classifier.
Further, the improved TextRank algorithm in the step 1) is specifically as follows:

Step A: input the patent text information set S = {s_1, s_2, s_3, ..., s_n} to be processed and set the parameters: damping coefficient d, sliding window size w, maximum iteration count I, and iteration stop threshold ε;

Step B: perform word segmentation and part-of-speech tagging on each text s_i in the patent text information set S, filter stop words, and keep only words of the specified parts of speech (verbs, adjectives and nouns); these words form the final candidate category-feature keywords;

Step C: calculate the TF-IDF value of each word in the patent text information set S by the TF-IDF algorithm;

Step D: traverse the words of the patent text based on the sliding window size w and build an edge between any two co-occurring words, thereby constructing the keyword graph G_i formed by the words of s_i;

Step E: iteratively compute the weight of each word in the keyword graph G_i according to formula (1) until convergence, formula (1) being as follows:

W(v_i) = (1 − d) · W′(v_i)_{TF-IDF} + d · Σ_{v_j ∈ In(v_i)} ( w_{ji} / Σ_{v_k ∈ Out(v_j)} w_{jk} ) · W(v_j)    (1)

where W(v_i) is the weight of node v_i; d is the damping coefficient, representing the probability of jumping from a given node to any other node in the graph, set to 0.85; In(v_i) is the set of nodes pointing to v_i; Out(v_j) is the set of nodes pointed to by the edges leaving v_j; w_{ji} is the weight of the edge from v_j to v_i; and W′(v_i)_{TF-IDF} is the TF-IDF value of node v_i;

Step F: sort the words of the keyword graph G_i by weight and select the word with the largest weight whose part of speech is a verb as the category-feature keyword.
Further, the step 2) is specifically as follows: the formulas used in the attention layer are shown in (2), (3) and (4):

M = tanh(H)    (2)

α = softmax(w^T M)    (3)

h* = H α^T    (4)

where H = [h_1, h_2, h_3, ..., h_T] is the matrix output by the Bi-LSTM layer over T time steps, with H ∈ ℝ^(d_w × T) and d_w the dimension of the word vectors; w is a training parameter vector and w^T its transpose; α is the attention probability distribution vector; and h* is the learned sentence representation.
Further, the step 3) is specifically as follows: the output H of the Bi-LSTM model is processed with max pooling, as shown in formula (5):

h′ = maxpool(H)    (5)
further, the step 4) is specifically as follows: feature fusion is to combine the calculation results of the attention layer and the pooling layer, as shown in formula (6):
Figure BSA0000183132410000033
whereinRepresenting vector stitching.
Further, the step 5) specifically comprises: a softmax classifier is used to predict the label ŷ from a set of discrete classes Y for a sentence S. The classifier takes the fused feature as input, as shown in formulas (7) and (8):

p(y|S) = softmax(W^(S) h̃ + b^(S))    (7)

ŷ = argmax_y p(y|S)    (8)

The loss function is the negative log-likelihood of the true class label y with L2 regularization to prevent overfitting, as shown in formula (9):

J(θ) = −(1/m) Σ_{i=1}^{m} t_i log(y_i) + λ‖θ‖²    (9)

where t_i is the one-hot form of the true class label y, y_i is the softmax-estimated probability of each class, m is the number of training samples, λ is the L2 regularization hyperparameter, and θ are the trainable parameters of the model.
The attention mechanism-based patent term relationship extraction method using Bi-LSTM and a keyword strategy provided by the invention addresses the long-distance dependency problem of traditional deep learning approaches to patent term relationship extraction. Experimental comparisons show that the method outperforms existing approaches and meets the requirements of practical application.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 illustrates the sentence vectorization;
FIG. 3 is a view of the overall framework of the model;
FIG. 4 is a comparison of the internal experiments of the model;
FIG. 5 is a comparative graph of different experimental methods.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in FIG. 1, a patent term relationship extraction method based on attention mechanism Bi-LSTM and keyword strategy comprises the following steps:
step 1): preprocessing a patent text, identifying term characteristics, adding position information, obtaining category keyword characteristics through an improved TextRank algorithm, and forming a vector matrix;
preprocessing the patent text means splitting it into sentences at commas, semicolons and full stops, identifying the term features in each sentence while adding position information, and obtaining the category keyword feature that represents each sentence with the improved TextRank keyword extraction algorithm; the sentences and the extracted features form the final vector matrix.
The word vector model is as follows:

A word vector (word embedding) is a distributed representation of words: each word of an input sentence is mapped to a continuous real-valued vector that captures its syntactic and semantic information. Given a sentence s = {w_1, w_2, w_3, ..., w_k} containing k words, each word w_i is mapped to a low-dimensional real vector x_i by formula (1):

x_i = W^word · V_i    (1)

where x_i is the vector form of the word w_i; W^word ∈ ℝ^(d_w × m) is the vector matrix obtained from word2vec training, d_w is the word-vector dimension, m is the size of a fixed word list, and V_i is the one-hot (bag-of-words) representation of the word w_i. Mapping the words of each sentence by formula (1) yields the word-vector representation V_s = {x_1, x_2, x_3, ..., x_k} of the sentence.
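As a minimal sketch of this lookup (the vocabulary size, dimension and word indices below are illustrative assumptions, and a random matrix stands in for the word2vec-trained W^word):

```python
import numpy as np

# Illustrative sizes; W_word would come from word2vec training in practice.
m, d_w = 5000, 100                    # word-list size, word-vector dimension
W_word = np.random.randn(d_w, m)      # stand-in for the trained vector matrix

def embed(word_ids):
    """Map a sentence (list of word indices) to its vector sequence V_s."""
    # Multiplying by a one-hot V_i just selects a column of W_word,
    # so the lookup is plain column indexing.
    return np.stack([W_word[:, i] for i in word_ids], axis=0)   # shape (k, d_w)

sentence_ids = [12, 408, 7, 99]       # hypothetical indices of a 4-word sentence
V_s = embed(sentence_ids)             # V_s = {x_1, ..., x_k}
```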
The position vector features are as follows:

In the patent term relation extraction task, the words that highlight the relationship between two terms are usually distributed near the terms. To extract the relationship between terms more accurately, the distance from each word to the two terms is computed to generate a position vector matrix; the position vector information of each word is concatenated after the word vectors of the sentence. For a sentence s = {w_1, w_2, w_3, ..., w_k} containing k words, the relative distances of each word w_i to the two terms are i − t_1 and i − t_2, where i is the position index of the current word in the sentence and t_1 and t_2 are the position indexes of the two terms in the sentence. A position vector matrix is generated from the obtained word position information by the word2vec tool. The dimension of each word vector then becomes:

d_w′ = d_w + 2 d_p    (2)

where d_w′ is the vector dimension after concatenating the position vector information, d_w is the original word-vector dimension, and d_p is the position-vector dimension.
For example, the sentence "the number of chargers connected under the charging control system" segments into "charging control system / control / charger / access / number", with "charging control system" and "charger" as the two patent terms of the sentence. The distance from the word "control" to the term "charging control system" is then 1 and to the term "charger" is −1; the distance from the word "number" to the term "charging control system" is 5 and to the term "charger" is 3.
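A short sketch of this relative-distance computation (the English tokenization below is illustrative; only the "control" distances are asserted by the text above):

```python
# Compute (i - t1, i - t2) for every token index i, where t1 and t2 are the
# position indexes of the two patent terms in the segmented sentence.
def relative_positions(tokens, t1, t2):
    return [(i - t1, i - t2) for i in range(len(tokens))]

tokens = ["charging control system", "control", "charger", "access", "number"]
print(relative_positions(tokens, t1=0, t2=2))
# 'control' sits at i=1: distance 1 to the first term and -1 to the second,
# matching the example in the text.
```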
The category keyword feature extraction based on sentence level is as follows:
the TextRank algorithm is a sorting algorithm based on a graph model and can be used for extracting keywords of a text. The method comprises the steps of dividing a text into a plurality of composition units (words and sentences), establishing a graph model, sequencing important components in the text by using a voting mechanism, and extracting keywords only by using the information of a single document.
The TextRank algorithm is simple and easy to use: it exploits the associations between words and extracts keywords from a single document alone. However, because TextRank relies only on the document itself and assigns every word the same importance at initialization, it struggles to extract the keywords of a text accurately. The TF-IDF algorithm depends on the corpus environment and therefore knows the importance of a word in advance, which is where it surpasses TextRank. The TF-IDF idea is therefore incorporated into TextRank: the initial importance of each word is set to its TF-IDF value, so the importance of words is reflected from the very start of the algorithm, improving both its efficiency and accuracy. The Improved TextRank (IMTR) algorithm is described as follows:
(1) input the patent text information set S = {s_1, s_2, s_3, ..., s_n} to be processed and set the parameters: damping coefficient d, sliding window size w, maximum iteration count I, and iteration stop threshold ε;

(2) perform word segmentation and part-of-speech tagging on each text s_i in the patent text information set S, filter stop words, and keep only words of the specified parts of speech (verbs, adjectives and nouns); these words form the final candidate category-feature keywords;

(3) calculate the TF-IDF value of each word in the patent text information set S by the TF-IDF algorithm;

(4) traverse the words of the patent text based on the sliding window size w and build an edge between any two co-occurring words, constructing the keyword graph G_i formed by the words of s_i;

(5) iteratively compute the weight of each word in the keyword graph G_i according to the formula

W(v_i) = (1 − d) · W′(v_i)_{TF-IDF} + d · Σ_{v_j ∈ In(v_i)} ( w_{ji} / Σ_{v_k ∈ Out(v_j)} w_{jk} ) · W(v_j)

until convergence;

(6) sort the words of the keyword graph G_i by weight and select the word with the largest weight whose part of speech is a verb as the category-feature keyword.

In the algorithm, W(v_i) is the weight of node v_i; d is the damping coefficient, representing the probability of jumping from a given node to any other node in the graph, generally set to 0.85; In(v_i) is the set of nodes pointing to v_i; Out(v_j) is the set of nodes pointed to by the edges leaving v_j; w_{ji} is the weight of the edge from v_j to v_i; and W′(v_i)_{TF-IDF} is the TF-IDF value of node v_i.
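The following condensed Python sketch of the IMTR iteration rests on stated assumptions: tokens holds the filtered candidate words of one sentence s_i, tfidf is a precomputed map of TF-IDF values over the whole set S, and the keyword graph is a plain dictionary of co-occurrence edge weights rather than a dedicated graph library:

```python
from collections import defaultdict

def imtr(tokens, tfidf, w=5, d=0.85, max_iter=100, eps=1e-4):
    # Build undirected co-occurrence edges within a sliding window of size w.
    edges = defaultdict(float)
    for i, u in enumerate(tokens):
        for v in tokens[i + 1:i + w]:
            if u != v:
                edges[(u, v)] += 1.0
                edges[(v, u)] += 1.0
    nodes = set(tokens)
    out_weight = defaultdict(float)           # denominator over Out(v_j)
    for (u, _), wt in edges.items():
        out_weight[u] += wt
    # Initialize node weights with TF-IDF values instead of a uniform constant.
    W = {v: tfidf.get(v, 1.0) for v in nodes}
    for _ in range(max_iter):
        W_new = {}
        for vi in nodes:
            rank = sum(edges[(vj, vi)] / out_weight[vj] * W[vj]
                       for vj in nodes if (vj, vi) in edges)
            W_new[vi] = (1 - d) * tfidf.get(vi, 1.0) + d * rank
        converged = max(abs(W_new[v] - W[v]) for v in nodes) < eps
        W = W_new
        if converged:
            break
    return sorted(W.items(), key=lambda kv: -kv[1])
```

The verb filter of step (6) is omitted here; with the part-of-speech tags of step (2) it is a one-line selection over the sorted result.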
The text features adopted by the invention include the term features, position information and keyword features; after vectorization they are concatenated into the text word vectors to form the final vectorized representation shown in FIG. 2.
Step 2): importing the vector matrix into a Bi-LSTM model, and acquiring the overall characteristics of the text information by adopting an attention mechanism;
a Long Short-Term Memory network (LSTM), proposed by Hochreiter et al. in 1997 in "Long Short-Term Memory", is a special type of Recurrent Neural Network (RNN) that makes effective use of long-distance information. LSTM was designed to preserve information faithfully over time: it introduces a memory cell that records history and controls that record selectively, leading to the concept of three control gates: an input gate, a forget gate and an output gate.
In the patent term semantic relation extraction task, both the historical information and the future context of the text should be considered. The LSTM model, however, records only historical information and knows nothing of future information. Unlike the LSTM model, the bidirectional LSTM model considers both past features (extracted by forward propagation) and future features (extracted by backward propagation). Simply understood, a bidirectional LSTM amounts to two LSTMs, one producing a forward output sequence and one a backward output sequence, whose outputs are combined as the final result. The bidirectional LSTM model thus makes effective use of the context information of the patent text and can uncover more implicit features in it.
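A minimal Keras sketch of such a bidirectional encoder (the sequence length, input dimension and hidden size are illustrative assumptions, not the tuned values of Table 3):

```python
import tensorflow as tf

seq_len, feat_dim, hidden = 50, 120, 128   # assumed shapes of the vector matrix

inputs = tf.keras.Input(shape=(seq_len, feat_dim))
# return_sequences=True keeps the per-time-step states H = [h_1, ..., h_T],
# which both the attention layer and the max-pooling layer consume later.
H = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(hidden, return_sequences=True))(inputs)
encoder = tf.keras.Model(inputs, H)
```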
The Attention mechanism derives from the human visual attention mechanism and was first applied in the field of visual images. Bahdanau et al., in "Neural Machine Translation by Jointly Learning to Align and Translate", were the first to apply the attention mechanism to natural language processing, and its use has kept growing with subsequent research on other topics in the field. By computing attention probabilities that highlight the importance of particular words to the whole sentence, the attention mechanism lets the model focus on the important information in the patent text.
In this part, an attention mechanism for the relation classification task is applied to the output of the Bi-LSTM model to obtain an attention probability distribution, from which the importance of the LSTM output state at each time step to the relation classification is obtained and a sentence representation is learned, improving the final classification effect. In this model, the attention layer uses the following formulas:
M = tanh(H)

α = softmax(w^T M)

h* = H α^T

where H = [h_1, h_2, h_3, ..., h_T] is the matrix output by the Bi-LSTM layer over T time steps, with H ∈ ℝ^(d_w × T) and d_w the dimension of the word vectors; w is a training parameter vector and w^T its transpose; α is the attention probability distribution vector; and h* is the learned sentence representation.
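A sketch of this attention layer, assuming the batched shape convention (batch, T, dim) for the Bi-LSTM output H:

```python
import tensorflow as tf

class SentenceAttention(tf.keras.layers.Layer):
    def build(self, input_shape):
        dim = int(input_shape[-1])
        self.w = self.add_weight(name="w", shape=(dim, 1))  # training parameter vector

    def call(self, H):
        M = tf.tanh(H)                                  # M = tanh(H)
        scores = tf.squeeze(tf.matmul(M, self.w), -1)   # w^T M per time step
        alpha = tf.nn.softmax(scores, axis=-1)          # attention distribution
        # h* = H alpha^T: attention-weighted sum over the T time steps
        return tf.reduce_sum(H * tf.expand_dims(alpha, -1), axis=1)
```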
Step 3): selecting key features of each sentence as local features by utilizing the maximum pooling layer;

For the output H = [h_1, h_2, h_3, ..., h_T] of the Bi-LSTM model, in addition to the attention mechanism, the max-pooling statistic is also used to obtain the feature representation most relevant to the classification task, namely: h′ = maxpool(H).
Step 4): fusing the global features and the local features;

The feature fusion combines the calculation results of the attention layer and the pooling layer so that the multiple features complement one another, namely: h̃ = h* ⊕ h′, where ⊕ represents vector concatenation.
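In a sketch, steps 3) and 4) then reduce to a max over the time axis plus a concatenation:

```python
import tensorflow as tf

def fuse(H, h_star):
    h_local = tf.reduce_max(H, axis=1)            # h' = maxpool(H), formula (5)
    return tf.concat([h_star, h_local], axis=-1)  # fused feature, formula (6)
```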
Step 5): outputting the classification result by using a softmax classifier.

Casting the patent term relationship extraction problem as a multi-class classification problem, the invention uses a softmax classifier to predict the label ŷ from a set of discrete classes Y for a sentence S. The classifier takes the fused feature as input:

p(y|S) = softmax(W^(S) h̃ + b^(S))

ŷ = argmax_y p(y|S)

The loss function is the negative log-likelihood of the true class label y with L2 regularization to prevent overfitting:

J(θ) = −(1/m) Σ_{i=1}^{m} t_i log(y_i) + λ‖θ‖²

where t_i is the one-hot form of the true class label y, y_i is the softmax-estimated probability of each class, m is the number of training samples, λ is the L2 regularization hyperparameter, and θ are the trainable parameters of the model. The model framework used in the invention is illustrated in FIG. 3.
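A sketch of the classification and loss computation (num_classes = 7 matches the relation types of Table 1; the λ value below is an illustrative assumption, not the tuned setting):

```python
import tensorflow as tf

num_classes, lambda_l2 = 7, 1e-4

# Softmax classifier over the fused feature, with L2 weight regularization.
dense = tf.keras.layers.Dense(
    num_classes, activation="softmax",
    kernel_regularizer=tf.keras.regularizers.l2(lambda_l2))

def loss_fn(y_true_onehot, h_fused):
    y_prob = dense(h_fused)                          # p(y|S)
    # Negative log-likelihood of the true labels, averaged over the batch,
    # plus the L2 penalty collected from the layer.
    nll = tf.keras.losses.categorical_crossentropy(y_true_onehot, y_prob)
    return tf.reduce_mean(nll) + tf.add_n(dense.losses)
```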
The experimental data and evaluation criteria were as follows:

The experimental data are 9978 patent texts in the new-energy-vehicle field crawled from a patent retrieval and analysis website. The final purpose of the experiment is to extract the relationships between the domain terms in these patents; since domain terms occur in every part of a patent text, the abstract, the description and the claims are all used as corpus for extracting the domain term relationships. After preprocessing the patent text data, 6912 corpus entries were selected as experimental data, 5248 of them as training data and 1664 as test data. The specific data processing steps are as follows, with a code sketch of the splitting and filtering after the list:
(1) perform term extraction on the patent data with the patent term extraction algorithm of Lv, Xiagnru in "Patent Domain Terminology Extraction Based on Multi-feature Fusion and BILSTM-CRF Model";
(2) build a term dictionary from the extracted patent terms, add it to the jieba word segmentation tool, and segment the patent data;

(3) split the patent data into sentences at commas, semicolons and full stops, each sentence forming one corpus entry;

(4) select the sentences containing exactly two patent terms to form the final data set;

(5) label the screened data with relation categories to obtain the final experimental data.
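A sketch of the sentence splitting and filtering of steps (3) and (4), assuming the term dictionary of step (2) is given:

```python
import re

def build_corpus(patent_texts, term_dict):
    corpus = []
    for text in patent_texts:
        # Split on Chinese and ASCII commas, semicolons and full stops.
        for sent in re.split(r"[，；。,;.]", text):
            hits = [t for t in term_dict if t in sent]
            if len(hits) == 2:            # keep sentences with exactly two terms
                corpus.append((sent, hits[0], hits[1]))
    return corpus
```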
The 6912 data entries selected for the experiments of the invention contain 7 relationship types; the relationship types are shown in Table 1 and sample entries in Table 2.
TABLE 1 sample relationships
Table 2 sample examples
Precision and recall are two metrics widely used in information retrieval and statistical classification to evaluate the quality of results. Precision is the ratio of the number of retrieved relevant documents to the total number of retrieved documents, measuring the exactness of the retrieval system; recall is the ratio of the number of retrieved relevant documents to the number of all relevant documents in the library, measuring its completeness; the F value combines the two into a single comprehensive index.
To verify the correctness and validity of the proposed model, the macro-averaged F1 (macro_F1) value is used as the evaluation index of the experiments. Computing it requires the precision (P), recall (R) and F1 value of each category:

P_i = TP_i / (TP_i + FP_i)

R_i = TP_i / (TP_i + FN_i)

F1_i = 2 · P_i · R_i / (P_i + R_i)

where TP_i is the number of samples of the i-th relationship type that are predicted correctly; FP_i is the number of samples wrongly predicted to be of the i-th relationship type; and FN_i is the number of samples that belong to the i-th relationship type but are wrongly predicted as other types. The macro_F1 value is then

macro_F1 = (1/M) Σ_{i=1}^{M} F1_i

where M is the number of relationship types.
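A direct sketch of these formulas from per-class TP/FP/FN counts:

```python
import numpy as np

def macro_f1(tp, fp, fn):
    tp, fp, fn = map(np.asarray, (tp, fp, fn))
    precision = tp / np.maximum(tp + fp, 1)          # P_i
    recall = tp / np.maximum(tp + fn, 1)             # R_i
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return f1.mean()                                 # average over the M types
```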
The parameter settings and results were analyzed as follows:

The experiments ran on a Dell server with a 64-bit Ubuntu 16.04 operating system, an NVIDIA Tesla K40 GPU and 64 GB of memory. The model is implemented in Python with the TensorFlow framework. The final patent term relationship extraction quality of the model is closely tied to its parameters; locally optimal values of each parameter were found through extensive tuning experiments, and the specific settings are shown in Table 3. The final results of this experiment are shown in Table 4.
TABLE 3 model parameter settings
TABLE 4 Final Experimental results
The per-type results in Table 4 show that the simplicity or complexity of a relationship type affects the final extraction quality: a simple type (e.g., the spatial relationship) is easier for the model to learn and is recognized more accurately, while a complex type (e.g., the generic relationship) yields less semantic association during learning and is recognized less well.
The internal experimental comparisons of the model are as follows:
In order to verify the benefit of adding the keyword features and the pooling layer to the attention-based Bi-LSTM model for patent term relation extraction, four model variants were designed and compared in internal experiments; the original input of every model consists of the sentence word vectors, position feature vectors and term feature vectors. The results are shown in Table 5 and FIG. 4, with per-category results in Table 6.
TABLE 5 comparison of model internal experiments
Comparison of the experimental results for each class of the model in Table 6
The precision, recall and F1 values of each experiment in Tables 5 and 6 and FIG. 4 show that adding the keyword features and the pooling layer to the attention-based Bi-LSTM model designed by the invention works well and can effectively extract the relationships between patent terms in the new-energy-vehicle field. Experiment 1 uses only the Attention + Bi-LSTM model; it achieves a certain effect and partially solves the patent term relation extraction problem, but the final extraction result still leaves room for improvement. Experiment 2 adds the keyword features to experiment 1 and experiment 3 adds the pooling layer; both improve on experiment 1, so the keyword features and the pooling layer each contribute to better patent term relation extraction. Experiment 2 exceeds the F1 value of experiment 1 by 0.95% and experiment 3 exceeds it by 0.42%, so the keyword features contribute more than the pooling layer. The reason is that the keyword features increase the discriminability of the relation categories and compensate for the limits of the features learned automatically by the Attention + Bi-LSTM model, so adding them explicitly benefits the extraction.
The invention therefore adds the keyword features and the pooling layer to the attention-based Bi-LSTM model simultaneously; experiment 4 shows that the keyword + Attention + Bi-LSTM + pooling model achieves a better experimental result than a generic deep learning model.
The different classification methods were compared as follows:
In order to verify the advantage of the Attention + Bi-LSTM model in patent term relation extraction, it was compared with the RNN, LSTM and Bi-LSTM models on the same data set. To unify the experimental standard, all models take the same input word vectors, in the vector format of FIG. 2, and all include the pooling layer. The results are shown in Tables 7 and 8 and FIG. 5.
TABLE 7 comparison of different experimental methods
Comparison of the experimental results for each class of the model in Table 8
The comparison of the different methods in Table 7 and FIG. 5 shows that Bi-LSTM outperforms the LSTM and RNN methods. This is because the Bi-LSTM model considers both past features (extracted by forward propagation) and future features (extracted by backward propagation), making effective use of the context of the patent text to extract more implicit features. Adding the attention mechanism on top of the Bi-LSTM model improves the result further, because the attention probabilities highlight the importance of particular words to the whole sentence and let the model focus on the important information in the patent text. These comparisons confirm the effectiveness of the method of the invention for patent term relation extraction.
The attention mechanism-based patent term relationship extraction method using Bi-LSTM and a keyword strategy provided by the invention addresses the long-distance dependency problem of traditional deep learning approaches to patent term relationship extraction. Experimental comparisons show that the method outperforms existing approaches and meets the requirements of practical application.
The embodiments above describe the invention in relatively specific detail, but they are not to be construed as limiting its scope. A person skilled in the art can make several variations and improvements without departing from the inventive concept, and these fall within the scope of the invention. The protection scope of this patent is therefore subject to the appended claims.

Claims (6)

1. A patent term relation extraction method based on Bi-LSTM and keyword strategy of attention mechanism is characterized by comprising the following steps:
step 1): preprocessing a patent text, identifying term characteristics, adding position information, obtaining category keyword characteristics through an improved TextRank algorithm, and forming a vector matrix;
step 2): importing the vector matrix into a Bi-LSTM model, and acquiring the overall characteristics of the text information by adopting an attention mechanism;
step 3): selecting key features of each sentence as local features by utilizing the maximum pooling layer;
step 4): fusing the global features and the local features;
step 5): and outputting a classification result by using a softmax classifier.
2. The attention mechanism-based patent term relationship extraction method for the Bi-LSTM and keyword strategy according to claim 1, wherein the improved TextRank algorithm in the step 1) is specifically as follows:

step A: input the patent text information set S = {s_1, s_2, s_3, ..., s_n} to be processed and set the parameters: damping coefficient d, sliding window size w, maximum iteration count I, and iteration stop threshold ε;

step B: perform word segmentation and part-of-speech tagging on each text s_i in the patent text information set S, filter stop words, and keep only words of the specified parts of speech (verbs, adjectives and nouns); these words form the final candidate category-feature keywords;

step C: calculate the TF-IDF value of each word in the patent text information set S by the TF-IDF algorithm;

step D: traverse the words of the patent text based on the sliding window size w and build an edge between any two co-occurring words, thereby constructing the keyword graph G_i formed by the words of s_i;

step E: iteratively compute the weight of each word in the keyword graph G_i according to formula (1) until convergence, formula (1) being as follows:

W(v_i) = (1 − d) · W′(v_i)_{TF-IDF} + d · Σ_{v_j ∈ In(v_i)} ( w_{ji} / Σ_{v_k ∈ Out(v_j)} w_{jk} ) · W(v_j)    (1)

where W(v_i) is the weight of node v_i; d is the damping coefficient, representing the probability of jumping from a given node to any other node in the graph, set to 0.85; In(v_i) is the set of nodes pointing to v_i; Out(v_j) is the set of nodes pointed to by the edges leaving v_j; w_{ji} is the weight of the edge from v_j to v_i; and W′(v_i)_{TF-IDF} is the TF-IDF value of node v_i;

step F: sort the words of the keyword graph G_i by weight and select the word with the largest weight whose part of speech is a verb as the category-feature keyword.
3. The attention mechanism-based patent term relationship extraction method for the Bi-LSTM and keyword strategy according to claim 1, wherein the step 2) is specifically: the formulas used in the attention layer are shown in (2), (3) and (4):

M = tanh(H)    (2)

α = softmax(w^T M)    (3)

h* = H α^T    (4)

where H = [h_1, h_2, h_3, ..., h_T] is the matrix output by the Bi-LSTM layer over T time steps, with H ∈ ℝ^(d_w × T) and d_w the dimension of the word vectors; w is a training parameter vector and w^T its transpose; α is the attention probability distribution vector; and h* is the learned sentence representation.
4. The attention mechanism-based patent term relationship extraction method for the Bi-LSTM and keyword strategy according to claim 1, wherein the step 3) is specifically: the output H of the Bi-LSTM model is processed with max pooling, as shown in formula (5):

h′ = maxpool(H)    (5).
5. The attention mechanism-based patent term relationship extraction method for the Bi-LSTM and keyword strategy according to claim 1, wherein the step 4) is specifically: feature fusion combines the calculation results of the attention layer and the pooling layer, as shown in formula (6):

h̃ = h* ⊕ h′    (6)

where ⊕ represents vector concatenation.
6. The attention mechanism-based patent term relationship extraction method for the Bi-LSTM and keyword strategy according to claim 1, wherein the step 5) is specifically: a softmax classifier is used to predict the label ŷ from a set of discrete classes Y for a sentence S; the classifier takes the fused feature as input, as shown in formulas (7) and (8):

p(y|S) = softmax(W^(S) h̃ + b^(S))    (7)

ŷ = argmax_y p(y|S)    (8)

The loss function is the negative log-likelihood of the true class label y with L2 regularization to prevent overfitting, as shown in formula (9):

J(θ) = −(1/m) Σ_{i=1}^{m} t_i log(y_i) + λ‖θ‖²    (9)

where t_i is the one-hot form of the true class label y, y_i is the softmax-estimated probability of each class, m is the number of training samples, λ is the L2 regularization hyperparameter, and θ are the trainable parameters of the model.
CN201910404547.0A 2019-05-15 2019-05-15 Attention mechanism-based patent term relationship extraction method for Bi-LSTM and keyword strategy Withdrawn CN110807084A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910404547.0A CN110807084A (en) 2019-05-15 2019-05-15 Attention mechanism-based patent term relationship extraction method for Bi-LSTM and keyword strategy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910404547.0A CN110807084A (en) 2019-05-15 2019-05-15 Attention mechanism-based patent term relationship extraction method for Bi-LSTM and keyword strategy

Publications (1)

Publication Number Publication Date
CN110807084A true CN110807084A (en) 2020-02-18

Family

ID=69487335

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910404547.0A Withdrawn CN110807084A (en) 2019-05-15 2019-05-15 Attention mechanism-based patent term relationship extraction method for Bi-LSTM and keyword strategy

Country Status (1)

Country Link
CN (1) CN110807084A (en)


Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444712B (en) * 2020-03-25 2022-08-30 重庆邮电大学 Keyword extraction method, terminal and computer readable storage medium
CN111444712A (en) * 2020-03-25 2020-07-24 重庆邮电大学 Keyword extraction method, terminal and computer readable storage medium
CN111475629A (en) * 2020-03-31 2020-07-31 渤海大学 Knowledge graph construction method and system for math tutoring question-answering system
CN112052683A (en) * 2020-09-03 2020-12-08 平安科技(深圳)有限公司 Text matching method and device, computer equipment and storage medium
CN112256939A (en) * 2020-09-17 2021-01-22 青岛科技大学 Text entity relation extraction method for chemical field
CN112256939B (en) * 2020-09-17 2022-09-16 青岛科技大学 Text entity relation extraction method for chemical field
CN112163426A (en) * 2020-09-30 2021-01-01 中国矿业大学 Relationship extraction method based on combination of attention mechanism and graph long-time memory neural network
CN112364174A (en) * 2020-10-21 2021-02-12 山东大学 Patient medical record similarity evaluation method and system based on knowledge graph
CN112507109A (en) * 2020-12-11 2021-03-16 重庆知识产权大数据研究院有限公司 Retrieval method and device based on semantic analysis and keyword recognition
CN113342929A (en) * 2021-05-07 2021-09-03 上海大学 Material-component-process-performance relation quadruple extraction method for material field
CN113312532A (en) * 2021-06-01 2021-08-27 哈尔滨工业大学 Public opinion grade prediction method based on deep learning and oriented to public inspection field
CN113312532B (en) * 2021-06-01 2022-10-21 哈尔滨工业大学 Public opinion grade prediction method based on deep learning and oriented to public inspection field
CN113535948B (en) * 2021-06-02 2022-08-16 中国人民解放军海军工程大学 LSTM-Attention text classification method introducing essential point information
CN113535948A (en) * 2021-06-02 2021-10-22 中国人民解放军海军工程大学 LSTM-Attention text classification method introducing essential point information
CN113535800A (en) * 2021-06-03 2021-10-22 同盾科技有限公司 Feature representation method in credit scenario, electronic device, and storage medium
CN113743099A (en) * 2021-08-18 2021-12-03 重庆大学 Self-attention mechanism-based term extraction system, method, medium and terminal
CN113743099B (en) * 2021-08-18 2023-10-13 重庆大学 System, method, medium and terminal for extracting terms based on self-attention mechanism

Similar Documents

Publication Publication Date Title
CN110807084A (en) Attention mechanism-based patent term relationship extraction method for Bi-LSTM and keyword strategy
CN107992597B (en) Text structuring method for power grid fault case
CN109766544B (en) Document keyword extraction method and device based on LDA and word vector
CN105279495A (en) Video description method based on deep learning and text summarization
CN106294344A (en) Video retrieval method and device
Cai et al. Intelligent question answering in restricted domains using deep learning and question pair matching
CN111368088A (en) Text emotion classification method based on deep learning
CN114048354B (en) Test question retrieval method, device and medium based on multi-element characterization and metric learning
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN114936277A (en) Similarity problem matching method and user similarity problem matching system
CN112434164A (en) Network public opinion analysis method and system considering topic discovery and emotion analysis
CN110297986A (en) A kind of Sentiment orientation analysis method of hot microblog topic
CN111581943A (en) Chinese-over-bilingual multi-document news viewpoint sentence identification method based on sentence association graph
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
Gunaseelan et al. Automatic extraction of segments from resumes using machine learning
CN115168580A (en) Text classification method based on keyword extraction and attention mechanism
CN111259156A (en) Hot spot clustering method facing time sequence
CN110245234A (en) A kind of multi-source data sample correlating method based on ontology and semantic similarity
CN117291190A (en) User demand calculation method based on emotion dictionary and LDA topic model
Jiang et al. A hierarchical bidirectional LSTM sequence model for extractive text summarization in electric power systems
Yafoz et al. Analyzing machine learning algorithms for sentiments in arabic text
Utami Sentiment Analysis of Hotel User Review using RNN Algorithm
Essatouti et al. Arabic sentiment analysis using a levenshtein distance based representation approach
Liu et al. Suggestion mining from online reviews using random multimodel deep learning
Tan et al. Sentiment analysis of chinese short text based on multiple features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200218