CN110826332A - GP-based automatic identification method for named entities of traditional Chinese medicine patents - Google Patents


Info

Publication number
CN110826332A
CN110826332A (Application CN201911062344.4A)
Authority
CN
China
Prior art keywords
chinese medicine
traditional chinese
tree
named entities
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201911062344.4A
Other languages
Chinese (zh)
Inventor
张亚宇
谷波
钱宇华
马国帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanxi University
Original Assignee
Shanxi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanxi University filed Critical Shanxi University
Priority to CN201911062344.4A priority Critical patent/CN110826332A/en
Publication of CN110826332A publication Critical patent/CN110826332A/en
Withdrawn legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/322 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Genetics & Genomics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

A GP-based automatic recognition method for named entities of traditional Chinese medicine patents realizes automatic extraction of patent-document features through autonomous learning of a model, and then labels named entities according to the extracted feature information. The invention applies genetic programming to the task of recognizing named entities in traditional Chinese medicine patents, so that the algorithm can learn autonomously; compared with current mainstream deep learning methods, it has fewer parameters and is easy to operate. In the learning process, both the context information of each word and the dependency relationships between words are considered, so information extraction is more complete. Compared with the gate-based LSTM algorithm, using GP to search for the memory cells can discover more complex operation structures, find more named entities than the original method, and improve the performance of the algorithm. The method is used to automatically identify named entities in traditional Chinese medicine patents and can also be extended to related tasks such as keyword extraction.

Description

GP-based automatic identification method for named entities of traditional Chinese medicine patents
Technical Field
The invention relates to the field of natural language processing, in particular to a GP-based automatic identification method for named entities of traditional Chinese medicine patents.
Background
Patent indexing is the core of deep data processing: through indexing, the various kinds of retrieval information in patent documents can be extracted effectively, improving the efficiency and accuracy of patent document retrieval. Traditional Chinese medicine patent data, a class of patent data with important value, contains a large number of specialized terms (traditional medicines, compounds, etc.), which makes automatic indexing extremely difficult. Within automatic indexing, named entity recognition is an important first step, and its results affect all subsequent tasks.
Named Entity Recognition (NER), a key task in natural language processing, aims to recognize entities with specific meanings in text; the technique can therefore identify entities such as drugs and compounds in traditional Chinese medicine patent data and is an effective approach. NER methods have long fallen into three main categories: (1) rule/dictionary-based methods; (2) methods based on traditional machine learning (e.g., CRF, HMM, MEMM); (3) deep-learning-based methods. In recent years, deep learning has been the mainstream approach to named entity recognition, with good results from Bi-LSTM, LSTM+CRF, RNN+CRF, and neural network structures based on attention and transfer learning. Traditional NER methods generally require word segmentation of the text, and the quality of the segmentation directly affects the experimental results, especially for traditional Chinese medicine patent data rich in specialized vocabulary; deep-learning-based methods involve many network types, depend heavily on parameter settings, and have poor model interpretability.
Genetic Programming (GP), a branch of evolutionary computation, mimics human intelligence: it can select computer programs through autonomous learning to solve a task given in advance, with few parameters and easy operation. Combining GP with the traditional Chinese medicine patent named-entity recognition task, the invention provides a Genetic Programming (GP)-based automatic named entity recognition method.
Disclosure of Invention
The invention aims to solve the following problems:
(1) recognize named entities (drugs and compounds) in traditional Chinese medicine patent data, facilitating later automatic indexing of the patent data;
(2) design a universal named entity recognition model that can learn autonomously from different data types without excessive human involvement;
(3) optimize via the GP algorithm, avoiding problems such as gradient vanishing and gradient explosion that arise when traditional deep learning solves the optimization problem;
(4) use the GP algorithm to search the model's input and output points, so that more complex structures can be found, more named entities are discovered than with the original method, and performance is better;
(5) compared with current mainstream deep learning methods, use fewer parameters, be easy to operate, and avoid the errors that word segmentation introduces in traditional methods.
The invention discloses a GP-based automatic identification method for named entities of traditional Chinese medicine patents, which has the advantages of fewer parameters, easy operation, sufficient extracted information and good performance.
A GP-based automatic recognition method for named entities of traditional Chinese medicine patents is characterized in that automatic extraction of document features is achieved through autonomous learning of a model, and named entity labeling is then achieved according to the extracted feature information.
The method comprises the following specific steps:
The first step: data preparation: data cleaning and manual labeling of the named entities in the Chinese patent documents;
The second step: structured representation of the data: converting the "Chinese characters" in the training data into vector form by a character embedding method, each "Chinese character" being embedded as an l-dimensional vector;
The third step: the model learning process based on the GP algorithm; training proceeds sentence by sentence.
1. Local information extraction for words
A. Context information representation of words
For each sentence in the data set, the word vectors of its words are concatenated in order into matrix form; for example, a sentence of length s can be represented as an s x l matrix. When extracting a word's local information, the window is set to 5 x l, i.e., the information of the two preceding and the two following words is considered at the same time; if the current word lacks two preceding or following words, zero vectors are used for padding. The context information of each word can thus be represented as a 5 x l matrix, whose entries are indexed P_11, P_12, ..., P_5l; Figure 1 gives the context information matrix representation of the word "xi".
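A minimal sketch of the windowing step, assuming a window of 5 and zero-vector padding as described (function and variable names are our own, not the patent's):

```python
# Sketch: build the 5 x l context-information matrix for each word in a
# sentence, padding with zero vectors when fewer than two neighbours exist.

def context_matrix(vectors, t, window=5):
    """vectors: list of l-dim word vectors for one sentence; t: word index."""
    l = len(vectors[0])
    half = window // 2
    rows = []
    for i in range(t - half, t + half + 1):
        if 0 <= i < len(vectors):
            rows.append(vectors[i])
        else:
            rows.append([0.0] * l)   # zero-vector padding at sentence edges
    return rows                       # 5 rows, each of length l

sentence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # s=3 words, l=2
m = context_matrix(sentence, 0)                    # first word: two pad rows
```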
B. Local information extraction for words
Through A, each word in the sentence corresponds to a context information matrix. The local information extraction step converts each Chinese character, according to its context information matrix, into a vector containing local information by learning a tree structure; this vector is denoted y_t = [y_t1, y_t2, ..., y_tm], t = 1, 2, ..., s, where the invention assumes s words in the sentence and the resulting local-information "word vector" has dimension m. The specific process is as follows:
(1) Randomly initialize T tree structures, each consisting of m subtrees. The leaf nodes of a tree are indices into the word's context information matrix, and the intermediate nodes are randomly given operators and elementary functions (each subtree is in fact a function expression whose variables are the indices at its leaves). The outputs of the m subtrees are each multiplied by C and concatenated into an m-dimensional vector y_t, which forms the root of the tree; C is a coefficient in (0, 0.5). We call this type of tree structure the first genetic programming tree, GP1 for short; one such tree is shown in Fig. 2. In effect, each tree is a local information extraction model.
(2) Input the context information matrix of each word in the sentence into the T initialized trees to form T y vectors; these vectors serve as the input to the next stage.
(3) The tree structures are optimized through self-evolution, finally learning an optimal tree structure so that the local information of the character is extracted as fully as possible (the self-evolution process of the trees is described in detail in the fourth step).
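A hedged sketch of such a GP1 tree, with a toy operator set and hand-written subtrees standing in for randomly initialized ones (all names and the operator pool are illustrative assumptions):

```python
import math

# Sketch of a GP1-style tree: each subtree is an expression whose variables
# are indices P[i][j] of the 5 x l context matrix; its output, scaled by a
# coefficient C in (0, 0.5), becomes one dimension of the vector y_t.

OPS = {'+': lambda a, b: a + b,
       '*': lambda a, b: a * b,
       'sin': lambda a: math.sin(a)}

def eval_tree(node, P):
    """node: ('idx', i, j) leaf, or (op, child, ...) internal node."""
    if node[0] == 'idx':
        return P[node[1]][node[2]]
    op = OPS[node[0]]
    return op(*(eval_tree(c, P) for c in node[1:]))

def gp1_output(subtrees, C, P):
    # Concatenate the scaled subtree outputs into the m-dim vector y_t.
    return [C * eval_tree(st, P) for st in subtrees]

P = [[0.5, 1.0], [2.0, 3.0], [1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # 5 x 2
subtrees = [('+', ('idx', 0, 0), ('idx', 1, 1)),   # P[0][0] + P[1][1]
            ('*', ('idx', 2, 0), ('idx', 1, 0))]   # P[2][0] * P[1][0]
y = gp1_output(subtrees, 0.4, P)                    # m = 2 dimensions
```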
2. Sequence labeling
For sequence data of the traditional Chinese medicine patent type, long-range dependencies often exist between words. Inspired by the learning process of LSTM (Long Short-Term Memory) networks, the invention proposes a genetic programming tree with long- and short-term memory capability.
Through the steps above, each word is represented as an m-dimensional vector y_t. T tree structures are again initialized, here called the second genetic programming tree, GP2 for short; the T outputs of the GP1 trees correspond to the inputs of GP2, but GP2 differs clearly from GP1.
The structure of GP2 is described below:
(1) For the labeling of each word, GP2 has three inputs (i.e., three types of leaf nodes in the tree): the word's own input y_t, the output h_{t-1} of the previous word, and the memory cell C_{t-1} produced by the previous step; the leaf nodes take the values of the corresponding dimensions of these three vectors (for the first character, h_{t-1} and C_{t-1} are given randomly). The three inputs are fed into the upper-layer nodes of the tree in a fully connected manner;
(2) The output of GP1 is an m-dimensional vector, while GP2 has two types of outputs: h_t and C_t. h_t is the learned information for one dimension of the current word vector; h_t is fed, respectively, to the following three places:
a. input to higher layers;
b. as input for the next word;
c. as input for the next iteration.
C_t is the long-term memory cell, storing long-range dependency information across the word sequence. Compared with the traditional gate-based LSTM model, the GP-tree-based learning process can learn more complex functional relationships to enrich the memory cell C_t.
(3) In addition, the intermediate nodes of GP2 differ from those of GP1: GP1's intermediate nodes are randomly given operators and elementary functions, while GP2 additionally includes three activation functions commonly used in deep learning: sigmoid, tanh, and relu.
A simple illustration of the sequence learning (FIG. 3) and of the GP2 used in the learning process (FIG. 4) are given below. As can be seen from FIG. 3, the h_t output to the higher layer is transformed by softmax into the probabilities of all labels, and the label with the maximum probability is taken as the label of the word. GP2 is likewise learned through a form of self-evolution.
For example, if the final output for "Goujin" is [0.6, 0.1, 0.3], i.e., the probabilities of the labels "B", "I", and "O" are 0.6, 0.1, and 0.3 respectively, then the predicted probability of "B" is clearly the largest, and the model outputs the label "B".
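The softmax step in this example can be sketched as follows (the raw scores fed to softmax are made-up illustrative numbers, not values from the patent):

```python
import math

# Sketch: turn the per-label scores produced for a word into probabilities
# with softmax and pick the argmax label, as in the "B"/"I"/"O" example above.

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

LABELS = ['B', 'I', 'O']

def predict(scores):
    probs = softmax(scores)
    return LABELS[probs.index(max(probs))], probs

label, probs = predict([2.0, 0.5, 1.0])   # highest raw score -> label "B"
```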
The fourth step: function of adaptive value
T models are formed through the process, each model gives a corresponding label, a reasonable adaptive value function needs to be given for judging the quality of the model, and the probability that an individual (model) with a larger adaptive value is inherited to the next generation is larger.
The cross entropy is always used for measuring the difference information between two distributions, and for each sentence, when the learned label is more similar to the real label distribution, the corresponding cross entropy is also smaller. In the genetic operation, the larger the individual adaptive value is expected to be, the stronger the adaptive capacity is, in order to accord with the survival rule of a suitable person, the negative value of the cross entropy is supposed to be adopted as the adaptive value function, and in order to prevent overfitting, the width and the depth of the tree are constrained by the method. Thus, the corresponding fitness function is:
Figure BDA0002258328920000041
pjiis the probability of the true mark being,is the corresponding prediction probability; n is a radical ofTkOf GPkDepth, DTkFor the width of GPk, k is 1 or 2.
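A minimal sketch of this fitness computation; the penalty weights lam_d and lam_w are our own assumption, since the source only states that tree depth and width are constrained:

```python
import math

# Sketch: fitness = negative cross entropy between true and predicted label
# distributions, minus penalties on tree depth and width.

def fitness(true_dists, pred_dists, depth, width, lam_d=0.01, lam_w=0.01):
    ce = 0.0
    for p, q in zip(true_dists, pred_dists):        # one pair per word
        ce -= sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)
    return -ce - lam_d * depth - lam_w * width

true_dists = [[1, 0, 0], [0, 0, 1]]                  # gold labels "B", "O"
pred_dists = [[0.6, 0.1, 0.3], [0.2, 0.2, 0.6]]
f = fitness(true_dists, pred_dists, depth=5, width=8)
```

A prediction closer to the gold distribution yields a larger (less negative) fitness, which is what the selection step rewards.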
The evolution process of the tree structures: the initialized trees (tree depth is limited to at most 10) evolve into the final tree structure through repeated selection, crossover, and mutation operations:
Selection operator: roulette-wheel selection is used; the m trees with the largest fitness values enter the next generation, and trees with larger fitness values are selected with higher probability;
Crossover operator: a subtree is randomly selected in each of two individuals, and the positions of the two subtrees are swapped;
Mutation operator: symbols or subtrees in a tree are randomly transformed with a probability of 1%.
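A hedged sketch of two of these operators, roulette-wheel selection and point mutation; the tree encoding and operator pool are illustrative assumptions, and fitness values are shifted to be positive before weighting since the negative cross entropy can be below zero:

```python
import random

# Sketch: roulette-wheel selection proportional to (shifted) fitness,
# plus a 1% point mutation that swaps operator symbols in a tree.

def roulette_select(population, fitnesses, k):
    low = min(fitnesses)
    weights = [f - low + 1e-6 for f in fitnesses]   # shift: fitness may be < 0
    return random.choices(population, weights=weights, k=k)

def mutate(tree, ops=('+', '-', '*'), rate=0.01):
    """tree: nested tuples as in GP1; leaves ('idx', i, j) are left alone."""
    if tree[0] == 'idx':
        return tree
    op = random.choice(ops) if random.random() < rate else tree[0]
    return (op,) + tuple(mutate(c, ops, rate) for c in tree[1:])

random.seed(0)
pop = ['t1', 't2', 't3']                            # stand-ins for trees
chosen = roulette_select(pop, [-1.2, -0.3, -2.5], k=3)
```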
The fifth step: selecting the optimal model. After model training is finished, the final model is validated on a validation set, and the optimal tree structures (GP1 and GP2) on the validation set are selected as the final model.
The sixth step: testing the learned model on a test set.
According to the method, document features are extracted automatically through the model's autonomous learning, so named entity labeling is achieved from the extracted feature information without excessive manual involvement and with few parameters. In extracting word information, both the context of each word and the dependency relationships between words are considered, so information extraction is more complete. Searching with the GP algorithm can automatically discover more complex structures, labeling more named entities and improving the algorithm's performance. Compared with current mainstream deep learning methods, the approach has fewer parameters and is easy to operate, and it avoids the gradient vanishing and gradient explosion problems of deep learning algorithms.
Drawings
FIG. 1 is the context information matrix representation of the word "xi";
FIG. 2 is the local information extraction tree GP1 of a word;
FIG. 3 is a learning process for a text sequence;
FIG. 4 is the GP2 structure in the sequence model;
fig. 5 is a model learning flowchart.
Detailed Description
A GP-based automatic identification method for named entities of traditional Chinese medicine patents is characterized by comprising the following steps:
the first step is as follows: preparing data: data cleaning, namely manually marking named entities in Chinese patent documents; the first word of a named entity is marked "B" (begin), the second word of the named entity and the remaining words are marked "I" (inside), and words that are not named entities are marked "O" (out). An example of a labeled training data portion is shown in FIG. 1, with labeled words on the left and corresponding label forms for each word on the right, where "Goji" is labeled as the named entity.
The second step: structured representation of the data: converting the "Chinese characters" in the training data into vector form by a character embedding method, each "Chinese character" being embedded as an l-dimensional vector. The specific process is as follows:
1. First, each Chinese character c_i in the training data, i = 1, 2, ..., n, is expressed in one-hot vector form (i.e., for the m distinct characters in the training data, each character corresponds to one dimension of the index; for the i-th character, only the i-th dimension is 1 and all other dimensions are 0):
"carry": [1,0,0,0,......,0]
Taking: [0,1,0,0,......,0]
"from": [0,0,1,0,......,0]
'Qi': [0,0,0,1,......,0]
......
2. Taking the one-hot-represented "Chinese characters" as input, a word2vec model trains them into densely distributed character vectors carrying semantic relations; the vector length is l, and the new "Chinese character" vector in a sentence is denoted x_i = [x_i1, x_i2, ..., x_il], i = 1, 2, ..., n;
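A minimal sketch of this two-stage representation; the randomly filled embedding table stands in for the trained word2vec vectors, and the characters and dimension l = 4 are illustrative assumptions:

```python
import random

# Sketch: map each distinct character to a one-hot index, then look up a
# dense l-dimensional vector from an embedding table. In the patent the
# table comes from training word2vec on the corpus; here it is randomly
# filled just to illustrate the lookup.

chars = ['枸', '杞', '子']                 # distinct characters (illustrative)
index = {c: i for i, c in enumerate(chars)}

def one_hot(c):
    v = [0] * len(chars)
    v[index[c]] = 1
    return v

random.seed(42)
l = 4                                      # embedding dimension (illustrative)
embed = {c: [random.uniform(-1, 1) for _ in range(l)] for c in chars}

oh = one_hot('杞')                         # sparse one-hot representation
x = embed['杞']                            # dense l-dim character vector
```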
The third step: model learning process based on GP algorithm
Training proceeds sentence by sentence.
1. Local information extraction for words
A. Context information representation of words
(1) Concatenate the word vectors of the words in the sentence, in order, into matrix form; for example, a sentence of length s can be represented as an s x l matrix.
(2) For each word in the sentence, take the word vectors of the two preceding and the two following words as context information (when the current word lacks two preceding or following words, pad with zero vectors), so that each word is represented as a 5 x l matrix whose entries are indexed P_11, P_12, ..., P_5l; Figure 1 gives the context information matrix representation of the word "xi".
B. Local information extraction for words
By learning a tree structure, a word is converted, based on its context information matrix, into a word vector containing local information, denoted y_t = [y_t1, y_t2, ..., y_tm], t = 1, 2, ..., s, where the invention assumes s words in the sentence and the resulting local-information "word vector" has dimension m.
The specific process is as follows:
(1) Randomly initialize T tree structures, each consisting of m subtrees. The leaf nodes of a tree are indices into the word's context information matrix, and the intermediate nodes are randomly given operators and elementary functions (each subtree is in fact a function expression whose variables are the indices at its leaves). The outputs of the m subtrees are each multiplied by C and concatenated into an m-dimensional vector y_t, which forms the root of the tree; C is a coefficient in (0, 0.5). We call this type of tree structure the first genetic programming tree, GP1 for short; one such tree is shown in Fig. 2. In effect, each tree is a local information extraction model.
(2) Input the context information matrix of each word in the sentence into the T initialized trees to form T y vectors; these vectors serve as the input to the next stage.
(3) The tree structures are optimized through self-evolution, finally learning an optimal tree structure so that the local information of the character is extracted as fully as possible (the self-evolution process of the trees is described in detail in the fourth step).
2. Sequence labeling
For sequence data of the traditional Chinese medicine patent type, long-range dependencies often exist between words. Inspired by the learning process of LSTM (Long Short-Term Memory) networks, the invention proposes a GP tree with long- and short-term memory capability.
Through the steps above, each word is represented as an m-dimensional vector y_t. T tree structures are again initialized, here called the second genetic programming tree, GP2 for short; the T outputs of the GP1 trees correspond to the inputs of GP2, but GP2 differs clearly from GP1.
The structure of GP2 is described below:
(1) For the labeling of each word, GP2 has three inputs (i.e., three types of leaf nodes in the tree): the word's own input y_t, the output h_{t-1} of the previous word, and the memory cell C_{t-1} produced by the previous step; the leaf nodes take the values of the corresponding dimensions of these three vectors (for the first character, h_{t-1} and C_{t-1} are given randomly). The three inputs are fed into the upper-layer nodes of the tree in a fully connected manner;
(2) The output of GP1 is an m-dimensional vector, while GP2 has two types of outputs: h_t and C_t. h_t is the learned information for one dimension of the current word vector; h_t is fed, respectively, to the following three places:
a. input to higher layers;
b. as input for the next word;
c. as input for the next iteration.
C_t is the long-term memory cell, storing long-range dependency information across the word sequence. Compared with the traditional gate-based LSTM model, the GP-tree-based learning process can learn more complex functional relationships to enrich the memory cell C_t.
(3) In addition, the intermediate nodes of GP2 differ from those of GP1: GP1's intermediate nodes are randomly given operators and elementary functions, while GP2 additionally includes three activation functions commonly used in deep learning: sigmoid, tanh, and relu.
FIGS. 3 and 4 show, respectively, a simple illustration of the learning of the text sequence and of the GP2 used in the learning process. As can be seen from FIG. 3, the h_t output to the higher layer is transformed by softmax into the probabilities of all labels, and the label with the maximum probability is taken as the label of the word. GP2 is likewise learned through a form of self-evolution.
For example, if the final output for "Goujin" is [0.6, 0.1, 0.3], i.e., the probabilities of the labels "B", "I", and "O" are 0.6, 0.1, and 0.3 respectively, then the predicted probability of "B" is clearly the largest, and the model outputs the label "B".
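A hedged illustration of the recurrence just described; the tanh-based combining functions below merely stand in for whatever functional forms GP2 would evolve, and are not the patent's actual trees:

```python
import math

# Sketch of the GP2 recurrence: for each word, a learned tree combines the
# current input y_t, the previous output h_{t-1}, and the memory cell C_{t-1}
# into new (h_t, C_t).

def step(y_t, h_prev, c_prev):
    c_t = [math.tanh(ci + yi) for ci, yi in zip(c_prev, y_t)]   # stand-in C-tree
    h_t = [math.tanh(hi + ci) for hi, ci in zip(h_prev, c_t)]   # stand-in h-tree
    return h_t, c_t

def run_sentence(ys, dim):
    h, c = [0.0] * dim, [0.0] * dim    # h_0, C_0 given (randomly, per the text)
    outputs = []
    for y_t in ys:
        h, c = step(y_t, h, c)
        outputs.append(h)               # each h_t also feeds softmax labelling
    return outputs

ys = [[0.5, -0.2], [1.0, 0.3]]          # m=2 dim vectors coming from GP1
outs = run_sentence(ys, dim=2)
```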
The fourth step: function of adaptive value
T models are formed through the process, each model gives a corresponding label, a reasonable adaptive value function needs to be given for judging the quality of the model, and the probability that an individual (model) with a larger adaptive value is inherited to the next generation is larger.
The cross entropy is always used for measuring the difference information between two distributions, and for each sentence, when the learned label is more similar to the real label distribution, the corresponding cross entropy is also smaller. In the genetic operation, the larger the individual adaptive value is expected to be, the stronger the adaptive capacity is, in order to accord with the survival rule of a suitable person, the negative value of the cross entropy is supposed to be adopted as the adaptive value function, and in order to prevent overfitting, the width and the depth of the tree are constrained by the method. Thus, the corresponding fitness function is:
Figure BDA0002258328920000081
pjiis the probability of the true mark being,
Figure BDA0002258328920000082
is the corresponding prediction probability; n is a radical ofTkIs the depth of GPk, DTkFor the width of GPk, k is 1 or 2.
The evolution process of the tree structures: the initialized trees (tree depth is limited to at most 10) evolve into the final tree structure through repeated selection, crossover, and mutation operations:
Selection operator: roulette-wheel selection is used; the m trees with the largest fitness values enter the next generation, and trees with larger fitness values are selected with higher probability;
Crossover operator: a subtree is randomly selected in each of two individuals, and the positions of the two subtrees are swapped;
Mutation operator: symbols or subtrees in a tree are randomly transformed with a probability of 1%.
The fifth step: selecting the optimal model. After model training is finished, the final model is validated on a validation set, and the optimal tree structures (GP1 and GP2) on the validation set are selected as the final model.
The sixth step: testing the learned model on a test set.
The overall algorithm flow is shown in fig. 5.
Model flow description:
step 1: preparing data;
step 2: the data is structurally represented in a vector form;
step 3: circulating sentence by sentence;
step 4: respectively initializing T first genetic programming trees and T second genetic programming trees, and giving selection, intersection and variation parameters required in an algorithm;
step 5: extracting local information of the words in the data;
step 6: marking the characters in the sentence according to the initialized genetic programming tree;
step 7: calculating an adaptive value of the genetic programming tree according to the marking information of the characters;
step 8: judging whether the algorithm meets the termination condition; if so, terminating the algorithm, and if not, entering step 9.
step 9: performing selection-crossover-mutation operations on the first genetic programming tree and the second genetic programming tree respectively to form a new population, and returning to step 5.
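The flow above can be sketched as a generic evolutionary loop; evaluate() is a hypothetical stand-in for labelling a sentence and computing the fitness, and the loop body stands in for the selection-crossover-mutation step (all names are our own):

```python
import random

# Sketch of the overall flow: initialize a population of models, score each
# on the data, then select (and, in the full method, cross and mutate) until
# the termination condition, here a fixed generation budget, is met.

def evolve(init_population, evaluate, generations=10, seed=0):
    random.seed(seed)
    pop = list(init_population)
    for _ in range(generations):                     # termination condition
        fits = [evaluate(ind) for ind in pop]
        low = min(fits)
        weights = [f - low + 1e-6 for f in fits]     # shift: fitness may be < 0
        pop = random.choices(pop, weights=weights, k=len(pop))  # selection
        # (crossover and mutation of the GP trees would be applied here)
    fits = [evaluate(ind) for ind in pop]
    return pop[fits.index(max(fits))]

# Toy stand-in: individuals are numbers, fitness prefers larger values.
best = evolve([1, 5, 3, 2], evaluate=lambda x: float(x), generations=20)
```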

Claims (3)

1. A GP-based automatic recognition method for named entities of traditional Chinese medicine patents, characterized in that automatic extraction of document features is achieved through autonomous learning of a model, and named entity labeling is then achieved according to the extracted feature information.
2. The GP-based automatic identification method for named entities of traditional Chinese medicine patents according to claim 1, comprising the following steps:
(1) data cleaning, namely manually marking named entities in Chinese patent documents;
(2) converting "Chinese characters" in the training data into vector form by a character embedding method, each "Chinese character" being embedded into an l-dimensional vector;
(3) model learning based on GP algorithm;
(4) the fitness function is:

F = -Σ_j Σ_i p_ji log(p̂_ji) - λ1·N_Tk - λ2·D_Tk

where p_ji is the probability of the true label, p̂_ji is the corresponding predicted probability, N_Tk is the depth of GPk, D_Tk is the width of GPk, k is 1 or 2, and λ1, λ2 are penalty coefficients on tree depth and width;
(5) after the model training is finished, verifying the final model through a verification set, and selecting the optimal tree structure under the verification set as the final model;
(6) the final model is tested on the test set.
3. The GP-based automatic recognition method for named entities of traditional Chinese medicine patents according to claim 2, wherein a model learning process based on GP algorithm comprises the following steps:
Step 1: separately initialize T first genetic programming trees and T second genetic programming trees, and give the selection, crossover, and mutation parameters required by the algorithm;
Step 2: represent the words in the sentence in matrix form containing context information;
Step 3: extract the local information of the words in the data to form new word vectors;
Step 4: input the new word vectors, in order, into the second genetic programming tree, and label the words in the sentence;
Step 5: calculate the fitness values of the genetic programming trees from the words' label information;
Step 6: judge whether the algorithm meets the termination condition; if so, end training on the current sentence and go to Step 2 to train the next sentence until all sentences are trained; if not, go to Step 7;
Step 7: perform selection-crossover-mutation operations on the first genetic programming tree and the second genetic programming tree respectively to form a new population, and return to Step 5.
CN201911062344.4A 2019-11-02 2019-11-02 GP-based automatic identification method for named entities of traditional Chinese medicine patents Withdrawn CN110826332A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911062344.4A CN110826332A (en) 2019-11-02 2019-11-02 GP-based automatic identification method for named entities of traditional Chinese medicine patents


Publications (1)

Publication Number Publication Date
CN110826332A true CN110826332A (en) 2020-02-21

Family

ID=69552223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911062344.4A Withdrawn CN110826332A (en) 2019-11-02 2019-11-02 GP-based automatic identification method for named entities of traditional Chinese medicine patents

Country Status (1)

Country Link
CN (1) CN110826332A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112988733A (en) * 2021-04-16 2021-06-18 北京妙医佳健康科技集团有限公司 Method and device for improving and enhancing data quality
CN112988733B (en) * 2021-04-16 2021-08-27 北京妙医佳健康科技集团有限公司 Method and device for improving and enhancing data quality

Similar Documents

Publication Publication Date Title
CN109783817B (en) Text semantic similarity calculation model based on deep reinforcement learning
CN110334354B (en) Chinese relation extraction method
WO2023024412A1 (en) Visual question answering method and apparatus based on deep learning model, and medium and device
CN107943784B (en) Relationship extraction method based on generation of countermeasure network
CN111241294B (en) Relationship extraction method of graph convolution network based on dependency analysis and keywords
CN112883738A (en) Medical entity relation extraction method based on neural network and self-attention mechanism
CN108549658B (en) Deep learning video question-answering method and system based on attention mechanism on syntax analysis tree
CN109657239A (en) The Chinese name entity recognition method learnt based on attention mechanism and language model
CN112487143A (en) Public opinion big data analysis-based multi-label text classification method
CN110222163A (en) A kind of intelligent answer method and system merging CNN and two-way LSTM
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN110188195B (en) Text intention recognition method, device and equipment based on deep learning
CN109063164A (en) A kind of intelligent answer method based on deep learning
CN113435211B (en) Text implicit emotion analysis method combined with external knowledge
CN110516070B (en) Chinese question classification method based on text error correction and neural network
CN111027595A (en) Double-stage semantic word vector generation method
CN111400494B (en) Emotion analysis method based on GCN-Attention
CN111222318B (en) Trigger word recognition method based on double-channel bidirectional LSTM-CRF network
CN111738007A (en) Chinese named entity identification data enhancement algorithm based on sequence generation countermeasure network
CN110298044B (en) Entity relationship identification method
CN110472062B (en) Method and device for identifying named entity
CN113220876B (en) Multi-label classification method and system for English text
CN115130538A (en) Training method of text classification model, text processing method, equipment and medium
CN113282721A (en) Visual question-answering method based on network structure search
CN112989833A (en) Remote supervision entity relationship joint extraction method and system based on multilayer LSTM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200221
