CN113065350A

CN113065350A - Biomedical text word sense disambiguation method based on attention neural network

Info

Publication number: CN113065350A
Application number: CN202110395920.8A
Authority: CN
Inventors: 逄淑阳; 张春祥; 王明磊
Original assignee: Harbin University of Science and Technology
Current assignee: Harbin University of Science and Technology
Priority date: 2021-04-13
Filing date: 2021-04-13
Publication date: 2021-07-02

Abstract

The invention relates to a biomedical text word sense disambiguation method based on an attention mechanism, an Asymmetric Convolutional Neural Network (ACNN) and a Bidirectional Long Short-Term Memory network (Bi-LSTM). First, the biomedical MSH corpus is processed: word segmentation, part-of-speech tagging and semantic tagging are performed on English sentences containing ambiguous words to obtain a processed training corpus and test corpus. The model is then trained on the training corpus to obtain an optimized attention neural network model. The optimized model disambiguates the test corpus, yielding the probability distribution of each ambiguous word over its semantic categories; the semantic category with the highest probability is taken as the semantic category of the ambiguous word. The invention disambiguates biomedical ambiguous words well and judges their true meaning more accurately.

Description

Biomedical text word sense disambiguation method based on attention neural network

Technical field:

The invention relates to a biomedical text word sense disambiguation method based on an attention neural network, and belongs to the field of natural language processing.

Background art:

The volume of biomedical text is now so large that automated tools are needed to process it efficiently. However, automated processing of biomedical text is difficult, because many ambiguous words exist in the biomedical field. Determining the semantic categories of biomedical words facilitates the automatic processing of biomedical articles. At present, biomedical word sense disambiguation is widely applied in biomedical natural language processing tasks such as text indexing, text classification and named entity extraction.

Biomedical word sense disambiguation methods can be divided into three categories: supervised, unsupervised and knowledge-based approaches. In a supervised approach, classifiers are trained on labeled datasets, using lexical and syntactic information from the context, to predict the correct sense of a biomedical word in a test dataset. In an unsupervised approach, unlabeled biomedical text is used to select a sense for the biomedical word. In a knowledge-based approach, thesauri and semantic tables are employed to determine the semantic category of the biomedical word. In recent years, deep learning algorithms such as convolutional neural networks and recurrent neural networks have been widely applied to biomedical word sense disambiguation. In a convolutional neural network, the weights of the neurons are shared, which reduces the complexity of the network model and helps prevent over-fitting. Recurrent neural networks are very effective for text processing. Deep learning algorithms can therefore be applied to disambiguate biomedical ambiguous words and assign them the correct semantic category.

Summary of the invention:

The invention discloses a biomedical text word sense disambiguation method based on an attention neural network, aiming to solve the problem of word ambiguity in the field of natural language processing.

Therefore, the invention provides the following technical scheme:

1. An attention neural network-based biomedical text word sense disambiguation method, comprising the following steps:

Step 1: Perform word segmentation, part-of-speech tagging and semantic information tagging on all sentences in the MSH corpus that contain biomedical ambiguous words, and select the word form, part of speech and semantic information of the four word units adjacent to the biomedical ambiguous word (two on the left and two on the right) as disambiguation features.

Step 2: Extract the word form, part of speech and semantic information of the four word units adjacent to the biomedical ambiguous word, and train Word2vec on the processed corpus to generate the corresponding word vectors. Select a small portion of the resulting sentences as test data and use the rest as training data.

Step 3: Training comprises two processes, forward propagation and back propagation. The training data is used as input to train the attention neural network model, and the optimized attention neural network model is obtained through this training.

Step 4: The testing process is a forward propagation process, i.e. a semantic classification process. The test data is input into the optimized attention neural network model, and the probability distribution of the biomedical ambiguous word over the semantic categories is calculated; the semantic category with the maximum probability is the semantic category of the biomedical ambiguous word.

2. The biomedical text word sense disambiguation method based on the attention neural network as claimed in claim 1, wherein in step 1, word segmentation, part-of-speech tagging and semantic information tagging are performed on an English sentence and disambiguation features are extracted, specifically comprising the following steps:

Step 1-1: Segment the English sentence into words at the spaces in the sentence;

Step 1-2: Perform part-of-speech tagging on the segmented words using a part-of-speech tagging tool;

Step 1-3: Perform semantic tagging on the segmented words using a semantic tagging tool;

Part-of-speech tagging and semantic tagging are performed on all English sentences contained in the corpus using an English part-of-speech tagging tool and an English semantic tagging tool, and the word form, part of speech and semantic information of the four word units adjacent to the biomedical ambiguous word (two on the left and two on the right) are selected as disambiguation features.
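The three tagging steps and the context window can be sketched as follows. This is a minimal illustration, not the patented implementation: the tag tables `POS_TAGS` and `SENSE_TAGS` are hypothetical lookup tables filled with the values from the worked "ADA" example in this document, standing in for the part-of-speech and semantic tagging tools, which the source does not name. The helper names `annotate` and `context_features` are likewise illustrative.

```python
# Hypothetical tag tables copied from the worked example in this document;
# a real pipeline would call a POS tagger and a sense tagger instead.
POS_TAGS = {"A": "DT", "message": "NN", "from": "IN",
            "ADA": "NNP", "president": "NN", "Feldman": "NNP"}
SENSE_TAGS = {"A": "angstrom.n.01", "message": "message.n.01",
              "ADA": "adenosine_deaminase.n.01",
              "president": "president.n.01"}


def annotate(sentence):
    """Split on spaces, then attach a POS tag and a sense tag to each word."""
    words = sentence.rstrip(".").split(" ")
    # Words without a sense entry get "-1", as in the worked example.
    return [(w, POS_TAGS.get(w, "UNK"), SENSE_TAGS.get(w, "-1")) for w in words]


def context_features(units, target, per_side=2):
    """Return the word units adjacent to `target`, `per_side` on each side."""
    idx = next(i for i, unit in enumerate(units) if unit[0] == target)
    left = units[max(0, idx - per_side):idx]
    right = units[idx + 1:idx + 1 + per_side]
    return left + right


units = annotate("A message from ADA president Feldman.")
window = context_features(units, "ADA")
# Flatten (word form, POS, sense) triples into one list of disambiguation features.
features = [field for unit in window for field in unit]
print(len(features))  # 4 word units x 3 fields each
```

Four word units with three fields each give the 12 disambiguation features that the embodiment reports for this sentence.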

3. The biomedical text word sense disambiguation method based on the attention neural network as claimed in claim 1, wherein in step 2, Word2vec is trained on the biomedical MSH corpus to generate the corresponding word vectors, specifically comprising the following steps:

Step 2-1: Extract the word form, part of speech and semantic information of the four word units adjacent to the biomedical ambiguous word (two on the left and two on the right);

Step 2-2: Use the CBOW model in Word2vec to obtain the word vector corresponding to each disambiguation feature; select a small portion of the processed sentences as test data and use the rest as training data.
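The hold-out split in step 2-2 can be sketched as below. The source only says that "a small portion" of the sentences is used for testing, so the 10% fraction, the fixed seed and the function name `split_corpus` are assumptions made for illustration.

```python
import random


def split_corpus(sentences, test_fraction=0.1, seed=0):
    """Shuffle a copy of the corpus and hold out a small part for testing."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (training data, test data)


corpus = [f"sentence {i}" for i in range(50)]  # placeholder sentences
train, test = split_corpus(corpus)
print(len(train), len(test))
```

Shuffling before splitting avoids holding out only the sentences that happen to come last in the corpus file.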

4. The biomedical text word sense disambiguation method based on the attention neural network as claimed in claim 1, wherein in step 3, the attention neural network model is trained, specifically comprising the following steps:

Forward propagation process:

Step 3-1: Input the training data into the initialized attention neural network model;

Step 3-2: Extract disambiguation features through the attention layer, dynamically capturing the relations between words;

Step 3-3: Extract further disambiguation features through the asymmetric convolution layer. Asymmetric convolution obtains different feature information from convolution kernels of different sizes while reducing the amount of computation, which speeds up the model and effectively prevents over-fitting;

Step 3-4: Acquire effective feature information from the forward network and the backward network through the bidirectional long short-term memory layer, concatenate the information and input it into the fully connected layer, which reduces the dimensionality of the extracted disambiguation features and joins them into a one-dimensional disambiguation feature vector;

Step 3-5: Use the softmax layer to calculate the probability of the biomedical ambiguous word m under each semantic category s_i (i = 1, 2, ..., n). The softmax function is as follows:

$$P(s_i \mid m) = \frac{e^{a_i}}{\sum_{j=1}^{n} e^{a_j}}$$

where a_i represents the input data of the softmax layer and P(s_i | m) represents the probability of the biomedical ambiguous word m under semantic category s_i (i = 1, 2, ..., n).

Step 3-6: Select the maximum probability from P(s_1 | m), P(s_2 | m), ..., P(s_n | m) as the predicted probability:

$$y\_predicted_j = \max\bigl(P(s_1 \mid m), P(s_2 \mid m), \ldots, P(s_n \mid m)\bigr)$$

where y_predicted_j represents the predicted probability of the biomedical ambiguous word m.

Step 3-7: Compare the predicted probability y_predicted_j with the true probability y_j, and calculate the error loss using the cross-entropy loss function.

The error loss is calculated as follows:

$$loss = -\bigl(y_j \log(y\_predicted_j) + (1 - y_j)\log(1 - y\_predicted_j)\bigr)$$

where y_j represents the true probability that the biomedical ambiguous word m belongs to semantic category s_i.

Back propagation process:

The error loss is propagated backward and the parameters are updated layer by layer. The parameter update is as follows:

$$\theta' = \theta - \alpha \frac{\partial\, loss}{\partial \theta}$$

where θ denotes the parameter set, θ' denotes the updated parameter set, and α denotes the learning rate.

The attention neural network model is iterated continuously to obtain the optimized attention neural network model.
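The forward and backward passes described in this claim can be illustrated on the last layer alone. The sketch below is not the patented network: it omits the attention, asymmetric convolution and Bi-LSTM layers and trains only a single linear layer followed by softmax. It also assumes a one-hot true label and the common categorical form of cross-entropy (the negative log-probability of the true category). The names `sgd_step`, `weights`, `features` and the learning rate of 0.1 are hypothetical.

```python
import math


def softmax(scores):
    """P(s_i | m) = exp(a_i) / sum_j exp(a_j), shifted by the max for stability."""
    top = max(scores)
    exps = [math.exp(a - top) for a in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index):
    """Negative log-probability assigned to the true semantic category."""
    return -math.log(probs[true_index])


def sgd_step(weights, features, true_index, lr=0.1):
    """One forward and backward pass for a linear layer followed by softmax."""
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    probs = softmax(scores)
    loss = cross_entropy(probs, true_index)
    for i, row in enumerate(weights):
        # For softmax with cross-entropy, d(loss)/d(a_i) = P(s_i | m) - y_i.
        grad_score = probs[i] - (1.0 if i == true_index else 0.0)
        for k, x in enumerate(features):
            row[k] -= lr * grad_score * x  # theta' = theta - alpha * gradient
    return loss


weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # two semantic categories
features = [1.0, 0.5, -0.5]                    # toy disambiguation feature vector
losses = [sgd_step(weights, features, true_index=0) for _ in range(20)]
print(losses[0] > losses[-1])
```

With all weights at zero the two categories start at probability 0.5 each, so the first loss equals log 2; repeated updates then shift probability toward the true category and the loss falls.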

5. The biomedical text word sense disambiguation method based on the attention neural network as claimed in claim 1, wherein in step 4, the biomedical ambiguous word m is semantically classified as follows:

Semantic classification process:

Step 4-1: Input the test data into the optimized attention neural network model;

Step 4-2: Dynamically capture the relations between words through the attention layer;

Step 4-3: Extract more effective information, with a reduced amount of computation, through the asymmetric convolution layer;

Step 4-4: Acquire information from the forward network and the backward network through the bidirectional long short-term memory layer, concatenate it and pass it to the fully connected layer, which reduces the dimensionality of the extracted disambiguation features and joins them into a one-dimensional disambiguation feature vector;

Step 4-5: Use the softmax layer to calculate the probability distribution of the biomedical ambiguous word m over the semantic categories. The semantic category s' with the maximum probability is the semantic category of the biomedical ambiguous word.

The semantic category s' is determined as follows:

$$s' = \arg\max_{i = 1, \ldots, n} P(s_i \mid m)$$

where s' represents the semantic category with the highest probability, n represents the number of semantic categories, and P(s_1 | m), ..., P(s_i | m), ..., P(s_n | m) represents the probability distribution of the biomedical ambiguous word m over the semantic categories.
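The final decision rule is a plain arg-max over the softmax output, which can be sketched as below. The function name `classify` and the dictionary layout are illustrative. The 0.9447 value is the prediction probability reported for the "ADA" example in the embodiment; the complementary 0.0553 is an assumption that follows from the two softmax outputs summing to 1.

```python
def classify(probabilities):
    """Return the semantic category with the highest probability."""
    return max(probabilities, key=probabilities.get)


# Illustrative softmax output for the ambiguous word "ADA".
p_ada = {"American Dental Association": 0.9447, "Adenosine Deaminase": 0.0553}
print(classify(p_ada))  # American Dental Association
```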

Advantages:

1. The invention relates to a biomedical text word sense disambiguation method based on an attention neural network. Word segmentation, part-of-speech tagging and semantic information tagging are performed on English sentences. Based on the biomedical MSH corpus, word vectors are obtained with Word2vec, and the trained word vectors are used as disambiguation features. The extracted disambiguation features are therefore of higher quality.

2. The model used by the invention mainly comprises an attention mechanism, an asymmetric convolutional neural network and a bidirectional long short-term memory network. The attention mechanism dynamically captures the relations between words. The asymmetric convolutional neural network retains the local perception and parameter sharing of convolutional neural networks while reducing the amount of computation, which speeds up training, and it handles high-dimensional data well. The bidirectional long short-term memory network acquires effective information from both the forward and backward directions and works well for text processing. Once the attention neural network model has been trained, a good classification effect can be obtained.

3. The classifier used by the invention is a softmax classifier, which handles both binary classification and multi-class classification.

4. During model training, parameters are updated by stochastic gradient descent. The error is calculated and then propagated back along the original path: it passes from the output layer back through each intermediate hidden layer, the parameters of each layer are updated layer by layer, and it finally reaches the input layer. Forward propagation and back propagation are repeated to reduce the error and update the model parameters until the attention neural network model is trained. As the parameters are updated with each back propagation of the error, the disambiguation accuracy of the whole attention neural network model on the input data improves.
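The computation saving claimed for asymmetric convolution in advantage 2 can be illustrated with a small sketch. The source does not give kernel sizes or define the layer precisely, so this assumes the common reading in which a k×k kernel is replaced by a k×1 kernel followed by a 1×k kernel. The equivalence shown holds only for kernels that factor into such an outer product; the function `conv2d` and the example values are hypothetical.

```python
def conv2d(image, kernel):
    """Valid 2-D cross-correlation on lists of lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


column = [[1], [2], [1]]   # 3x1 kernel
row = [[1, 0, -1]]         # 1x3 kernel
# Outer product of the two: the equivalent full 3x3 kernel.
full = [[c[0] * r for r in row[0]] for c in column]

image = [[(3 * r + 5 * c) % 7 for c in range(6)] for r in range(5)]

two_pass = conv2d(conv2d(image, column), row)  # asymmetric pair
one_pass = conv2d(image, full)                 # single square kernel
print(two_pass == one_pass)

weights_full = 3 * 3   # weights in one 3x3 kernel
weights_asym = 3 + 3   # weights in the 3x1 and 1x3 pair
```

The pair uses 6 weights where the square kernel uses 9, and the gap widens as k grows (2k against k squared), which is the source of the reduced computation.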

Description of the drawings:

FIG. 1 is a flowchart of a biomedical text word sense disambiguation method based on an attention neural network according to an embodiment of the present invention.

FIG. 2 is a training process of a biomedical text word sense disambiguation method based on an attention neural network according to an embodiment of the present invention.

FIG. 3 is a testing process of a biomedical text word sense disambiguation method based on an attention neural network according to an embodiment of the present invention.

Detailed description:

In order to describe the technical solutions in the embodiments of the present invention clearly and completely, the present invention is described in further detail below with reference to the drawings in the embodiments.

Take as an example the disambiguation of the ambiguous word "ADA" in the English sentence "A message from ADA president Feldman".

The flow chart of the biomedical text word sense disambiguation method based on the attention neural network, disclosed by the embodiment of the invention, is shown in fig. 1 and comprises the following steps.

Step 1: The disambiguation features are extracted as follows:

English sentence: A message from ADA president Feldman.

Step 1-1: The English sentence is segmented into words at the spaces in the sentence. The segmentation result is: A message from ADA president Feldman.

Step 1-2: Part-of-speech tagging is performed on the segmented words using a part-of-speech tagging tool. The part-of-speech tagging result is: A/DT message/NN from/IN ADA/NNP president/NN Feldman/NNP.

Step 1-3: Semantic tagging is performed on the segmented words using a semantic tagging tool. The semantic tagging result is: A/angstrom.n.01 message/message.n.01 from/-1 ADA/adenosine_deaminase.n.01 president/president.n.01 Feldman/-1.

The combined segmentation, part-of-speech tagging and semantic tagging result for the English sentence containing the biomedical ambiguous word "ADA" is: A/DT/angstrom.n.01 message/NN/message.n.01 from/IN/-1 ADA/NNP/adenosine_deaminase.n.01 president/NN/president.n.01 Feldman/NNP/-1.

Step 2: Word2vec is trained on the medical text to generate the disambiguation feature vectors.

Step 2-1: From the English sentence containing the biomedical ambiguous word "ADA", the four word units adjacent to the ambiguous word (two on the left and two on the right) are extracted, namely "message/NN/message.n.01", "from/IN/-1", "president/NN/president.n.01" and "Feldman/NNP/-1". Each unit contributes its word form, part of speech and semantic information, so a total of 12 disambiguation features are extracted.

Step 2-2: Each generated word vector has 100 dimensions, and concatenating the vectors of the 12 disambiguation features produces a 1200-dimensional feature vector.
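The dimension arithmetic in step 2-2 can be checked with a short sketch. A trained Word2vec model is not available here, so `toy_vector` is a hypothetical stand-in that returns a repeatable pseudo-random 100-dimensional vector per feature string; only the concatenation logic mirrors the text.

```python
import random

DIM = 100  # word-vector size stated in step 2-2


def toy_vector(feature, dim=DIM):
    """Stand-in for a trained Word2vec lookup: a repeatable pseudo-random vector."""
    rng = random.Random(feature)  # seeding by the feature string keeps it deterministic
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


# The 12 disambiguation features from step 2-1, in left-to-right order.
features = ["message", "NN", "message.n.01",
            "from", "IN", "-1",
            "president", "NN", "president.n.01",
            "Feldman", "NNP", "-1"]

input_vector = []
for feature in features:
    input_vector.extend(toy_vector(feature))  # append 100 values per feature

print(len(input_vector))  # 12 features x 100 dimensions = 1200
```

Because identical features map to the same vector, the two "NN" tags occupy identical 100-value slices of the input; position in the concatenation is what tells the model which word unit a feature came from.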

Step 3: The biomedical ambiguous word "ADA" has two semantic categories, namely American Dental Association and Adenosine Deaminase.

The training process and the testing process of the biomedical text word sense disambiguation method based on the attention neural network according to the embodiment of the invention are shown in FIG. 2 and FIG. 3, respectively. The specific steps are as follows:

Forward propagation process:

Step 3-1: The feature vector formed by concatenating the 12 disambiguation features is input, as training data, into the initialized attention neural network model;

Step 3-4: Effective feature information is acquired from the forward network and the backward network through the bidirectional long short-term memory layer, concatenated and passed to the fully connected layer, which reduces the dimensionality of the extracted disambiguation features and joins them into a one-dimensional disambiguation feature vector;

Step 3-5: The softmax layer is used to calculate the predicted probability of the biomedical ambiguous word "ADA" under the semantic categories "American Dental Association" and "Adenosine Deaminase";

The softmax function is calculated as follows:

$$P(s \mid \text{ADA}) = \frac{e^{a_s}}{\sum_{t \in S} e^{a_t}}, \qquad S = \{\text{American Dental Association},\ \text{Adenosine Deaminase}\}$$

where a_s represents the input data of the softmax layer for semantic category s, P(American Dental Association | ADA) represents the probability of the biomedical ambiguous word "ADA" under the semantic category "American Dental Association", and P(Adenosine Deaminase | ADA) represents the probability of the biomedical ambiguous word "ADA" under the semantic category "Adenosine Deaminase".

Step 3-6: The maximum of P(American Dental Association | ADA) and P(Adenosine Deaminase | ADA) is selected as the predicted probability.

$$y\_predicted = \max\bigl(P(\text{American Dental Association} \mid \text{ADA}),\ P(\text{Adenosine Deaminase} \mid \text{ADA})\bigr)$$

where y_predicted represents the predicted probability of the ambiguous word "ADA", which is 94.47%.

Step 3-7: The predicted probability y_predicted of the attention neural network is compared with the true probability y, and the error is calculated using the cross-entropy loss function.

The error is calculated as follows:

$$loss_{ADA} = -\bigl(y \log(y\_predicted) + (1 - y)\log(1 - y\_predicted)\bigr)$$

where loss_ADA represents the error for the biomedical ambiguous word "ADA".
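As a worked check of this loss, assume (the source does not state this) that the true category for the example sentence is American Dental Association, so that y = 1, and that log denotes the natural logarithm. Using the standard negative-log form of cross-entropy with the reported y_predicted = 0.9447:

```latex
loss_{ADA} = -\bigl(1 \cdot \ln 0.9447 + 0 \cdot \ln(1 - 0.9447)\bigr)
           = -\ln 0.9447
           \approx 0.0569
```

A confident correct prediction therefore yields a small error, and the error shrinks toward 0 as y_predicted approaches 1.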

Back propagation process:

The error loss_ADA is propagated backward and the parameters of each layer are updated layer by layer. The parameter update is as follows:

$$\theta'_{ADA} = \theta_{ADA} - \alpha \frac{\partial\, loss_{ADA}}{\partial \theta_{ADA}}$$

where θ_ADA denotes the parameter set for the biomedical ambiguous word "ADA", θ'_ADA denotes the updated parameter set, and α is the learning rate.

Step 4: Model testing, i.e. the semantic classification process, specifically comprises the following:

Step 4-5: The probability of the biomedical ambiguous word "ADA" under each semantic category is calculated through the softmax layer; the semantic category corresponding to the maximum probability is the semantic category of the ambiguous word.

The semantic category s' of the biomedical ambiguous word "ADA" is determined as follows:

$$s' = \arg\max_{s} P(s \mid \text{ADA})$$

where s' represents the semantic category assigned to the biomedical ambiguous word "ADA", which here is American Dental Association, and P(s | ADA) represents the probability distribution of the biomedical ambiguous word "ADA" over the semantic categories.

Through the attention neural network model, word sense disambiguation is performed on the English sentence "A message from ADA president Feldman", which contains the biomedical ambiguous word "ADA". Of the two candidate semantic categories, American Dental Association and Adenosine Deaminase, the model assigns "ADA" to American Dental Association.

According to the biomedical text word sense disambiguation method based on the attention neural network, accurate disambiguation characteristics can be selected, the semantic category of the biomedical ambiguous words can be determined by adopting the attention neural network model, and the accuracy is high.

The foregoing is a detailed description of embodiments of the invention, taken in conjunction with the accompanying drawings, wherein the specific embodiments are merely provided to assist in understanding the method of the invention. For those skilled in the art, the invention can be modified and adapted within the scope of the embodiments and applications according to the spirit of the present invention, and therefore the present invention should not be construed as being limited thereto.
