CN108491382A - A semi-supervised biomedical text semantic disambiguation method

Publication number
CN108491382A
Authority
CN
China
Prior art keywords
sentence
data
word
text
carry out
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810207213.XA
Other languages
Chinese (zh)
Inventor
李智
罗曜儒
李健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN201810207213.XA
Publication of CN108491382A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155 Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The present invention is a semantic disambiguation method for polysemous words in biomedical text. It mainly comprises: using Word2Vec to produce vector representations of the words in biomedical text; building vector representations of context sentences from the word vectors with a language model based on a bidirectional LSTM; then, using similarity in the sentence vector space together with a label propagation method, passing the labels of already annotated medical data to the most similar unannotated data according to probability; and finally combining all annotated data to perform semantic disambiguation on biomedical text. Because biomedical data is highly specialized and rich in terminology, manual processing of medical data is time-consuming, labor-intensive, and error-prone. The invention can greatly reduce the cost of manual annotation and, compared with traditional machine learning methods, can effectively improve the accuracy of semantic disambiguation.

Description

A semi-supervised biomedical text semantic disambiguation method
Technical field
The invention belongs to the field of semantic disambiguation in natural language processing and is a method and system for semi-supervised semantic disambiguation of biomedical text. Specifically, it uses a bidirectional long short-term memory model (Bi-LSTM) together with a label propagation method to semantically disambiguate polysemous words in medical text.
Background
In recent years, with the explosive growth of digital information, medical staff have found it increasingly easy to obtain electronic medical data. In the biomedical field, text data contains a large amount of specialized knowledge and information, and extracting useful information from digitized text is becoming increasingly important. Compared with general text data, medical text data is hard to work with because it is highly specialized and difficult to annotate. Understanding the semantics of biomedical text and automating the annotation of medical data have therefore become research hotspots.
Traditional biomedical text semantic disambiguation methods include supervised learning, unsupervised learning, and knowledge-base methods. Supervised learning trains a classifier on labeled data and then uses it to predict the latent semantics of unseen data. It usually needs a large amount of labeled data to keep the classifier accurate, and manual annotation is time-consuming and labor-intensive, so it is not the best choice in biomedical subfields where little data is available. Unsupervised learning needs no labeled data and groups data solely by latent similarity. It greatly reduces manual annotation effort, but its accuracy still needs further improvement, which makes it unsuitable for fields with low error tolerance such as medicine. Knowledge-base methods use existing open-source medical knowledge bases as training samples. Their advantage is high data reliability; their disadvantages are that knowledge bases are hard to extend and hard to maintain.
Biomedical text semantic disambiguation often uses a word vector model to turn each word in the text into a vector. Word semantics are stored as vectors in a low-dimensional space, and words with similar meanings are represented by similar vectors. A common word vector technique is the Word2Vec model, which includes the Skip-gram and CBOW models. Skip-gram uses the target word to predict the words in the surrounding window, whereas CBOW uses the words in the surrounding window to predict the target word. Similarly, a sentence vector fuses the word vector features of every word in a sentence to represent the semantics of that sentence. Common traditional fusion methods are concatenation, averaging, and weighted summation. Concatenation splices the word vectors of the sentence end to end in word order; averaging takes the mean of all word vectors in the sentence; weighted summation assigns each word a weight according to its importance to the semantics and then adds the weighted vectors. Sentence vectors are commonly used as features to initialize a language model, which simplifies downstream natural language processing tasks.
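As an illustration only, the three traditional fusion methods can be sketched in plain Python. The function names and the toy two-dimensional vectors are assumptions made for this example and do not come from the patent; real word vectors would come from a trained Word2Vec model.

```python
from typing import List, Sequence

Vector = List[float]


def concatenate(word_vectors: Sequence[Vector]) -> Vector:
    """Concatenation: splice the word vectors end to end in sentence order."""
    return [value for vector in word_vectors for value in vector]


def average(word_vectors: Sequence[Vector]) -> Vector:
    """Averaging: element-wise mean of all word vectors in the sentence."""
    count = len(word_vectors)
    return [sum(column) / count for column in zip(*word_vectors)]


def weighted_sum(word_vectors: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """Weighted summation: scale each word vector by its weight, then add."""
    if len(weights) != len(word_vectors):
        raise ValueError("one weight per word is required")
    return [
        sum(weight * value for weight, value in zip(weights, column))
        for column in zip(*word_vectors)
    ]


# Toy 2-dimensional word vectors for a three-word sentence (made-up values).
sentence = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(concatenate(sentence))                      # [1.0, 0.0, 0.0, 1.0, 2.0, 2.0]
print(average(sentence))                          # [1.0, 1.0]
print(weighted_sum(sentence, [0.5, 0.25, 0.25]))  # [1.0, 0.75]
```

Concatenation yields a vector whose length grows with the sentence, while averaging and weighted summation keep the dimension of a single word vector.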
A recurrent neural network (RNN) is a neural network model commonly used for text. Its characteristic is that it can carry information from previous time steps into the task at the current time step, which gives it a form of memory. In theory, an RNN can handle long-term dependencies in long sentences. In practice, however, Bengio et al. (1994) studied the problem in depth and found that RNNs fail to learn such dependencies: when the words are far apart, gradients may explode or vanish, backpropagation fails, and the textual information cannot be retained effectively. To overcome this shortcoming, an improved RNN model, the long short-term memory model (LSTM), was proposed. It adds three gate structures to the internal structure of the RNN: the forget gate decides how much information from the previous time step to keep, the input gate decides how much information from the current time step to keep, and the output gate decides how much information to output at the current time step. Through these gates the LSTM selectively uses previous and current information and effectively avoids the long-term dependency problem of the RNN.
Semi-supervised learning has been used successfully for semantic disambiguation in recent years, and bootstrapping algorithms can reach good accuracy. A low-recall classifier is learned from a small set of labeled examples and is then used to assign high-confidence labels to an unlabeled corpus, which in turn expands the labeled set. More recently, a label propagation algorithm for word sense disambiguation was proposed and compared with bootstrapping and with a supervised support vector machine (SVM) classifier. Label propagation can achieve better performance because it assigns labels by optimizing a global objective, whereas traditional algorithms such as bootstrapping propagate labels based on local, instance-level similarity.
Summary of the invention
The present invention provides a method and system for semantic disambiguation of biomedical text based on semi-supervised learning and deep learning. To a certain extent it addresses the weak global view of traditional disambiguation methods and the difficulty and high cost of manual annotation, and it improves the accuracy of semantic disambiguation for both biomedical and general text.
The invention consists of two parts: 1. A bidirectional long short-term memory (LSTM) network fuses word vectors into a sentence vector, producing the semantic features of the sentence. 2. A semi-supervised semantic disambiguation model based on label propagation uses the similarity of labeled data to annotate unlabeled data automatically and to resolve semantic ambiguity at the same time.
The technical solution adopted by the invention includes the following steps:
(I) Forming sentence vectors with the bidirectional LSTM network to generate the semantic features of the sentence
The bidirectional LSTM network consists of an output layer, a backward hidden layer, a forward hidden layer, and an input layer. Six distinct weights are reused at every time step: input layer to the forward and backward hidden layers (w1, w3), hidden layer to hidden layer (w2, w5), and forward and backward hidden layers to the output layer (w4, w6).
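The role of the six weights can be illustrated with a deliberately simplified sketch. It assumes scalar weights and plain tanh units in place of the LSTM hidden layers, so it shows only how w1 to w6 connect the layers, not the patented model itself; the function name and input values are hypothetical.

```python
import math
from typing import List


def bidirectional_pass(inputs: List[float], w1: float, w2: float, w3: float,
                       w4: float, w5: float, w6: float) -> List[float]:
    """Run a toy bidirectional recurrent layer over a scalar sequence.

    w1: input -> forward hidden       w3: input -> backward hidden
    w2: forward hidden -> forward hidden (from the previous step)
    w5: backward hidden -> backward hidden (from the following step)
    w4: forward hidden -> output      w6: backward hidden -> output
    """
    steps = len(inputs)

    # Forward hidden layer reads the sequence left to right.
    forward = [0.0] * steps
    previous = 0.0
    for t in range(steps):
        previous = math.tanh(w1 * inputs[t] + w2 * previous)
        forward[t] = previous

    # Backward hidden layer reads the sequence right to left.
    backward = [0.0] * steps
    following = 0.0
    for t in reversed(range(steps)):
        following = math.tanh(w3 * inputs[t] + w5 * following)
        backward[t] = following

    # The output at each step combines both directions.
    return [w4 * forward[t] + w6 * backward[t] for t in range(steps)]


# Made-up inputs and weights; the same six weights are reused at every step.
outputs = bidirectional_pass([0.5, -1.0, 0.25], w1=1.0, w2=0.5, w3=1.0,
                             w4=1.0, w5=0.5, w6=1.0)
print(outputs)
```

Because the same weights are shared across time steps, the number of parameters does not depend on the sentence length.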
The hidden layer is an LSTM, which consists of three gates (forget gate, input gate, output gate) and one memory cell.
The word vector of each word is the input to the bidirectional LSTM and, together with the output of the previous time step, produces the current output. The process has three stages.
Stage 1: the forget gate layer uses a sigmoid function to selectively filter the information from the previous time step.
Here the inputs are the output of the previous time step and the current input, i.e. the current word vector, and the result is a value between 0 and 1 that filters the information learned at the previous time step.
Stage 2: generate the new information to be updated.
First, the input gate layer uses a sigmoid function to decide which values to update.
Then a tanh layer generates new candidate values.
The cell state is then refreshed with the candidate values of the new information.
Stage 3: the output of the model.
A sigmoid layer produces an initial output.
The cell state is then scaled by a tanh function, and the two are multiplied to give the output of the model.
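The three stages correspond to the standard LSTM cell update, which can be sketched as follows. The formulas in the comments are the commonly published LSTM equations and are assumed, not confirmed, to match the ones in the original filing; the parameter layout (`params`, `lstm_step`) is hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[Vector]


def sigmoid(value: float) -> float:
    return 1.0 / (1.0 + math.exp(-value))


def affine(weights: Matrix, bias: Vector, inputs: Vector) -> Vector:
    """Compute weights @ inputs + bias with plain lists."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x_t: Vector, h_prev: Vector, c_prev: Vector,
              params: Dict[str, Tuple[Matrix, Vector]]) -> Tuple[Vector, Vector]:
    """One LSTM time step; returns (h_t, c_t)."""
    joined = h_prev + x_t  # [h_{t-1}, x_t]

    # Stage 1: f_t = sigmoid(W_f [h_{t-1}, x_t] + b_f), values in (0, 1)
    f_t = [sigmoid(v) for v in affine(*params["forget"], joined)]

    # Stage 2: i_t = sigmoid(W_i [h_{t-1}, x_t] + b_i)
    #          candidate = tanh(W_c [h_{t-1}, x_t] + b_c)
    #          c_t = f_t * c_{t-1} + i_t * candidate
    i_t = [sigmoid(v) for v in affine(*params["input"], joined)]
    candidate = [math.tanh(v) for v in affine(*params["candidate"], joined)]
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, candidate)]

    # Stage 3: o_t = sigmoid(W_o [h_{t-1}, x_t] + b_o)
    #          h_t = o_t * tanh(c_t)
    o_t = [sigmoid(v) for v in affine(*params["output"], joined)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# With all-zero weights every gate equals 0.5 and the candidate is 0,
# so the cell state is simply halved: c_t = [1.0].
zero_layer = ([[0.0, 0.0]], [0.0])
params = {name: zero_layer for name in ("forget", "input", "candidate", "output")}
h_t, c_t = lstm_step(x_t=[1.0], h_prev=[0.0], c_prev=[2.0], params=params)
print(h_t, c_t)
```

In a bidirectional model, one such cell would be run left to right and a second one right to left over the same word vectors.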
(II) Semi-supervised semantic disambiguation model based on label propagation
The label propagation method uses the similarity between samples to pass the labels of labeled data to unlabeled data according to probability. First, a graph model is built over all samples, in which each sample is a node and the similarity between any two nodes is computed by a formula containing one hyperparameter. Each node then propagates its label with a probability determined by its similarity to the surrounding nodes. In the probability formula, n denotes the number of edges.
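A minimal sketch of label propagation over a fully connected graph is given below. It assumes a Gaussian similarity with a hyperparameter `sigma` and row-normalised transition probabilities, a common textbook formulation; the exact formulas of the filing are not reproduced in this text, and claim 4 describes a reciprocal-distance similarity instead, so both the kernel and every name here are assumptions.

```python
import math
from typing import Dict, List

Vector = List[float]


def similarity(a: Vector, b: Vector, sigma: float) -> float:
    """Gaussian similarity from squared Euclidean distance (assumed form)."""
    squared = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-squared / (sigma ** 2))


def propagate_labels(vectors: List[Vector], labels: Dict[int, int],
                     num_classes: int, sigma: float = 1.0,
                     iterations: int = 50) -> List[int]:
    """Spread labels from labeled nodes to unlabeled ones.

    `labels` maps node index -> class index for the labeled nodes only.
    """
    n = len(vectors)
    # Edge weights between every pair of distinct nodes.
    weights = [[0.0 if i == j else similarity(vectors[i], vectors[j], sigma)
                for j in range(n)] for i in range(n)]
    # Row-normalise so each row is a probability distribution over neighbours.
    probs = [[w / (sum(row) or 1.0) for w in row] for row in weights]

    def clamp(scores: List[List[float]]) -> List[List[float]]:
        # Labeled nodes always keep their known label.
        for node, label in labels.items():
            scores[node] = [1.0 if c == label else 0.0 for c in range(num_classes)]
        return scores

    scores = clamp([[1.0 / num_classes] * num_classes for _ in range(n)])
    for _ in range(iterations):
        scores = [[sum(probs[i][j] * scores[j][c] for j in range(n))
                   for c in range(num_classes)] for i in range(n)]
        scores = clamp(scores)

    return [max(range(num_classes), key=lambda c: row[c]) for row in scores]


# Two tight clusters of made-up sentence vectors; nodes 0 and 2 are labeled.
vectors = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8]]
predicted = propagate_labels(vectors, labels={0: 0, 2: 1}, num_classes=2)
print(predicted)  # [0, 0, 1, 1]
```

Each unlabeled node ends up with the label that dominates among its most similar neighbours, while the labeled nodes act as fixed sources.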
Brief description of the drawings
Fig. 1 is a schematic diagram of the system of the invention.
Fig. 2 shows the internal structure of the LSTM used in the invention.
Detailed description
(1) The user inputs biomedical text, and sentence vector features are generated
The biomedical text is first segmented into words, and the Word2Vec model generates a word vector for each word. The word vectors of each sentence are then fed in order into the bidirectional long short-term memory model, which outputs two sentence vectors, one for each direction. The two are concatenated into a new sentence vector.
The new sentence vector is then fed into a multilayer perceptron to obtain the final sentence vector.
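The concatenation and perceptron step can be sketched as follows, assuming a single tanh hidden layer and made-up weights; the patent does not state the depth, layer sizes, or activation of the perceptron, so these are illustrative choices only.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def dense(weights: Matrix, bias: Vector, inputs: Vector) -> Vector:
    """Fully connected layer: weights @ inputs + bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def final_sentence_vector(forward_vec: Vector, backward_vec: Vector,
                          hidden_w: Matrix, hidden_b: Vector,
                          out_w: Matrix, out_b: Vector) -> Vector:
    """Concatenate the two directional vectors, then apply a small perceptron."""
    joined = forward_vec + backward_vec  # concatenation of both directions
    hidden = [math.tanh(v) for v in dense(hidden_w, hidden_b, joined)]
    return dense(out_w, out_b, hidden)


# Made-up one-dimensional directional vectors and perceptron weights.
sentence_vec = final_sentence_vector(
    forward_vec=[0.5], backward_vec=[-0.5],
    hidden_w=[[1.0, 1.0], [1.0, -1.0]], hidden_b=[0.0, 0.0],
    out_w=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], out_b=[0.0, 0.0, 0.0],
)
print(sentence_vec)
```

The output dimension is set by the last layer, so the final sentence vector need not have the same size as the concatenated one.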
(2) Using label propagation to annotate unlabeled data automatically and to disambiguate ambiguous words
The sentence vector features obtained in (1) serve as the nodes of a graph. The similarity between nodes is computed, and the label propagation method automatically propagates the most similar label to each unlabeled sample. For an ambiguous word, the sense that best matches the sentence vector is likewise passed on according to similarity.
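The similarity-based choice of a sense can be illustrated with the reciprocal Euclidean distance described in claim 4. The sketch below simply picks the label of the most similar labeled sentence vector, which is a simplification of full label propagation; the sense names, the vectors, and the small `epsilon` guard against division by zero are assumptions.

```python
import math
from typing import List, Tuple

Vector = List[float]


def inverse_distance_similarity(a: Vector, b: Vector, epsilon: float = 1e-9) -> float:
    """Similarity as the reciprocal of the Euclidean distance between vectors."""
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (distance + epsilon)


def most_similar_sense(query: Vector, labeled: List[Tuple[Vector, str]]) -> str:
    """Return the sense attached to the labeled vector closest to the query."""
    _, best_sense = max(
        labeled, key=lambda item: inverse_distance_similarity(query, item[0])
    )
    return best_sense


# Hypothetical labeled sentence vectors for two senses of an ambiguous term.
labeled = [([0.0, 1.0], "cold_temperature"), ([1.0, 0.0], "common_cold")]
query = [0.9, 0.2]
print(most_similar_sense(query, labeled))  # common_cold
```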
(3) Experimental results
Following steps (1) and (2), the experiments use two international medical text data sets, MSH WSD and NLM WSD. The MSH WSD data set contains 203 ambiguous medical entities and 37888 ambiguous sentences in total, of which 37090 samples are manually annotated. The NLM WSD data set contains 50 ambiguous entities and 552153 ordinary sentences, with 100 manually annotated samples for each ambiguous entity. The experiments use a 20:1 ratio: unlabeled data amounting to one twentieth of the original labeled data is added at random from other medical corpora and tested with the semi-supervised label propagation model proposed by the invention. The comparison of results is as follows:
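Read literally, the 20:1 ratio implies the following amounts of added unlabeled data. This is a worked estimate under the assumption that the ratio applies to the counts of manually annotated samples stated above and that fractions are rounded down; the patent does not give these totals, and the function name is hypothetical.

```python
def unlabeled_quota(labeled_count: int, ratio: int = 20) -> int:
    """Number of unlabeled samples to add for a labeled:unlabeled ratio of ratio:1."""
    return labeled_count // ratio


msh_labeled = 37090     # manually annotated MSH WSD samples
nlm_labeled = 50 * 100  # 50 ambiguous entities with 100 annotated samples each

print(unlabeled_quota(msh_labeled))  # 1854
print(unlabeled_quota(nlm_labeled))  # 250
```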
Table 1. Experimental results on the MSH WSD data set
Table 2. Experimental results on the NLM WSD data set
Here SVM denotes a support vector machine model, LSTM a unidirectional long short-term memory model, and Bi-LSTM a bidirectional long short-term memory model. WE (Con) uses concatenated word vectors as the sentence semantic feature, WE (Avg) uses averaged word vectors, WE (Wsum) uses the weighted sum of word vectors, and Con uses the model of the invention as the sentence semantic feature. LP denotes the label propagation method proposed by the invention. The experimental results show that, after unlabeled data is added, the language model proposed by the invention needs no manual annotation of that data, which reduces the annotation burden on medical staff. It also achieves the best accuracy in semantic disambiguation of medical text, which demonstrates that the invention is feasible and effective.

Claims (5)

1. A semi-supervised biomedical text semantic disambiguation method, characterized by comprising the following steps:
(1) producing vector representations of the words of medical text with a Word2Vec language model;
(2) on the basis of the word vectors, producing vector representations of the sentences of medical text with a bidirectional long short-term memory model (Bi-LSTM);
(3) using similarity in the sentence vector space, automatically annotating unlabeled data with a label propagation method and semantically disambiguating ambiguous words.
2. The vector representation of the words of medical text with a Word2Vec language model according to claim 1, characterized in that the words may include both medical terminology and general text words.
3. The vector representation of the sentences of medical text with the bidirectional long short-term memory model Bi-LSTM according to claim 1, characterized in that the input of the Bi-LSTM is the word vector representation of each word in the sentence and the output is the vector representation of the sentence.
4. The sentence vector space similarity according to claim 1, characterized in that the Euclidean distance formula is used to compute the geometric distance between sentence vectors, and the reciprocal of the geometric distance is used to compute the sentence vector similarity.
5. The automatic annotation of unlabeled data with a label propagation method and the semantic disambiguation of polysemous words according to claim 1, characterized in that, using the similarity between sentence vectors, the labels of annotated medical data are passed to unlabeled data according to probability, so that medical text data is semantically disambiguated automatically.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810207213.XA CN108491382A (en) 2018-03-14 2018-03-14 A kind of semi-supervised biomedical text semantic disambiguation method

Publications (1)

Publication Number Publication Date
CN108491382A true CN108491382A (en) 2018-09-04

Family

ID=63339234

Country Status (1)

Country Link
CN (1) CN108491382A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002010985A2 (en) * 2000-07-28 2002-02-07 Tenara Limited Method of and system for automatic document retrieval, categorization and processing
CN1916887A (en) * 2006-09-06 2007-02-21 哈尔滨工程大学 Method for eliminating ambiguity without directive word meaning based on technique of substitution words
US20140040275A1 (en) * 2010-02-09 2014-02-06 Siemens Corporation Semantic search tool for document tagging, indexing and search
CN104268200A (en) * 2013-09-22 2015-01-07 中科嘉速(北京)并行软件有限公司 Unsupervised named entity semantic disambiguation method based on deep learning
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN106919646A (en) * 2017-01-18 2017-07-04 南京云思创智信息科技有限公司 Chinese text summarization generation system and method
CN106980608A (en) * 2017-03-16 2017-07-25 四川大学 A kind of Chinese electronic health record participle and name entity recognition method and system
CN106997379A (en) * 2017-03-20 2017-08-01 杭州电子科技大学 A kind of merging method of the close text based on picture text click volume
CN107301213A (en) * 2017-06-09 2017-10-27 腾讯科技(深圳)有限公司 Intelligent answer method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DAYU YUAN et al.: "Semi-supervised Word Sense Disambiguation with Neural Models", 《ARXIV:1603.07012V2[CS.CL]》 *
ZHENG-YU NIU et al.: "Word Sense Disambiguation Using Label Propagation Based Semi-Supervised Learning", 《PROCEEDINGS OF THE 43RD ANNUAL MEETING OF THE ACL》 *
LI LISHUANG et al.: "Biomedical named entity recognition based on the CNN-BLSTM-CRF model", 《JOURNAL OF CHINESE INFORMATION PROCESSING》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109377203A (en) * 2018-09-13 2019-02-22 平安医疗健康管理股份有限公司 Medical settlement data processing method, device, computer equipment and storage medium
CN111597296A (en) * 2019-02-20 2020-08-28 阿里巴巴集团控股有限公司 Commodity data processing method, device and system
CN110059185A (en) * 2019-04-03 2019-07-26 天津科技大学 A kind of medical files specialized vocabulary automation mask method
CN110059185B (en) * 2019-04-03 2022-10-04 天津科技大学 Medical document professional vocabulary automatic labeling method
CN110287337A (en) * 2019-06-19 2019-09-27 上海交通大学 The system and method for medicine synonym is obtained based on deep learning and knowledge mapping
CN110705206A (en) * 2019-09-23 2020-01-17 腾讯科技(深圳)有限公司 Text information processing method and related device
CN111221960A (en) * 2019-10-28 2020-06-02 支付宝(杭州)信息技术有限公司 Text detection method, similarity calculation method, model training method and device
CN111414473A (en) * 2020-02-13 2020-07-14 合肥工业大学 Semi-supervised classification method and system
CN111414473B (en) * 2020-02-13 2021-09-07 合肥工业大学 Semi-supervised classification method and system
CN111881979A (en) * 2020-07-28 2020-11-03 复旦大学 Multi-modal data annotation device and computer-readable storage medium containing program
CN113158687A (en) * 2021-04-29 2021-07-23 新声科技(深圳)有限公司 Semantic disambiguation method and device, storage medium and electronic device
CN113779987A (en) * 2021-08-23 2021-12-10 科大国创云网科技有限公司 Event co-reference disambiguation method and system based on self-attention enhanced semantics
CN113742458A (en) * 2021-09-18 2021-12-03 苏州大学 Natural language instruction disambiguation method and system for mechanical arm grabbing
CN115293158A (en) * 2022-06-30 2022-11-04 撼地数智(重庆)科技有限公司 Disambiguation method and device based on label assistance
CN115293158B (en) * 2022-06-30 2024-02-02 撼地数智(重庆)科技有限公司 Label-assisted disambiguation method and device

Similar Documents

Publication Publication Date Title
CN108491382A (en) A kind of semi-supervised biomedical text semantic disambiguation method
Du et al. Text classification research with attention-based recurrent neural networks
CN108733742B (en) Global normalized reader system and method
Zhou et al. Recurrent convolutional neural network for answer selection in community question answering
CN109800437B (en) Named entity recognition method based on feature fusion
CN109543180A (en) A kind of text emotion analysis method based on attention mechanism
Li et al. Recognizing biomedical named entities based on the sentence vector/twin word embeddings conditioned bidirectional LSTM
CN113743099B (en) System, method, medium and terminal for extracting terms based on self-attention mechanism
Kim et al. Exploring convolutional and recurrent neural networks in sequential labelling for dialogue topic tracking
El-Kilany et al. Using deep neural networks for extracting sentiment targets in arabic tweets
Deng et al. Self-attention-based BiGRU and capsule network for named entity recognition
Cao et al. Stacked residual recurrent neural network with word weight for text classification
Lu et al. Incorporating domain knowledge into natural language inference on clinical texts
Song et al. A method for identifying local drug names in xinjiang based on BERT-BiLSTM-CRF
Kumar et al. Deep learning-based frameworks for aspect-based sentiment analysis
He et al. Multi-level attention based BLSTM neural network for biomedical event extraction
CN112699685A (en) Named entity recognition method based on label-guided word fusion
CN106815211B (en) Method for document theme modeling based on cyclic focusing mechanism
Yuan A joint method for Chinese word segmentation and part-of-speech labeling based on deep neural network
Diao et al. Leveraging integrated learning for open-domain Chinese named entity recognition
CN111723301B (en) Attention relation identification and labeling method based on hierarchical theme preference semantic matrix
WO2016161631A1 (en) Hidden dynamic systems
Kumar Disambiguation Model for Bio-Medical Named Entity Recognition
Yang et al. Service component recommendation based on LSTM
Liu Network security entity recognition methods based on the deep neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180904