CN111966780A - Retrospective queue selection method and device based on word vector modeling and information retrieval - Google Patents

Retrospective queue selection method and device based on word vector modeling and information retrieval

Publication number
CN111966780A
Authority
CN
China
Prior art keywords
word
retrospective
patient
query
queue selection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910438020.XA
Other languages
Chinese (zh)
Inventor
王嫄
孔娜
张雪
王栋
赵婷婷
王洁
史艳翠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University of Science and Technology
Original Assignee
Tianjin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-05-20
Filing date
2019-05-20
Publication date
2020-11-20
Application filed by Tianjin University of Science and Technology
Priority to CN201910438020.XA
Publication of CN111966780A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A retrospective cohort selection method and device based on word vector modeling and information retrieval. Existing retrospective cohort selection methods suffer from low recall, poor accuracy, and incomplete retrieval results. To address these problems, the invention introduces word clustering into named entity recognition during electronic health record preprocessing and improves the conventional handling of negation words. The skip-gram algorithm learns embeddings of the medical concepts in the electronic health records, yielding vector representations of the different entities, and patient representations are then modeled from these medical concept vectors. Concepts related to the query are obtained by query expansion, the relation between each patient and each query vector is measured by cosine distance, and the patients are ranked and output by distance. The design is reasonable, and with the improved preprocessing of negation words it can effectively raise both the accuracy of semantic matching and the recall of cohort selection.

Description

Retrospective queue selection method and device based on word vector modeling and information retrieval
Technical Field
The invention belongs to the field of intelligent search. It relates to named entity recognition, negation handling, word embedding, query expansion, and cohort selection, and in particular to a retrospective cohort selection method based on word vector modeling and information retrieval in the medical domain.
Background
Cohort selection over electronic health records is a form of information retrieval, and research based on electronic health records offers countless opportunities for biomedical research and precision medicine. Such research requires reliable detection of patients with a particular disease or condition for cohort studies. However, accurately identifying patients with a particular disease in an electronic health record system is not trivial, owing to input errors, coding bias, medical reporting bias, and limits on data availability, data structure, and identification accuracy. Defining cases or disease groups by a single clinical concept (for example, a code from the International Statistical Classification of Diseases and Related Health Problems) is often insufficient to produce reliable results. In addition, the performance of such concepts varies greatly from one disease to another, and cohort selection runs into problems such as low recall, frequent failure of semantic matching, and incomplete retrieval results. The following efforts have been made to address these problems.
First, retrieval can rely on clustering alone: similar items are grouped into classes according to some similarity measure. At retrieval time, the similarity between the query vector and each class is computed, the classes whose similarity to the query vector exceeds a threshold are selected, the similarity between the query vector and each vector in those classes is computed, and the R closest items are returned in ranked order.
Second, text can be modeled semantically. LDA is the most classical and most interpretable method for modeling the latent topics of text. It is a three-layer Bayesian probability model with a word layer, a topic layer, and a document layer. As an unsupervised machine learning technique, it can identify latent topic information in a large document collection or corpus. It adopts a bag-of-words approach in which each document is treated as a word-frequency vector, so that text is converted into numerical information that is easy to model. Topic modeling techniques first map queries and documents into a latent space and then match them there.
In the preprocessing of electronic health records, the following steps are generally taken: the text is first segmented into words, and stop words that carry no useful value, such as "thus" and "normal", together with negation words, are then removed from the segmented result. Because an electronic health record is a physician's professional description of a patient's health condition, its highly specialized nature and physicians' writing habits need to be taken into account.
One evident linguistic feature of electronic health records is the frequent use of negation words. For example, the denial of a symptom can take the form "no / not heard / denies / not reported / not accompanied by + symptom". If these frequent negation words are not handled well, they can seriously mislead retrieval. A common way to handle them is to use the ConText algorithm to delete all negated parts from the medical record before indexing. However, deleting all negated parts outright greatly reduces the precision and effectiveness of retrieval.
Using the above methods for cohort selection leads to problems such as low recall and incomplete retrieval results, mainly because they rely on exact matching of words or entities. The invention improves the preprocessing of negation words, learns embeddings of medical concepts with a deep learning algorithm, and obtains disease-related concepts by query expansion. Patients are ranked and output according to their distance from the query, so that semantic matching accuracy is effectively improved, more accurate information matching is achieved quickly, and the recall of cohort selection is raised.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art, namely the low recall, frequent failure of semantic matching, and incomplete retrieval results of traditional cohort selection methods. By performing retrospective cohort selection based on word vector modeling and information retrieval, it effectively improves accuracy in cohort identification over electronic health records.
The invention solves this technical problem with the following technical scheme:
Step 1: preprocessing the electronic health record data.
Step 2: named entity recognition with word clustering.
Step 3: processing the negation words in the electronic health records.
Step 4: learning medical concept embeddings with the skip-gram algorithm.
Step 5: obtaining patient representations from the medical concept embeddings.
Step 6: obtaining the concepts related to the query through query expansion.
Step 7: measuring the distance between each vector of the expanded query and each candidate patient document, and outputting the documents in ranked order.
In step 2, named entity recognition is performed on the electronic health record data, and a word clustering method is introduced to update the recognition result. Similar items are grouped into classes according to some similarity measure. During named entity recognition, the similarity between an entity and each entity class is computed, and the class with the highest similarity to the entity is selected as its class. Finally, the initial named entity recognition result is refined according to this class assignment.
In step 3, after the electronic health record data has been preprocessed in step 1, the negation words in the electronic health records receive special treatment. Text genres differ in how they express their main point. News text, for example, has an obvious domain characteristic: the full text unfolds closely around the focal point set by the headline. Medical text, by contrast, more often approaches its focal point by semantic elimination, that is, by ruling findings out. Negation words are therefore extremely frequent in electronic health records; the negation of a symptom is often expressed as "no / not heard / denies / not reported / not accompanied by + symptom". The usual approach is to use the ConText algorithm to delete all negated parts from the medical record before indexing. The invention instead handles the negated parts of the text with a "reverse selection" method: first, the entities that follow a negation word are found, and this search result set is defined as A. Next, the set of medical concepts produced by named entity recognition is defined as B. Finally, the intersection with set A is deleted from B, giving a medical concept result set C that contributes positively to cohort selection.
In step 4, a word vector modeling method is applied to the data of each patient.
In step 5, for each time-interval region of a patient's clinical record, a simple sentence aggregation method is used to obtain the patient representation from the medical concepts.
In step 6, the related concepts of each cohort to be queried are obtained through query expansion. The expansion terms are added to the original query, the relation between each patient and the query vector is measured by cosine distance, and the distance of the patient vector closest to the query vector is taken as the final score. This operation is repeated for the vectors of all expansion terms.
In step 7, the patients are ranked by their distance from the query, and the candidate documents are output in that order.
Drawings
FIG. 1 is a schematic flow diagram of the system of the present invention.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
Step 1: preprocessing the electronic health record data.
This step mainly comprises the following:
(1) Word segmentation
The data can be segmented into words in various ways, for example with the jieba word segmentation tool.
(2) Removing stop words other than negation words
A stop-word list suited to the project is constructed, with negation words excluded from it. The stop words on this list are then removed from the data.
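The two preprocessing sub-steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the word lists `STOP_WORDS` and `NEGATION_WORDS` and the function names are hypothetical, and the tokenizer defaults to whitespace splitting so that the sketch runs without third-party packages. For Chinese text, a segmenter such as `jieba.lcut` could be passed as `tokenize` instead.

```python
from typing import AbstractSet, Callable, Iterable, List, Set

# Hypothetical word lists for illustration; a real project would curate its own.
NEGATION_WORDS: Set[str] = {"no", "not", "denies", "without"}
STOP_WORDS: Set[str] = {"the", "a", "of", "and", "thus", "normal", "no", "not"}


def build_stop_list(stop_words: Iterable[str],
                    negation_words: Iterable[str]) -> Set[str]:
    """Return the stop-word list with every negation word taken out of it."""
    return set(stop_words) - set(negation_words)


def preprocess(text: str,
               tokenize: Callable[[str], List[str]] = str.split,
               stop_list: AbstractSet[str] = frozenset()) -> List[str]:
    """Segment the text, then drop stop words; negation words survive
    because they are not on the stop list."""
    return [tok for tok in tokenize(text.lower()) if tok not in stop_list]


stop_list = build_stop_list(STOP_WORDS, NEGATION_WORDS)
tokens = preprocess("The patient denies fever and thus no cough",
                    stop_list=stop_list)
print(tokens)
```

Keeping the negation words at this stage matters because step 3 needs them to locate negated entities.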
Step 2: named entity recognition with word clustering.
Named entity recognition is performed on the electronic health record data, and a word clustering method is introduced to update the recognition result. Similar items are grouped into classes according to some similarity measure. During named entity recognition, the similarity between an entity and each entity class is computed, and the class with the highest similarity to the entity is selected as its class.
The similarity between an entity and an entity class is defined as the similarity between the entity and the closest entity in that class. The similarity between two entities is defined as follows. Suppose word $W_1$ has $n$ concepts $S_{11}, S_{12}, \dots, S_{1n}$ and word $W_2$ has $m$ concepts $S_{21}, S_{22}, \dots, S_{2m}$. The similarity of $W_1$ and $W_2$ is then the maximum similarity over all pairs of their concepts:
$$\mathrm{Sim}(W_1, W_2) = \max_{1 \le i \le n,\; 1 \le j \le m} \mathrm{sim}(S_{1i}, S_{2j})$$
Finally, the entity class with the highest similarity to the entity is selected as the class of the entity. All ICD codes are then normalized to 4 digits, and drug data, vital signs, treatment procedures, laboratory test results, and the like are standardized as well.
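The class assignment described in this step can be sketched as below. The concept-level similarity `sim` is not specified in the text, so the sketch takes it as a parameter; `toy_sim`, the concept labels, and the entity classes are all hypothetical and serve only to make the example runnable.

```python
from typing import Callable, Dict, List, Sequence

Concept = str
SimFn = Callable[[Concept, Concept], float]


def word_similarity(concepts1: Sequence[Concept],
                    concepts2: Sequence[Concept],
                    sim: SimFn) -> float:
    """Sim(W1, W2): the maximum concept-level similarity over all pairs."""
    return max(sim(s1, s2) for s1 in concepts1 for s2 in concepts2)


def assign_class(entity: str,
                 classes: Dict[str, List[str]],
                 concepts_of: Dict[str, List[Concept]],
                 sim: SimFn) -> str:
    """Pick the class whose closest member is most similar to the entity."""
    def class_score(members: List[str]) -> float:
        return max(word_similarity(concepts_of[entity], concepts_of[m], sim)
                   for m in members)
    return max(classes, key=lambda name: class_score(classes[name]))


def toy_sim(s1: Concept, s2: Concept) -> float:
    """Hypothetical similarity: 1.0 if equal, 0.5 if same category prefix."""
    if s1 == s2:
        return 1.0
    return 0.5 if s1.split(":")[0] == s2.split(":")[0] else 0.0


concepts_of = {
    "aspirin": ["drug:nsaid"],
    "ibuprofen": ["drug:nsaid", "drug:analgesic"],
    "fever": ["symptom:temperature"],
    "cough": ["symptom:respiratory"],
    "paracetamol": ["drug:analgesic"],
}
classes = {"drug": ["aspirin", "ibuprofen"], "symptom": ["fever", "cough"]}
print(assign_class("paracetamol", classes, concepts_of, toy_sim))
```

In practice `sim` might come from a thesaurus or from embedding similarity; the source leaves that choice open.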
Step 3: processing the negation words in the electronic health records.
The invention gives special treatment to the negation words in the electronic health records, where they are extremely frequent. For example, the negation of a symptom is often expressed as "no / not heard / denies / not reported / not accompanied by + symptom". The usual approach is to use the ConText algorithm to delete all negated parts from the medical record before indexing. The invention instead handles the negated parts of the text with a "reverse selection" method: first, the entities that follow a negation word are found, and this search result set is defined as A. Next, the set of medical concepts produced by named entity recognition is defined as B. Finally, the intersection with set A is deleted from B, giving a medical concept result set C that contributes positively to cohort selection.
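The "reverse selection" of sets A, B, and C amounts to a set difference, as the following sketch shows. It assumes that an entity counts as negated when it directly follows a negation cue; the cue list `NEGATION_CUES` and the function names are hypothetical, and a real system would need a wider negation scope.

```python
from typing import AbstractSet, List, Set

NEGATION_CUES: Set[str] = {"no", "denies", "without"}  # hypothetical cue list


def negated_entities(tokens: List[str],
                     entities: AbstractSet[str],
                     cues: AbstractSet[str] = frozenset(NEGATION_CUES)) -> Set[str]:
    """Set A: entities that directly follow a negation cue (an assumption;
    the source only says 'the entity part followed by the negative word')."""
    return {tok for prev, tok in zip(tokens, tokens[1:])
            if prev in cues and tok in entities}


def reverse_select(tokens: List[str], entities: AbstractSet[str]) -> Set[str]:
    """Set C: the recognized concepts B with the intersection A and B removed."""
    a = negated_entities(tokens, entities)
    return set(entities) - a


tokens = ["patient", "denies", "fever", "no", "cough", "has", "headache"]
b = {"fever", "cough", "headache"}  # set B from named entity recognition
c = reverse_select(tokens, b)
print(c)
```

Because the operation works on sets, a concept that is negated in one place and affirmed in another is removed entirely; whether that is desirable depends on the data.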
Step 4: learning medical concept embeddings with the skip-gram algorithm.
The invention uses the skip-gram algorithm to learn embeddings of medical concepts. For each patient, the patient data is partitioned into 10-day intervals, and duplicate items within each interval region are removed. The data in each region is then randomly shuffled, so that each interval region is represented as a sequence of unique medical concepts, and this sequence is fed to the word2vec algorithm as a sentence for training. Finally, each medical concept is represented as a 200-dimensional embedding vector, and all medical concepts are mapped into the same metric space.
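The construction of training sentences from one patient's record can be sketched as follows. The `(date, concept)` event format, the function name, and the choice to anchor the 10-day windows at the patient's first event are assumptions made for illustration; the source does not say how the intervals are aligned. The training call itself is shown only as a comment because it needs the third-party gensim package.

```python
import random
from collections import defaultdict
from datetime import date
from typing import Dict, List, Set, Tuple

Event = Tuple[date, str]  # (date of the record entry, medical concept)


def build_sentences(events: List[Event],
                    interval_days: int = 10,
                    seed: int = 0) -> List[List[str]]:
    """Group one patient's events into fixed-length windows, drop duplicates
    within each window, and shuffle each window into a 'sentence'."""
    if not events:
        return []
    start = min(day for day, _ in events)  # assumed window anchor
    windows: Dict[int, Set[str]] = defaultdict(set)
    for day, concept in events:
        windows[(day - start).days // interval_days].add(concept)
    rng = random.Random(seed)
    sentences = []
    for index in sorted(windows):
        sentence = sorted(windows[index])  # fixed order before shuffling
        rng.shuffle(sentence)
        sentences.append(sentence)
    return sentences


events = [
    (date(2019, 1, 1), "I10"),
    (date(2019, 1, 3), "I10"),        # duplicate within the first window
    (date(2019, 1, 5), "metformin"),
    (date(2019, 1, 15), "E11.9"),     # falls into the second window
]
sentences = build_sentences(events)
print(sentences)

# With gensim 4.x installed, skip-gram training on these sentences could
# look like this (sg=1 selects skip-gram, vector_size=200 as in the text):
#   from gensim.models import Word2Vec
#   model = Word2Vec(sentences, vector_size=200, sg=1, min_count=1)
```

Shuffling removes any spurious ordering among concepts recorded in the same interval, so that co-occurrence within the window, not position, drives the embedding.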
Step 5: obtaining the patient representation from the medical concept embeddings.
A simple sentence aggregation method is applied to each time-interval region of a patient's clinical record. The patient vector is modeled from the medical concept vectors learned in step 4: the record data of a patient within one time-interval region is selected, the step 4 vectors of the medical concepts it contains are combined in a weighted average, and the projection of this average vector onto the first principal component is subtracted; the result serves as the representation of the patient. This removes the dominant shared component from the vector and yields a more discriminative aggregate embedding. The weight of a phenotype $w$ is computed as:
$$\mathrm{weight}(w) = \frac{a}{a + p(w)}$$
where $a$ is a parameter and $p(w)$ is the (estimated) frequency of phenotype $w$ over the entire data set.
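The aggregation can be sketched in plain Python as below. Two points are assumptions rather than statements from the source: the first principal component is taken to be the dominant direction of the matrix of interval vectors (computed here by power iteration on the uncentered rows, as in the common smooth-inverse-frequency recipe), and the default `a = 1e-3` is an illustrative value. All names and the toy vectors are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def weighted_average(concepts: List[str], emb: Dict[str, Vector],
                     freq: Dict[str, float], a: float = 1e-3) -> Vector:
    """Average the concept vectors, each weighted by a / (a + p(w))."""
    dim = len(next(iter(emb.values())))
    total = [0.0] * dim
    for c in concepts:
        weight = a / (a + freq[c])
        for k, x in enumerate(emb[c]):
            total[k] += weight * x
    return [x / len(concepts) for x in total]


def first_direction(rows: List[Vector], iterations: int = 100) -> Vector:
    """Unit vector along the dominant direction of the rows, found by
    power iteration on X^T X (rows are not centered; an assumption)."""
    dim = len(rows[0])
    u = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        proj = [sum(r[k] * u[k] for k in range(dim)) for r in rows]
        v = [sum(p * r[k] for p, r in zip(proj, rows)) for k in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            break
        u = [x / norm for x in v]
    return u


def remove_direction(vec: Vector, u: Vector) -> Vector:
    """Subtract from vec its projection onto the unit vector u."""
    dot = sum(x * y for x, y in zip(vec, u))
    return [x - dot * y for x, y in zip(vec, u)]


emb = {"I10": [1.0, 0.0], "E11": [0.0, 1.0]}   # toy 2-dimensional embeddings
freq = {"I10": 0.5, "E11": 0.5}                # toy phenotype frequencies
avg = weighted_average(["I10", "E11"], emb, freq, a=0.5)

interval_vectors = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3]]
u = first_direction(interval_vectors)
patient_vectors = [remove_direction(v, u) for v in interval_vectors]
print(avg, patient_vectors)
```

Rare phenotypes receive weights close to 1 and frequent ones are damped, which is the purpose of the a / (a + p(w)) form.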
Step 6: obtaining the concepts related to the query through query expansion.
For each cohort to be queried, a definition such as a disease is given, and its related concepts are obtained through query expansion. The N "disease", "drug", and "symptom" expansion terms closest to the query ICD-9 code are each added to the original query. For each vector of the expanded query, the relation between each patient and the query vector is measured by cosine distance, and the distance of the patient vector closest to the query vector is taken as the final score. This operation is repeated for the vectors of all expansion terms.
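Query expansion and scoring can be sketched as follows. The sketch simplifies in two ways that should be read as assumptions: expansion terms are drawn from a single pool rather than separately per "disease", "drug", and "symptom" type, and the per-term scores are returned as they are, since the text does not state how scores from different expansion terms are combined. The embeddings and names are toy values.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    """1 - cosine similarity; both vectors are assumed to be non-zero."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm


def expand_query(query: str, emb: Dict[str, Vector], n: int) -> List[str]:
    """The original query concept followed by its n nearest concepts."""
    others = sorted((c for c in emb if c != query),
                    key=lambda c: cosine_distance(emb[query], emb[c]))
    return [query] + others[:n]


def score_patient(terms: List[str], emb: Dict[str, Vector],
                  interval_vectors: List[Vector]) -> Dict[str, float]:
    """For every query term, the distance of the patient's closest
    interval vector to that term's vector."""
    return {t: min(cosine_distance(emb[t], v) for v in interval_vectors)
            for t in terms}


emb = {
    "250.00": [1.0, 0.0],     # toy vector for the query ICD-9 code
    "metformin": [0.9, 0.1],
    "fracture": [0.0, 1.0],
}
terms = expand_query("250.00", emb, n=1)
scores = score_patient(terms, emb, [[0.0, 1.0], [1.0, 0.0]])
print(terms, scores)
```

Taking the minimum over a patient's interval vectors means that a single interval closely matching the query is enough to rank the patient highly.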
Step 7: ranking the patient documents by the distance between patient and query, and outputting the candidate documents in ranked order.
The distance between the final query and each candidate patient record document is obtained, and the candidate documents are sorted and output by this distance:
$$F(q, d \mid G) = w_0\, D_{ws}(q, d)$$
where $D_{ws}(q, d)$ is the distance between the query and the patient document from step 6, $w_0$ is a learned parameter, and $F(q, d \mid G)$ is the final ranking score.
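The final ranking is a sort on the scaled distance, sketched below. The patient identifiers and distances are made up, `w0` defaults to 1.0 purely for illustration (the source describes it as learned), and the sketch assumes `w0` is positive so that a smaller score means a closer match.

```python
from typing import Dict, List, Tuple


def rank_patients(distances: Dict[str, float],
                  w0: float = 1.0) -> List[Tuple[str, float]]:
    """Compute F = w0 * D for every patient and sort ascending, so the
    patient closest to the query comes first (assumes w0 > 0)."""
    scored = {pid: w0 * d for pid, d in distances.items()}
    return sorted(scored.items(), key=lambda item: item[1])


distances = {"patient_1": 0.4, "patient_2": 0.1, "patient_3": 0.7}  # toy data
ranking = rank_patients(distances)
print([pid for pid, _ in ranking])
```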
It should be emphasized that the embodiments described herein are illustrative rather than restrictive, and thus the present invention is not limited to the embodiments described in the detailed description, but also includes other embodiments that can be derived from the technical solutions of the present invention by those skilled in the art.

Claims (8)

1. A retrospective queue selection method and device based on word vector modeling and information retrieval, comprising the following steps:
Step 1: preprocessing the electronic health record data.
Step 2: named entity recognition with word clustering.
Step 3: processing the negation words in the electronic health records.
Step 4: learning medical concept embeddings with the skip-gram algorithm.
Step 5: obtaining patient representations from the medical concept embeddings.
Step 6: obtaining the concepts related to the query through query expansion.
Step 7: measuring the distance between each vector of the expanded query and each candidate patient document, and outputting the documents in ranked order.
2. The retrospective queue selection method and device based on word vector modeling and information retrieval as claimed in claim 1, wherein step 1 is implemented as follows: the data is segmented into words, which can be done in various ways, for example with the jieba word segmentation tool; stop words other than negation words are then removed.
3. The retrospective queue selection method and device based on word vector modeling and information retrieval as claimed in claim 1, wherein step 2 is implemented as follows: named entity recognition is performed on the electronic health record data, and a word clustering method is introduced to update the recognition result. Similar items are grouped into classes according to some similarity measure; during named entity recognition, the similarity between an entity and each entity class is computed, and the class with the highest similarity to the entity is selected as its class. Finally, the initial named entity recognition result is refined according to this class assignment.
4. The retrospective queue selection method and device based on word vector modeling and information retrieval as claimed in claim 1, wherein step 3 is implemented as follows: the negation words in the electronic health records receive special treatment, and the negated parts of the text are handled with a "reverse selection" method: first, the entities that follow a negation word are found, and this search result set is defined as A. Next, the set of medical concepts produced by named entity recognition is defined as B. Finally, the intersection with set A is deleted from B, giving a medical concept result set C that contributes positively to cohort selection.
5. The retrospective queue selection method and device based on word vector modeling and information retrieval as claimed in claim 1, wherein medical concept embeddings are learned with the skip-gram algorithm as follows: for each patient, the patient data is partitioned into 10-day intervals, and duplicate items within each interval region are removed. The data in each region is then randomly shuffled, so that each interval region is represented as a sequence of unique medical concepts, and this sequence is fed to the word2vec algorithm as a sentence for training. Finally, each medical concept is represented as a 200-dimensional embedding vector, and all medical concepts are mapped into the same metric space.
6. The retrospective queue selection method and device based on word vector modeling and information retrieval as claimed in claim 1, wherein, for each time-interval region of a patient's clinical record, a simple sentence aggregation method is used to obtain the patient representation from the medical concepts.
7. The retrospective queue selection method and device based on word vector modeling and information retrieval as claimed in claim 1, wherein step 6 is implemented as follows: the related concepts of each cohort to be queried are obtained through query expansion. The expansion terms are added to the original query, the relation between each patient and the query vector is measured by cosine distance, and the distance of the patient vector closest to the query vector is taken as the final score. This operation is repeated for the vectors of all expansion terms.
8. The retrospective queue selection method and device based on word vector modeling and information retrieval as claimed in claim 1, wherein step 7 is implemented as follows: the patients are ranked by their distance from the query, and the candidate documents are output in that order.
CN201910438020.XA 2019-05-20 2019-05-20 Retrospective queue selection method and device based on word vector modeling and information retrieval Pending CN111966780A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910438020.XA CN111966780A (en) 2019-05-20 2019-05-20 Retrospective queue selection method and device based on word vector modeling and information retrieval

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910438020.XA CN111966780A (en) 2019-05-20 2019-05-20 Retrospective queue selection method and device based on word vector modeling and information retrieval

Publications (1)

Publication Number Publication Date
CN111966780A true CN111966780A (en) 2020-11-20

Family

ID=73357779

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910438020.XA Pending CN111966780A (en) 2019-05-20 2019-05-20 Retrospective queue selection method and device based on word vector modeling and information retrieval

Country Status (1)

Country Link
CN (1) CN111966780A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076411A (en) * 2021-04-26 2021-07-06 同济大学 Medical query expansion method based on knowledge graph

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104268200A (en) * 2013-09-22 2015-01-07 中科嘉速(北京)并行软件有限公司 Unsupervised named entity semantic disambiguation method based on deep learning
CN105677873A (en) * 2016-01-11 2016-06-15 中国电子科技集团公司第十研究所 Text information associating and clustering collecting processing method based on domain knowledge model
US20160306877A1 (en) * 2013-12-13 2016-10-20 Danmarks Tekniske Universitet Method of and system for information retrieval
CN106156272A (en) * 2016-06-21 2016-11-23 北京工业大学 A kind of information retrieval method based on multi-source semantic analysis
CN107861939A (en) * 2017-09-30 2018-03-30 昆明理工大学 A kind of domain entities disambiguation method for merging term vector and topic model
CN107945871A (en) * 2017-12-19 2018-04-20 贵州医科大学附属医院 A kind of blood disease intelligent classification system based on big data
CN108399238A (en) * 2018-03-01 2018-08-14 福州大学 A kind of viewpoint searching system and method for fusing text generalities and network representation
CN109003682A (en) * 2018-06-25 2018-12-14 广州市品毅信息科技有限公司 Adverse drug reaction intelligent monitoring method based on domain ontology repository
CN109659033A (en) * 2018-12-18 2019-04-19 浙江大学 A kind of chronic disease change of illness state event prediction device based on Recognition with Recurrent Neural Network

Similar Documents

Publication Publication Date Title
CN109344250B (en) Rapid structuring method of single disease diagnosis information based on medical insurance data
CN110109835B (en) Software defect positioning method based on deep neural network
US8341159B2 (en) Creating taxonomies and training data for document categorization
US9230009B2 (en) Routing of questions to appropriately trained question and answer system pipelines using clustering
Wu et al. Webiq: Learning from the web to match deep-web query interfaces
US20140358928A1 (en) Clustering Based Question Set Generation for Training and Testing of a Question and Answer System
CN111949759A (en) Method and system for retrieving medical record text similarity and computer equipment
CN108062978B (en) Method for predicting main adverse cardiovascular events of patients with acute coronary syndrome
CN113076411B (en) Medical query expansion method based on knowledge graph
CN107291895B (en) Quick hierarchical document query method
CN110134777B (en) Question duplication eliminating method and device, electronic equipment and computer readable storage medium
CN111832306A (en) Image diagnosis report named entity identification method based on multi-feature fusion
CN115983233A (en) Electronic medical record duplication rate estimation method based on data stream matching
CN110399493B (en) Author disambiguation method based on incremental learning
CN111597330A (en) Intelligent expert recommendation-oriented user image drawing method based on support vector machine
Ektefa et al. A comparative study in classification techniques for unsupervised record linkage model
Li et al. Improved deep belief network model and its application in named entity recognition of Chinese electronic medical records
CN111966780A (en) Retrospective queue selection method and device based on word vector modeling and information retrieval
Fu et al. A supervised learning and group linking method for historical census household linkage
CN116719840A (en) Medical information pushing method based on post-medical-record structured processing
Daradkeh et al. Lifelong machine learning for topic modeling based on hellinger distance
Banerjee et al. A novel centroid based sentence classification approach for extractive summarization of COVID-19 news reports
Kaur A comparison of machine learning classifiers for use on historical record linkage
CN107341169B (en) Large-scale software information station label recommendation method based on information retrieval
Lu et al. Improving web search relevance with semantic features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination