CN111966780A

CN111966780A - Retrospective cohort selection method and device based on word vector modeling and information retrieval

Info

Publication number: CN111966780A
Application number: CN201910438020.XA
Authority: CN
Inventors: 王嫄; 孔娜; 张雪; 王栋; 赵婷婷; 王洁; 史艳翠
Original assignee: Tianjin University of Science and Technology
Current assignee: Tianjin University of Science and Technology
Priority date: 2019-05-20
Filing date: 2019-05-20
Publication date: 2020-11-20

Abstract

A retrospective cohort selection method and device based on word vector modeling and information retrieval. To address the low recall, poor accuracy and incomplete retrieval results of existing retrospective cohort selection methods, the invention introduces a word clustering method into named entity recognition during the preprocessing of electronic health records and improves the conventional handling of negation words. The skip-gram algorithm is used to learn embeddings of the medical concepts in the electronic health records, yielding vector representations of the different entities; patient representations are then modeled from these medical concept vectors. Concepts related to the query are obtained by query expansion, the relation between each patient and each query vector is measured by cosine distance, and the patients are ranked and output according to these distances. The invention is reasonably designed and, with the improved preprocessing of negation words, effectively improves the accuracy of semantic matching and the recall of cohort selection.

Description

Retrospective cohort selection method and device based on word vector modeling and information retrieval

Technical Field

The invention belongs to the field of intelligent search. It relates to named entity recognition, negation handling, word embedding, query expansion and cohort selection, and in particular to a retrospective cohort selection method based on word vector modeling and information retrieval in the medical domain.

Background

Cohort selection over electronic health records is a form of information retrieval, and research based on electronic health records offers countless opportunities for biomedical research and precision medicine. Such research requires reliable detection of patients with a particular disease or condition for cohort studies. However, accurately identifying patients with a particular disease in an electronic record system is not trivial, owing to input errors, coding bias, medical reporting bias, and limitations in data availability, data structure and identification accuracy. Defining cases or disease groups by a single clinical concept (e.g., a code from the International Statistical Classification of Diseases and Related Health Problems) is often insufficient to produce reliable results. In addition, the performance of such concepts in identifying different diseases varies greatly, and cohort selection suffers from low recall, frequent failure of semantic matching and incomplete retrieval results. The following efforts have been made to address these problems.

The first approach simply uses clustering for retrieval: similar items are grouped into classes according to some similarity measure. During retrieval, the similarity between the query vector and each class is computed, the classes whose similarity exceeds a threshold are selected, the similarity between the query vector and each vector in those classes is then computed, and the R closest items are returned in ranked order.

The second approach is text semantic modeling. LDA is the most classical and most interpretable method for modeling the latent topics of text. It is a three-layer Bayesian probability model comprising words, topics and documents. As an unsupervised machine learning technique, it can identify latent topic information in a large document collection or corpus. It adopts a bag-of-words representation in which each document is treated as a word frequency vector, so that text is converted into numerical information that is easy to model. Topic modeling techniques first map queries and documents into a latent space and then match them there.

In the preprocessing of electronic health records, the following steps are generally taken: the text is first segmented into words, and then stop words regarded as having no value, such as 'thus' and 'normal', together with negation words, are removed from the segmented result. Because an electronic health record is a doctor's professional description of a patient's health condition, its highly specialized nature and doctors' writing habits must be taken into account. One evident linguistic feature of electronic health records is the frequent use of negation words; for example, the denial of a symptom may take the form "no / not heard / denies / does not report / not accompanied by + symptom". If these frequent negation words are not handled well, they can seriously mislead retrieval. A common way to handle negation is to use ConText to delete all negated parts from the medical record before indexing. However, deleting all negated parts outright greatly reduces the precision and effectiveness of retrieval.

When the above methods are used for cohort selection, problems such as low recall and incomplete retrieval results arise, mainly because they rely on exact matching of words or entities. The invention improves the preprocessing of negation words, learns embeddings of medical concepts with a deep learning algorithm, obtains disease-related concepts by query expansion, and ranks and outputs patients according to their distance from the query. This effectively improves the accuracy of semantic matching, achieves more accurate information matching quickly, and improves the recall of cohort selection.

Disclosure of Invention

The invention aims to overcome the defects of the prior art, namely the low recall, frequent failure of semantic matching and incomplete retrieval results of traditional cohort selection methods, and to effectively improve accuracy in cohort identification over electronic health records through retrospective cohort selection based on word vector modeling and information retrieval.

The invention solves this technical problem with the following technical scheme:

Step 1: preprocess the electronic health record data.

Step 2: perform named entity recognition with word clustering.

Step 3: process the negation-word parts of the electronic health records.

Step 4: learn medical concept embeddings using the skip-gram algorithm.

Step 5: obtain patient representations from the medical concept embeddings.

Step 6: obtain the concepts related to the query through query expansion.

Step 7: measure the distance between each query-expansion vector and each candidate patient document, and output the documents in ranked order.

In step 2, named entity recognition is performed on the electronic health record data, and a word clustering method is introduced to update the recognition result. Similar items are grouped into classes according to a similarity measure; during named entity recognition, the similarity between an entity and each entity class is computed, and the class with the highest similarity to the entity is selected as its class. Finally, the initial named entity recognition result is refined using this class assignment.

In step 3, after the electronic health record data has been preprocessed in step 1, the negation-word parts of the records receive special handling. Different kinds of text show different domain characteristics: news text typically develops its content closely around the central point stated in the headline, whereas medical text more often approaches its central point by semantic elimination, that is, by ruling things out. In electronic health records, negation words are extremely frequent; for example, the negation of a symptom is often expressed as "no / not heard / denies / does not report / not accompanied by + symptom". The usual approach is to use ConText directly to delete all negated parts from the medical record before indexing. The invention instead handles the negated parts of the text by a "reverse selection" method: first, the entities that follow a negation word are found, and this search result set is defined as A. Next, the set of medical concepts produced by named entity recognition is defined as B. Finally, the intersection of B with A is deleted from B, giving the set C of medical concepts that have a positive influence on cohort selection.

In step 4, a word vector modeling method (the skip-gram algorithm) is applied to each patient's data to learn embeddings of the medical concepts.

In step 5, for each time-interval region of a patient's clinical record, a simple sentence-aggregation method is used to obtain the patient representation from the medical concept embeddings.

In step 6, for each cohort to be queried, its related concepts are obtained through query expansion. The expansion terms are added to the original query; for each query vector, the cosine distance to each patient is measured, and the distance to the closest vector of that patient is taken as the final score. This operation is repeated for all expansion-term vectors.

In step 7, the patients are ranked according to their distance from the query, and the candidate documents are output according to the ranking.

Drawings

FIG. 1 is a schematic flow diagram of the system of the present invention.

Detailed Description

The embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

Step 1: preprocessing of electronic health record data.

This step mainly comprises:

(1) Word segmentation

The data may be segmented into words in various ways, for example with the jieba word segmentation tool.

(2) Removal of stop words other than negation words

A stop-word list suited to the project is constructed that excludes negation words, and the stop words on this list are removed from the data.
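The stop-word step above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the invention: the record is assumed to have been segmented into tokens already (for instance by jieba), English tokens are used for readability, and the word lists `NEGATION_WORDS` and `GENERAL_STOP_WORDS` and the function `remove_stop_words` are hypothetical names introduced here.

```python
# Hypothetical word lists for illustration; a real project would curate its own.
NEGATION_WORDS = {"no", "not", "denies", "without"}
GENERAL_STOP_WORDS = {"the", "a", "of", "thus", "no", "not"}

# Stop-word list described in the text: general stop words minus negation words.
STOP_WORDS = GENERAL_STOP_WORDS - NEGATION_WORDS


def remove_stop_words(tokens):
    """Drop stop words from a segmented record while keeping negation words."""
    return [tok for tok in tokens if tok not in STOP_WORDS]


# Tokens as they might come out of a word segmenter (assumed input).
tokens = ["the", "patient", "denies", "fever", "thus", "no", "cough"]
filtered = remove_stop_words(tokens)
```

Keeping the negation words at this stage is what allows the later negation-handling step to find the entities they modify.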

Step 2: named entity recognition introducing word clustering.

Named entity recognition is performed on the electronic health record data, and a word clustering method is introduced to update the recognition result. Similar items are grouped into classes according to a similarity measure; during named entity recognition, the similarity between an entity and each entity class is computed, and the class with the highest similarity to the entity is selected as its class.

The similarity between an entity and an entity class is defined as the similarity between the entity and its closest entity in the class. The similarity between two entities is defined as follows. Given two words $W_1$ and $W_2$, if $W_1$ has $n$ concepts $S_{11}, S_{12}, \ldots, S_{1n}$ and $W_2$ has $m$ concepts $S_{21}, S_{22}, \ldots, S_{2m}$, then the similarity of $W_1$ and $W_2$ is the maximum similarity over all pairs of concepts:

$$\mathrm{Sim}(W_1, W_2) = \max_{i = 1, \ldots, n;\; j = 1, \ldots, m} \mathrm{sim}(S_{1i}, S_{2j})$$

Finally, the entity class with the highest similarity to the entity is selected as the class of the entity. Standardization is then applied: all ICD codes are normalized to 4 digits, and drug data, vital signs, treatment procedures, laboratory test results and the like are standardized as well.
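As an illustration of the class assignment described in this step, the sketch below computes the maximum pairwise concept similarity between two words and assigns an entity to the class whose closest member is most similar. The text does not say how the similarity of two concepts is computed; cosine similarity over concept vectors is an assumption made here, and the names `word_similarity`, `assign_class`, `lexicon` and `classes` as well as the toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two vectors (assumed concept-level sim)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def word_similarity(concepts_1, concepts_2):
    """Sim(W1, W2): maximum similarity over all pairs of concepts."""
    return max(cosine_similarity(s1, s2) for s1 in concepts_1 for s2 in concepts_2)


def assign_class(entity_concepts, classes, lexicon):
    """Return the class whose closest member is most similar to the entity."""
    def class_similarity(members):
        return max(word_similarity(entity_concepts, lexicon[m]) for m in members)

    return max(classes, key=lambda name: class_similarity(classes[name]))


# Toy concept vectors per word (hypothetical values).
lexicon = {
    "cough": [[1.0, 0.0]],
    "fever": [[0.9, 0.1]],
    "aspirin": [[0.0, 1.0]],
}
classes = {"symptom": ["cough", "fever"], "drug": ["aspirin"]}

# An unseen entity with two candidate concepts.
predicted = assign_class([[0.8, 0.2], [0.1, 0.3]], classes, lexicon)
```

With these toy values the entity is assigned to the "symptom" class, because its first concept lies closest to "fever".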

Step 3: processing the negation-word parts of the electronic health records.

The invention gives special handling to the negation-word parts of the electronic health records, where negation words are extremely frequent. For example, the negation of a symptom is often expressed as "no / not heard / denies / does not report / not accompanied by + symptom". The usual approach is to use ConText directly to delete all negated parts from the medical record before indexing. The invention instead handles the negated parts of the text by a "reverse selection" method: first, the entities that follow a negation word are found, and this search result set is defined as A. Next, the set of medical concepts produced by named entity recognition is defined as B. Finally, the intersection of B with A is deleted from B, giving the set C of medical concepts that have a positive influence on cohort selection.
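The "reverse selection" can be sketched with plain set operations. This is a simplified illustration rather than the implementation of the invention: it assumes single-token entities that immediately follow a negation cue, and the cue list `NEGATION_CUES` and the function `negated_entities` are hypothetical.

```python
NEGATION_CUES = {"no", "denies", "without"}  # hypothetical cue list


def negated_entities(tokens, entities):
    """Set A: entities that directly follow a negation cue."""
    return {
        tok
        for prev, tok in zip(tokens, tokens[1:])
        if prev in NEGATION_CUES and tok in entities
    }


tokens = ["patient", "denies", "fever", "reports", "cough", "no", "dyspnea"]

set_b = {"fever", "cough", "dyspnea"}    # B: named entity recognition result
set_a = negated_entities(tokens, set_b)  # A: entities preceded by a negation word
set_c = set_b - (set_a & set_b)          # C: B with the intersection of A and B removed
```

Only the affirmed concept ("cough" in this toy record) remains in C and goes on to the embedding and retrieval steps.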

Step 4: learning medical concept embeddings using the skip-gram algorithm.

The skip-gram algorithm is used to learn embeddings of the medical concepts. For each patient, the patient data is partitioned into 10-day intervals, and duplicate items within each interval region are deleted. The data in each region is then randomly shuffled, so that each interval region is represented as a sequence of unique medical concepts, and this sequence is fed to the word2vec algorithm as a sentence for training. Finally, each medical concept is represented as a 200-dimensional embedding vector, and all medical concepts are mapped into the same metric space.
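The preparation of training sentences described here might look like the sketch below. It covers only the partitioning, de-duplication and shuffling; the word2vec training itself is left to an existing implementation. Two points are assumptions not stated in the text: intervals are counted from each patient's first recorded date, and events arrive as `(patient_id, date, concept)` tuples. The function name `build_sentences` and the concept codes are hypothetical.

```python
import random
from collections import defaultdict
from datetime import date

INTERVAL_DAYS = 10  # interval length stated in the text


def build_sentences(events, seed=0):
    """Turn (patient_id, date, concept) events into one concept list per region."""
    rng = random.Random(seed)

    # Assumption: intervals are measured from each patient's first record.
    first_day = {}
    for pid, day, _ in events:
        first_day[pid] = min(day, first_day.get(pid, day))

    regions = defaultdict(set)  # a set removes duplicates within a region
    for pid, day, concept in events:
        index = (day - first_day[pid]).days // INTERVAL_DAYS
        regions[(pid, index)].add(concept)

    sentences = []
    for key in sorted(regions):
        concepts = sorted(regions[key])  # fixed order so the shuffle is reproducible
        rng.shuffle(concepts)
        sentences.append(concepts)
    return sentences


events = [
    ("p1", date(2019, 1, 1), "ICD:4280"),
    ("p1", date(2019, 1, 3), "DRUG:furosemide"),
    ("p1", date(2019, 1, 3), "ICD:4280"),  # duplicate within the same region
    ("p1", date(2019, 1, 15), "LAB:bnp_high"),
]
sentences = build_sentences(events)
```

The resulting sentences could then be passed to a word2vec implementation configured for skip-gram and 200 dimensions, for example gensim's `Word2Vec` with `sg=1` and `vector_size=200` in gensim 4.x.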

Step 5: obtaining patient representations from the medical concept embeddings.

A simple sentence-aggregation method is applied to each time-interval region of a patient's clinical record. The patient vector is modeled from the vector representations of medical concepts learned in step 4: the record data of the patient within one interval region is selected, the step-4 vectors of the medical concepts it contains are added and averaged, and the projection of this average vector onto the first principal component is subtracted; the result serves as the representation of the patient. Removing the dominant shared component in this way yields a more discriminative aggregate embedding. The weight of a phenotype $W$ is calculated as:

$$\mathrm{weight}(W) = \frac{a}{a + P(W)}$$

where $a$ is a parameter and $P(W)$ is the (estimated) frequency of the phenotype over the entire data set.
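One possible reading of this aggregation is sketched below in plain Python. Several points are assumptions rather than statements of the text: the weight a / (a + P(W)) is applied to each concept vector before averaging, the first principal component is estimated over all region vectors by power iteration without centering, and the value `a=1e-3` is an arbitrary choice. The function names and the toy 3-dimensional vectors (the text uses 200 dimensions) are hypothetical.

```python
import math


def weighted_average(concepts, vectors, freq, a=1e-3):
    """Average the concept vectors of one region with weight a / (a + P(w))."""
    dim = len(vectors[concepts[0]])
    total = [0.0] * dim
    for concept in concepts:
        weight = a / (a + freq[concept])
        for k in range(dim):
            total[k] += weight * vectors[concept][k]
    return [x / len(concepts) for x in total]


def first_principal_direction(rows, iterations=100):
    """Dominant direction of the rows, found by power iteration on X^T X."""
    dim = len(rows[0])
    v = [1.0] * dim  # fixed start; assumes it is not orthogonal to the answer
    for _ in range(iterations):
        proj = [sum(r[k] * v[k] for k in range(dim)) for r in rows]
        v = [sum(p * r[k] for p, r in zip(proj, rows)) for k in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v


def remove_direction(row, unit):
    """Subtract from row its projection onto the unit vector."""
    dot = sum(x * y for x, y in zip(row, unit))
    return [x - dot * y for x, y in zip(row, unit)]


# Toy embeddings and phenotype frequencies (hypothetical values).
vectors = {
    "ICD:4280": [1.0, 0.2, 0.1],
    "DRUG:furosemide": [0.8, 0.4, 0.0],
    "LAB:bnp_high": [0.1, 0.9, 0.3],
}
freq = {"ICD:4280": 0.02, "DRUG:furosemide": 0.01, "LAB:bnp_high": 0.001}
regions = [["ICD:4280", "DRUG:furosemide"], ["LAB:bnp_high"], ["ICD:4280", "LAB:bnp_high"]]

averages = [weighted_average(r, vectors, freq) for r in regions]
pc = first_principal_direction(averages)
representations = [remove_direction(avg, pc) for avg in averages]
```

Rare phenotypes receive weights close to 1 and frequent ones are damped, so after the shared direction is removed the remaining vector emphasizes what distinguishes one region from another.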

Step 6: obtaining the concepts related to the query through query expansion.

For each cohort to be queried, defined for example by a disease, its related concepts are obtained through query expansion. The N expansion terms of each of the types "disease", "drug" and "symptom" that are closest to the query ICD-9 code are added to the original query. For each vector of the expanded query, the cosine distance to each patient is measured, and the distance to the closest vector of that patient is taken as the final score. This operation is repeated for all expansion-term vectors.
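The expansion and scoring in this step can be illustrated with the sketch below. It is a simplification under stated assumptions: the N nearest concepts are taken regardless of type rather than N per type, each patient is represented by a list of region vectors, and the text does not say how the per-term distances are combined, so they are kept separate here. The names `expand_query` and `patient_distance`, the concept codes and the toy 2-dimensional vectors are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def expand_query(query, concept_vectors, n):
    """Return the query concept plus its n nearest concepts in embedding space."""
    others = [c for c in concept_vectors if c != query]
    others.sort(key=lambda c: cosine_distance(concept_vectors[query], concept_vectors[c]))
    return [query] + others[:n]


def patient_distance(term_vector, region_vectors):
    """Distance from one query term to a patient: the closest region vector wins."""
    return min(cosine_distance(term_vector, r) for r in region_vectors)


# Toy concept embeddings and patient region vectors (hypothetical values).
concept_vectors = {
    "ICD9:428": [1.0, 0.0],
    "DRUG:furosemide": [0.9, 0.1],
    "SYM:dyspnea": [0.8, 0.3],
    "ICD9:250": [0.0, 1.0],
}
patients = {"p1": [[0.9, 0.2], [0.1, 0.9]], "p2": [[0.0, 1.0]]}

expanded = expand_query("ICD9:428", concept_vectors, n=2)
distances = {
    pid: {term: patient_distance(concept_vectors[term], regions) for term in expanded}
    for pid, regions in patients.items()
}
```

In this toy data patient p1 has a region vector close to the query code and its expansions, so its distances are much smaller than those of p2.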

Step 7: ranking the patients by their distance from the query and outputting the candidate documents in ranked order.

The distance between the final query and each candidate patient record document is obtained, and the candidate documents are ranked and output according to this distance:

$$F(q, d \mid G) = w_0 \, D_{ws}(q, d)$$

where $D_{ws}(q, d)$ is the distance between the query and the patient document obtained in step 6, $w_0$ is a learned parameter, and $F(q, d \mid G)$ is the final ranking result.
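A minimal sketch of the final ranking follows. It assumes that one distance per patient document is already available from step 6 and that the learned parameter is positive, so that a smaller score means a closer match; the function name `rank_patients`, the default `w0=1.0` and the distance values are hypothetical.

```python
def rank_patients(distances, w0=1.0):
    """Score each patient document as F = w0 * D_ws and sort by ascending score."""
    scores = {pid: w0 * d for pid, d in distances.items()}
    return sorted(scores.items(), key=lambda item: item[1])


distances = {"p1": 0.42, "p2": 0.05, "p3": 0.31}  # hypothetical step-6 distances
ranking = rank_patients(distances)
ranked_ids = [pid for pid, _ in ranking]
```

The candidate documents would then be output in the order given by `ranked_ids`, closest patient first.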

It should be emphasized that the embodiments described herein are illustrative rather than restrictive, and thus the present invention is not limited to the embodiments described in the detailed description, but also includes other embodiments that can be derived from the technical solutions of the present invention by those skilled in the art.

Claims

1. A retrospective cohort selection method and device based on word vector modeling and information retrieval, comprising the following steps:

Step 1: preprocessing the electronic health record data.

Step 2: performing named entity recognition with word clustering.

Step 5: obtaining patient representations from the medical concept embeddings.

Step 7: measuring the distance between each query-expansion vector and the candidate patient documents, and outputting the documents in ranked order.

2. The retrospective cohort selection method and device based on word vector modeling and information retrieval as claimed in claim 1, wherein step 1 is implemented as follows: the data is segmented into words in any of various ways, for example with the jieba word segmentation tool; stop words other than negation words are then removed.

3. The retrospective cohort selection method and device based on word vector modeling and information retrieval as claimed in claim 1, wherein step 2 is implemented as follows: named entity recognition is performed on the electronic health record data, and a word clustering method is introduced to update the recognition result. Similar items are grouped into classes according to a similarity measure; during named entity recognition, the similarity between an entity and each entity class is computed, and the class with the highest similarity to the entity is selected as its class. Finally, the initial named entity recognition result is refined using this class assignment.

4. The retrospective cohort selection method and device based on word vector modeling and information retrieval as claimed in claim 1, wherein step 3 is implemented as follows: the negation-word parts of the electronic health records receive special handling. The negated parts of the text are processed by a "reverse selection" method: first, the entities that follow a negation word are found, and this search result set is defined as A. Next, the set of medical concepts produced by named entity recognition is defined as B. Finally, the intersection of B with A is deleted from B, giving the set C of medical concepts that have a positive influence on cohort selection.

5. The retrospective cohort selection method and device based on word vector modeling and information retrieval as claimed in claim 1, wherein medical concept embeddings are learned using the skip-gram algorithm, implemented as follows: for each patient, the patient data is partitioned into 10-day intervals, and duplicate items within each interval region are deleted. The data in each region is then randomly shuffled, so that each interval region is represented as a sequence of unique medical concepts, and this sequence is fed to the word2vec algorithm as a sentence for training. Finally, each medical concept is represented as a 200-dimensional embedding vector, and all medical concepts are mapped into the same metric space.

6. The retrospective cohort selection method and device based on word vector modeling and information retrieval as claimed in claim 1, wherein, for each time-interval region of a patient's clinical record, a simple sentence-aggregation method is used to obtain the patient representation from the medical concepts.

7. The retrospective cohort selection method and device based on word vector modeling and information retrieval as claimed in claim 1, wherein step 6 is implemented as follows: for each cohort to be queried, its related concepts are obtained through query expansion. The expansion terms are added to the original query; for each query vector, the cosine distance to each patient is measured, and the distance to the closest vector of that patient is taken as the final score. This operation is repeated for all expansion-term vectors.

8. The retrospective cohort selection method and device based on word vector modeling and information retrieval as claimed in claim 1, wherein step 7 is implemented as follows: the patients are ranked according to their distance from the query, and the candidate documents are output according to the ranking.