CN111292818B - Query reconstruction method for electronic medical record description - Google Patents

Query reconstruction method for electronic medical record description Download PDF

Info

Publication number
CN111292818B
Authority
CN
China
Prior art keywords
query
sub
text
medical record
electronic medical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010051309.9A
Other languages
Chinese (zh)
Other versions
CN111292818A (en)
Inventor
方钰
姚窅
陆明名
黄欣
翟鹏珺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Priority to CN202010051309.9A priority Critical patent/CN111292818B/en
Publication of CN111292818A publication Critical patent/CN111292818A/en
Application granted granted Critical
Publication of CN111292818B publication Critical patent/CN111292818B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Physics & Mathematics (AREA)
  • Public Health (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Primary Health Care (AREA)
  • Evolutionary Biology (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A query reconstruction method for electronic medical record descriptions, that is, a query reconstruction method for the long electronic medical record texts used in clinical decision support. Query reconstruction in information retrieval refers to the process of automatically processing the information input by a user to form a new query expression; its aim is to mine the user's real query intention from a large amount of complex information and improve the retrieval effect. In the medical field, medical literature retrieval is an important application of clinical decision support, in which electronic medical record texts are used as the query input to acquire the required information from massive medical texts. Electronic medical record descriptions are very complex, and query reconstruction is needed to obtain an effective retrieval result. To address this, the method targets long electronic medical record descriptions containing redundant information: it uses sub-query segmentation and screening together with query quality prediction, selects the sub-query with the highest query quality to replace the original query, predicts the query intention, and reconstructs the query, thereby improving retrieval effectiveness.

Description

Query reconstruction method for electronic medical record description
Technical Field
The invention relates to the field of text retrieval, in particular to processing of queries in text retrieval.
Background
Information retrieval is the process of finding the information a user needs in unstructured, large-scale data, and is an effective method for acquiring key information from massive data. As early as the last century, medical experts considered using data and models to assist clinical decisions, and thus proposed the clinical decision support system. A clinical decision support system is a medical information technology application that, among other things, applies information retrieval technology to support diagnosis, that is, it uses the patient description in a medical record as a query to retrieve relevant medical literature as decision support. In this way, a clinical decision support system can effectively mine deep medical data, improve the efficiency of medical services, and accelerate the informatization of healthcare.
With the development of the medical and health industry and the progress of science and technology, the level of technology and informatization in the medical industry keeps improving. As a very active branch of clinical decision support systems, the diagnostic decision support system has long been a hot spot of research and application at home and abroad. A diagnostic decision support system is a computer application that provides auxiliary support for doctors during diagnostic decision-making and can offer clinicians substantial medical support, helping them make the most reasonable diagnosis and choose the best treatment. A great deal of research shows that diagnostic decision support systems can effectively overcome the limitations of a clinician's knowledge during disease diagnosis, reduce human negligence, relatively reduce medical expenses, and safeguard medical quality.
To better study medical text retrieval technology, the text retrieval conference TREC introduced the Clinical Decision Support (CDS) task in 2014. The task gives the description of an electronic medical record as the input query, and participants search an existing medical literature collection to return the documents most relevant to the query, thereby serving the doctor's real need for assistance in assessing the patient. The electronic medical records used as queries are stored as free text, the medical record data come from the ICU clinical database MIMIC-III, and the target document collection used as the search library comes from PubMed Central, the U.S. full-text database of biomedical and life sciences literature.
Judging from the existing research results on the TREC CDS task, the main document retrieval work focuses on processing the original query statement, the main method being keyword-based query expansion; query processing work aimed at long electronic medical record texts is very rare. Electronic medical record texts are highly ambiguous, with a large amount of redundancy and unclear semantics, which undoubtedly makes their processing difficult in the medical literature retrieval task.
Disclosure of Invention
Judging from the existing research results on the TREC CDS task, the main document retrieval work focuses on processing the original query statement, the main method being keyword-based query expansion; query processing work aimed at long electronic medical record texts is very rare. Common queries are several to a dozen query terms long, whereas an electronic medical record description in clinical decision support averages 50-200 query terms, and most commercial and academic search engines do not handle such long queries well, which leaves the user with the task of reducing the original query to a shorter one. On the other hand, a definite query intention allows targeted retrieval and improves retrieval effectiveness.
To address these problems, the invention reconstructs the query statement: an SVM classifier is used to obtain the query intention of the query statement, the sub-queries of the electronic medical record are generated and screened, the optimal sub-query among them is then obtained by training a query quality prediction model, and the optimal sub-query is combined with the query intention to generate the reconstructed query statement.
In order to achieve the purpose, the technical scheme provided by the invention is as follows:
the invention provides a query reconstruction method aiming at electronic medical record description, which comprises the following steps:
step 1, preprocessing an electronic medical record text and a medical literature text in a data set;
step 2, training an SVM classifier to predict the query intention of the electronic medical record text;
step 3, acquiring all sub-queries of the electronic medical record text and performing preliminary pre-screening on the sub-queries;
step 4, training a query quality prediction model and selecting the optimal sub-query from the sub-queries output by the pre-screening in step 3;
step 5, combining the query intention obtained in step 2 with the optimal sub-query output in step 4 to obtain the final reconstructed query.
Advantageous effects
The invention performs query reconstruction on long electronic medical record descriptions containing redundant information, including prediction of the query intention and reduction of the original query based on query quality prediction. On the one hand, the invention analyzes the semantics of the original query and trains a classifier to judge the query intention. On the other hand, the invention analyzes the set of all sub-queries of the original query, represents each sub-query by a group of query quality indices, and for the first time proposes an index reflecting query expansion performance; on this basis, a query quality prediction model is trained to obtain the sub-query with the highest query quality to replace the original query.
A query reconstruction experiment was carried out on the TREC CDS data set, and a significant performance improvement was observed, which demonstrates that query reconstruction improves the retrieval results. The query reconstruction method for long electronic medical record texts is of great significance for overcoming the limitations of a clinician's knowledge in the disease diagnosis process, reducing human negligence, relatively reducing medical expenses, and safeguarding medical quality.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic flow chart of a query reformulation method;
FIG. 2 is an example of a TREC query topic;
FIG. 3 shows the result of processing an electronic medical record text with the query reformulation method.
Detailed Description
The specific implementation process of the invention is shown in fig. 1, and comprises the following 5 steps:
step 1, preprocessing an electronic medical record text and a medical literature text in a data set;
step 2, training an SVM classifier to predict the query intention of the electronic medical record text;
step 3, acquiring all sub-queries of the electronic medical record text and performing preliminary pre-screening on the sub-queries;
step 4, training a query quality prediction model and selecting the optimal sub-query from the sub-queries output by the pre-screening in step 3;
step 5, combining the query intention obtained in step 2 with the optimal sub-query output in step 4 to obtain the final reconstructed query.
Each step is described in detail below.
Step 1: preprocessing electronic medical record text and medical literature text in a data set
As an example, the data set used is the TREC CDS Track data set, which includes electronic medical record texts and a medical literature collection. The electronic medical record texts come from the 90 query topics (Topics) defined by the TREC conference in 2014-2016, the medical record data come from the ICU clinical database MIMIC-III, and the description field of each Topic is selected as the original query. FIG. 2 shows an example of a query topic (Topic). The medical literature, on the other hand, is a collection of more than 730,000 medically relevant documents from PubMed Central, the U.S. full-text database of biomedical and life sciences literature.
In the step 1, the electronic medical record text and the medical literature text need to be preprocessed, and the method specifically comprises the following steps:
1.1, extracting plain text
Since the electronic medical record text is already plain text, this step is not needed for it. The medical literature texts are web files stored in XML format, from which useless CSS and JS code must be removed; the required plain text data, including the title, abstract, keywords, and body of each document, is extracted according to the XML tags, so that the preprocessed literature texts have a uniform format.
The plain text of the electronic medical records and the extracted plain text of the medical literature are provided to step 1.2.
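As a rough illustration of this extraction step, the following sketch pulls the plain text of one document out of a PMC-style XML file with Python's standard library; the tag names (article-title, abstract, kwd, body) are assumptions based on the PubMed Central format and are not specified by the patent.

```python
import xml.etree.ElementTree as ET

def extract_plain_text(xml_path):
    """Collect title, abstract, keywords and body text from one PMC-style
    XML document, dropping all markup (tag names are assumed)."""
    root = ET.parse(xml_path).getroot()
    parts = []
    for tag in ("article-title", "abstract", "kwd", "body"):
        for node in root.iter(tag):
            # itertext() walks the subtree and yields only the text content,
            # so markup embedded in the XML is discarded
            parts.append(" ".join(t.strip() for t in node.itertext() if t.strip()))
    return "\n".join(parts)
```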
1.2, removing stop words
Stop words in the plain text are removed using a prepared stop-word vocabulary; stop words include words that carry no semantic information and words used with very high frequency.
The result after the stop word is removed is provided to step 1.3;
1.3, restoring word forms
Words in different inflected forms are merged and restored to their root form: English words with the same meaning appear with different tense and inflectional variations, so each word is restored to its base form.
After the word forms are restored, the text preprocessing of step 1 is complete; the preprocessed electronic medical record texts are provided to steps 2 and 3, and the preprocessed medical literature texts are provided to step 4.
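A minimal sketch of steps 1.2 and 1.3, using NLTK's English stop-word list and WordNet lemmatizer as stand-ins for the stop-word vocabulary and word-form restoration the patent refers to:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

for pkg in ("punkt", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

STOP_WORDS = set(stopwords.words("english"))   # stand-in for the patent's stop-word vocabulary
LEMMATIZER = WordNetLemmatizer()

def preprocess(text):
    """Lower-case, tokenize, drop stop words, and restore words to their root form."""
    tokens = nltk.word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in STOP_WORDS]
    return [LEMMATIZER.lemmatize(t) for t in tokens]
```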
Step 2: method for predicting query intention of electronic medical record text by training SVM (support vector machine) three classifiers
And 2, training an SVM classifier to judge the query intention by using the preprocessed electronic medical record text obtained in the step 1 as a training set, and specifically comprising the following steps.
2.1, labeling each electronic medical record text in the training set with one of three class labels: if the text content of the electronic medical record belongs to diagnosis (Diagnosis), it is marked as 1; if it belongs to a treatment scheme (Treatment), it is marked as 2; if it belongs to a diagnostic examination or test (Test), it is marked as 3. The annotated results are provided to step 2.2.
2.2, training the three-class classifier.
The three-class classifier is trained with the existing SVM algorithm; the training input consists of the features of the electronic medical record text and the class labels assigned in step 2.1. Training the classifier requires two features of the electronic medical record text: (1) the TF-IDF values; (2) the semantic information.
(1) TF-IDF is a statistical method for assessing how important a word is to a document in a corpus. The term frequency (TF) is the frequency with which a given term appears in the document. The inverse document frequency (IDF) is obtained by dividing the total number of documents by the number of documents containing the word and taking the base-10 logarithm of the quotient. The TF-IDF value is the product of these two values:
TF-IDF(ω) = (n_ω / N) · log₁₀(N_d / N_ω)
where n_ω is the number of occurrences of the word ω in the document, N is the total number of words in the document, N_d is the total number of documents in the corpus, and N_ω is the number of documents in the corpus that contain the word ω.
(2) The semantic information consists of three parts: whether a diagnosis result is included (value 0/1), whether the examination is complete (value 0/1), and the query text length (value 0-200).
The trained three-class classifier is provided to step 2.3.
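A minimal sketch of step 2.2, assuming scikit-learn's SVC; the way the TF-IDF features and the three semantic features are assembled, and the record field names used below, are illustrative assumptions rather than the patent's exact configuration:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

def semantic_features(record):
    """The three semantic features of step 2.2: diagnosis present (0/1),
    examination complete (0/1), query text length capped at 200 words."""
    return [record["has_diagnosis"], record["exam_complete"],
            min(len(record["text"].split()), 200)]

def train_intent_classifier(records, labels):
    """Train the three-class SVM (1 = diagnosis, 2 = treatment, 3 = test)."""
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform([r["text"] for r in records]).toarray()
    sem = np.array([semantic_features(r) for r in records], dtype=float)
    X = np.hstack([tfidf, sem])
    clf = SVC(kernel="linear")   # SVC handles the three classes one-vs-one
    clf.fit(X, labels)
    return vectorizer, clf
```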
2.3, the electronic medical record text is input into the trained classifier, and the classification result (namely the query intention) is provided to step 5.
And step 3: all sub-queries of the electronic medical record text are obtained and subjected to preliminary pre-screening
In theory, a query containing n query terms yields 2^n − 1 sub-queries. For example, the query statement "fever cough headache", which contains 3 query terms, can be split into the sub-queries "fever", "cough", "headache", "fever cough", "fever headache", "cough headache", and "fever cough headache". It is impractical to exhaustively rank all possible sub-queries, so the sub-queries must first be pre-screened. Sub-query pre-screening comprises the following steps.
3.1, selecting the sub-queries with a length of 3 to 10 words from all sub-queries of the electronic medical record text. The sub-query length is the number of words in the query. Research shows that the query length that optimizes retrieval effectiveness is between 3 and 6; considering the long electronic medical record texts targeted by the invention, the maximum length threshold is set to 10. The result is provided to step 3.2.
3.2, calculating the average mutual information of each sub-query obtained in step 3.1 and selecting the 30 sub-queries with the highest average mutual information. The average mutual information of a sub-query is calculated as follows:
I(x, y) = log( (N_c · n(x, y)) / (n(x) · n(y)) )
where n(x, y) is the frequency with which the words x and y appear together within a 25-word window across the whole corpus, n(x) and n(y) are the frequencies with which the words x and y appear in the corpus, respectively, and N_c is the total number of words in the corpus. The mutual information is calculated for every pair of words in a sub-query, and the weighted average of these values is taken as the average mutual information of the sub-query.
Step 3 finally yields 30 pre-screened sub-queries, and the result is provided to step 4.
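A sketch of steps 3.1 and 3.2 follows; the cooccur and freq callables stand for pre-computed corpus statistics (co-occurrence counts within a 25-word window and corpus term frequencies), which the patent assumes are available but does not describe. Exhaustive subset enumeration is shown only for illustration and must be bounded in practice for 50-200 word records.

```python
from itertools import combinations
from math import log

def candidate_sub_queries(query_terms, min_len=3, max_len=10):
    """Step 3.1: enumerate sub-queries (term subsets) of length 3-10."""
    for k in range(min_len, min(max_len, len(query_terms)) + 1):
        yield from combinations(query_terms, k)

def avg_mutual_information(sub_query, cooccur, freq, corpus_size):
    """Step 3.2: mean pointwise mutual information over all term pairs.
    cooccur(x, y): co-occurrences of x and y within a 25-word window;
    freq(x): corpus frequency of x; corpus_size: total words in the corpus."""
    pairs = list(combinations(sub_query, 2))
    scores = [log(corpus_size * cooccur(x, y) / (freq(x) * freq(y)))
              for x, y in pairs if cooccur(x, y) > 0]
    return sum(scores) / len(pairs) if pairs else 0.0

def prescreen(query_terms, cooccur, freq, corpus_size, top_k=30):
    """Keep the 30 candidates with the highest average mutual information."""
    ranked = sorted(candidate_sub_queries(query_terms),
                    key=lambda sq: avg_mutual_information(sq, cooccur, freq, corpus_size),
                    reverse=True)
    return ranked[:top_k]
```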
And 4, step 4: training a query quality prediction model, and selecting the optimal sub-query from the pre-screened sub-queries
4.1 labeling sub-queries with query quality scores
One round of retrieval is performed for each pre-screened sub-query obtained in step 3; the target document set for the retrieval is the medical literature text set preprocessed in step 1. The search engine is Indri 5.11 from the Lemur open-source project. The retrieval result is compared against the relevance judgments provided by the TREC conference, the average precision of the retrieval is calculated, and this score is labeled as the query quality score of the sub-query. The sub-queries labeled with query quality scores are provided to step 4.2 as the result of this step.
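The quality label of step 4.1 is the average precision of one retrieval round; as a stand-in for Indri plus the TREC evaluation tooling, a minimal average precision computation looks like this:

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision of one ranked result list against TREC relevance
    judgments; used as the sub-query's query quality score."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0
```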
4.2 training query quality prediction model
The existing SVMRank algorithm is used for training the query quality prediction model, and indexes capable of representing the sub-query quality and the query quality scores marked in the step 4.1 need to be input during training.
Model training uses the following indices, calculated for each sub-query in the training set: (1) the inverse document frequency correlation index; (2) the simplified query clarity index; (3) the corpus/query similarity characteristic index; (4) the query expansibility index.
Before these indices are introduced, the symbols used in this step are defined. For a query Q, assume it contains the query terms ω_1, …, ω_n. In the corpus C, n(ω_i) denotes the frequency with which the query term ω_i occurs in the corpus, and n(ω_i, ω_j) (i ≠ j) denotes the frequency with which the query terms ω_i and ω_j occur together within a window of 25 words. N_c denotes the total number of words contained in the corpus, N_ω the number of documents in which the query term ω appears, and N_d the number of all documents in the corpus. P_c(ω) denotes the probability of the query term ω occurring in the corpus, P(ω|Q) the probability of ω occurring in the query statement Q, and S_ω the set of synonyms of the word ω.
(1) The inverse document frequency correlation index calculation formula is as follows:
IDF(ω) = log(N_d / N_ω)
where N_ω is the number of documents containing the word ω and N_d is the total number of documents in the corpus. For each sub-query, the sum, maximum, standard deviation, arithmetic mean, geometric mean, and harmonic mean of the IDF values of its query terms are calculated together as query quality indicators.
(2) The simplified query clarity index is calculated as follows:
SCS(Q) = Σ_{ω∈Q} P_ml(ω|Q) · log( P_ml(ω|Q) / P_c(ω) )
where P_ml(ω|Q) is the frequency of occurrence of the word ω in the query Q and P_c(ω) is the frequency with which the word ω appears in the corpus.
(3) The corpus/query similarity characteristic index calculation formula is as follows:
SCQ(ω) = (1 + ln n(ω)) · ln(1 + N_d / N_ω)
As with the inverse document frequency indices, the sum, maximum, standard deviation, arithmetic mean, geometric mean, and harmonic mean of the SCQ values of the query terms are calculated together as query quality indicators.
(4) Query expansibility index
The invention is the first to propose an index reflecting query expansion performance, namely the query expansibility index. It is calculated as follows:
QE(Q) = Σ_{ω∈Q} Σ_{α∈S_ω} P(α|Q)
where S_ω is the set of synonyms of the query term ω, and P(α|Q) is the probability of occurrence of the term α in the query model. Queries with higher expansibility can be considered to have higher query quality, because more relevant documents can be retrieved after they are expanded.
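A sketch of how the four index families might be computed for one sub-query from pre-computed corpus statistics. The df, cf and synonyms callables are assumed interfaces; the SCQ and expansibility formulas follow the reconstructions given above, so this is a hedged illustration rather than the patent's verbatim definition.

```python
from math import exp, log
from statistics import harmonic_mean, mean, stdev

def aggregate(values):
    """Sum, max, std-dev, arithmetic/geometric/harmonic means of per-term values."""
    vals = [v for v in values if v > 0] or [1e-9]
    geo = exp(mean(log(v) for v in vals))
    return [sum(vals), max(vals), stdev(vals) if len(vals) > 1 else 0.0,
            mean(vals), geo, harmonic_mean(vals)]

def quality_features(sub_query, df, cf, n_docs, corpus_words, synonyms):
    """Feature vector for one sub-query: IDF statistics, simplified clarity,
    corpus/query similarity (SCQ) statistics, and query expansibility.
    df(t): number of documents containing t; cf(t): corpus frequency of t."""
    q_len = len(sub_query)
    idfs = [log(n_docs / df(t)) for t in sub_query if df(t)]
    # Simplified clarity: divergence of the query language model from the corpus model
    scs = sum((sub_query.count(t) / q_len)
              * log((sub_query.count(t) / q_len) / (cf(t) / corpus_words))
              for t in set(sub_query) if cf(t))
    scqs = [(1 + log(cf(t))) * log(1 + n_docs / df(t))
            for t in sub_query if cf(t) and df(t)]
    # Expansibility (a plausible reading): probability mass of each term's
    # synonyms under the maximum-likelihood query model
    expans = sum(sub_query.count(a) / q_len
                 for t in sub_query for a in synonyms(t))
    return aggregate(idfs) + [scs] + aggregate(scqs) + [expans]
```

These feature vectors, paired with the average precision labels of step 4.1, are what an SVMRank-style learner would be trained on.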
The trained query quality prediction model is provided to step 4.3.
4.3, for each pre-screened sub-query obtained in step 3, the four indices characterizing sub-query quality in step 4.2 are calculated and input into the query quality prediction model trained in step 4.2 to obtain the query quality score of the sub-query. The sub-query with the highest query quality score among the 30 sub-queries is selected as the optimal sub-query, and the result is provided to step 5.
And 5: obtaining final reconstruction query by combining query intention and optimal sub-query
The query intention obtained in step 2 and the optimal sub-query obtained in step 4 are combined to obtain the reconstructed query as the final result.
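The patent does not spell out how the query intention and the optimal sub-query are merged; one purely illustrative realization appends an intention keyword to the selected sub-query:

```python
INTENT_TERMS = {1: "diagnosis", 2: "treatment", 3: "test"}  # assumed mapping of the three labels

def reconstruct_query(best_sub_query, intent_label):
    """Final reconstructed query: the optimal sub-query plus an intention keyword."""
    return " ".join(best_sub_query) + " " + INTENT_TERMS[intent_label]
```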

Claims (1)

1. A query reconstruction method for electronic medical record description is characterized by comprising
Step 1, preprocessing an electronic medical record text and a medical literature text in a data set;
step 2, training an SVM classifier to predict the query intention of the electronic medical record text;
step 3, acquiring all sub-queries of the electronic medical record text and performing preliminary pre-screening on the sub-queries;
step 4, training a query quality prediction model, and selecting the optimal sub-query from the sub-queries output in the pre-screening in the step 3;
step 5, combining the query intention obtained in the step 2 and the optimal sub-query output in the step 4 to obtain a final reconstruction query;
wherein
Step 1: preprocessing electronic medical record text and medical literature text in a data set
1.1, extracting plain text
since the electronic medical record text is already plain text, this step is not needed for it; the medical literature texts are web files stored in XML format, from which useless CSS and JS code must be removed, and the required plain text data, including the title, abstract, keywords and body of each document, is extracted according to the XML tags, so that the preprocessed literature texts have a uniform format;
providing the plain text of the electronic medical record and the extracted plain text of the medical literature to the step 1.2;
1.2, removing stop words
removing stop words in the plain text by using a prepared stop-word vocabulary, wherein the stop words comprise words without semantic information and words with very high usage frequency;
the result after the stop word is removed is provided to step 1.3;
1.3, restoring word forms
words in different inflected forms are merged and restored to their root form: English words with the same meaning appear with different tense and inflectional variations, so each word is restored to its base form;
after the word forms are restored, the text preprocessing of step 1 is complete; the preprocessed electronic medical record text is provided to step 2 and step 3, and the preprocessed medical literature text is provided to step 4;
step 2: method for predicting query intention of electronic medical record text by training SVM (support vector machine) three classifiers
Step 2, training an SVM classifier to judge the query intention by using the preprocessed electronic medical record text obtained in the step 1 as a training set, and specifically comprising the following steps;
2.1, labeling three classification labels for each electronic medical record text in the training set: if the text content of the electronic medical record belongs to diagnosis, marking as 1; if the text content of the electronic medical record belongs to the treatment scheme, marking as 2; if the text content of the electronic medical record belongs to a diagnosis and detection means, marking as 3; the annotated result is provided to step 2.2;
2.2 training three classifiers
The training of the three classifiers uses a Support Vector Machine (SVM) algorithm, and the characteristics of the electronic medical record text and the three classification labels marked in the step 2.1 are required to be input during the training; the training of the classifier requires the use of two features of the electronic medical record text: (1) a TF-IDF value; (2) semantic information;
(1) TF-IDF is a statistical method to evaluate the importance of a word to one of the documents in a corpus; the word frequency TF refers to the frequency of a given word appearing in the file, and the inverse file frequency IDF is obtained by dividing the total number of files by the number of files containing the word and taking the obtained quotient to be a logarithm with the base of 10; the TF-IDF value is the product of these two values and has the formula
TF-IDF(ω) = (n_ω / N) · log₁₀(N_d / N_ω)
wherein n_ω is the number of occurrences of the word ω in the document, N is the total number of words in the document, N_d is the total number of documents in the corpus, and N_ω is the number of documents in the corpus containing the word ω;
(2) the semantic information consists of three parts: whether a diagnosis result is included, with a value of 0 or 1; whether the examination is complete, with a value of 0 or 1; and the query text length, with a value of 0-200;
providing the three classifiers obtained by training to the step 2.3;
2.3, inputting the electronic medical record text into the trained three classifiers, and providing classification results, namely query intentions, to the step 5;
and step 3: all sub-queries of the electronic medical record text are obtained and subjected to preliminary pre-screening
The sub-query pre-screening comprises the following steps:
3.1, selecting sub-queries with the length of 3-10 from all the sub-queries of the electronic medical record text; sub-query length refers to the number of words in the query; the result is provided to step 3.2;
3.2, calculating the average mutual information quantity of each sub-query obtained in the step 3.1, and selecting 30 sub-queries with the highest mutual information quantity; the average mutual information quantity calculation formula of the sub-queries is as follows:
I(x, y) = log( (N_c · n(x, y)) / (n(x) · n(y)) )
wherein n(x, y) is the frequency with which the words x and y appear together within a 25-word window across the whole corpus, n(x) and n(y) are the frequencies with which the words x and y appear in the corpus, respectively, and N_c is the total number of words in the corpus; the mutual information is calculated for every pair of words in a sub-query, and the weighted average of these values is taken as the average mutual information of the sub-query;
step 3, finally obtaining 30 sub-queries after pre-screening, and providing the result to step 4;
step 4: training a query quality prediction model, and selecting the optimal sub-query from the pre-screened sub-queries
4.1 labeling sub-queries with query quality scores
performing one round of retrieval for each pre-screened sub-query obtained in step 3, wherein the target document set for the retrieval is the medical literature text set preprocessed in step 1; the search engine is Indri 5.11 from the Lemur open-source project; the retrieval result is compared with the relevance judgments provided by the TREC conference, the average precision of the retrieval is calculated, and this score is labeled as the query quality score of the sub-query; the sub-queries labeled with query quality scores are provided as the result of this step to step 4.2;
4.2 training query quality prediction model
The existing SVMRank algorithm is used for training the query quality prediction model, and indexes capable of representing the sub-query quality and the query quality scores marked in the step 4.1 are required to be input during training;
model training requires the use of the following indices, calculated for each sub-query in the training set: (1) an inverse document frequency correlation index; (2) simplifying the query definition index; (3) corpus/query similarity feature index; (4) querying an extensibility index;
before introducing the indexes respectively, defining the symbolic meanings used in the step; for a query Q, the query Q is,suppose it contains the query term ω1,…ωnN (ω) in corpus Ci) Representing a query term omegaiFrequency of occurrence in corpus, n (ω)ij) Representing query words omega in a corpusijWhere i ≠ j, the frequency of simultaneous occurrences in a window of 25 words in length, NcRepresenting the total number of words contained in the corpus, NωNumber of documents in which the query word ω appears, NdRepresenting the number of all documents in the corpus; pc(ω) represents the probability of the occurrence of the query word ω in the corpus, P (ω | Q) represents the probability of the occurrence of ω in the query sentence Q, SωA set of synonyms representing a word ω;
(1) the inverse document frequency correlation index calculation formula is as follows:
IDF(ω) = log(N_d / N_ω)
wherein N_ω is the number of documents containing the word ω and N_d is the total number of documents in the corpus; for each sub-query, the sum, maximum, standard deviation, arithmetic mean, geometric mean, and harmonic mean of the IDF values of its query terms are calculated together as query quality indicators;
(2) the simplified query clarity index is calculated as follows:
SCS(Q) = Σ_{ω∈Q} P_ml(ω|Q) · log( P_ml(ω|Q) / P_c(ω) )
wherein P_ml(ω|Q) is the frequency of occurrence of the word ω in the query Q and P_c(ω) is the frequency with which the word ω appears in the corpus;
(3) the corpus/query similarity characteristic index calculation formula is as follows:
SCQ(ω) = (1 + ln n(ω)) · ln(1 + N_d / N_ω)
as with the inverse document frequency indices, the sum, maximum, standard deviation, arithmetic mean, geometric mean, and harmonic mean of the SCQ values of the query terms are calculated together as query quality indicators;
(4) the query expansibility index
an index reflecting query expansion performance, namely the query expansibility index, is calculated as follows:
QE(Q) = Σ_{ω∈Q} Σ_{α∈S_ω} P(α|Q)
wherein S_ω is the set of synonyms of the query term ω, and P(α|Q) is the probability of occurrence of the term α in the query model; queries with higher expansibility have higher query quality, because more relevant documents can be retrieved after query expansion is performed on them;
providing the query quality prediction model obtained by training to the step 4.3;
4.3, for each pre-screened sub-query obtained in step 3, the four indices characterizing sub-query quality in step 4.2 are calculated and input into the query quality prediction model trained in step 4.2 to obtain the query quality score of the sub-query; the sub-query with the highest query quality score among the 30 sub-queries is selected as the optimal sub-query, and the result is provided to step 5;
step 5: obtaining the final reconstructed query by combining the query intention and the optimal sub-query
the query intention obtained in step 2 and the optimal sub-query obtained in step 4 are combined to obtain the reconstructed query as the final result.
CN202010051309.9A 2020-01-17 2020-01-17 Query reconstruction method for electronic medical record description Active CN111292818B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010051309.9A CN111292818B (en) 2020-01-17 2020-01-17 Query reconstruction method for electronic medical record description

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010051309.9A CN111292818B (en) 2020-01-17 2020-01-17 Query reconstruction method for electronic medical record description

Publications (2)

Publication Number Publication Date
CN111292818A CN111292818A (en) 2020-06-16
CN111292818B true CN111292818B (en) 2022-04-19

Family

ID=71030814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010051309.9A Active CN111292818B (en) 2020-01-17 2020-01-17 Query reconstruction method for electronic medical record description

Country Status (1)

Country Link
CN (1) CN111292818B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111898014B (en) * 2020-08-12 2021-07-13 哈尔滨工业大学 Cognitive service-oriented user intention identification method and system
CN115376643A (en) * 2022-10-26 2022-11-22 神州医疗科技股份有限公司 Case custom retrieval method and device, electronic equipment and computer readable medium
CN117789907B (en) * 2024-02-28 2024-05-10 山东金卫软件技术有限公司 Intelligent medical data intelligent management method based on multi-source data fusion
CN118016314B (en) * 2024-04-08 2024-06-18 北京大学第三医院(北京大学第三临床医学院) Medical data input optimization method and device and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108597614A (en) * 2018-04-12 2018-09-28 上海熙业信息科技有限公司 A kind of auxiliary diagnosis decision-making technique based on Chinese electronic health record
CN110364234A (en) * 2019-06-26 2019-10-22 浙江大学 Electronic health record intelligent storage analyzing search system and method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108597614A (en) * 2018-04-12 2018-09-28 上海熙业信息科技有限公司 A kind of auxiliary diagnosis decision-making technique based on Chinese electronic health record
CN110364234A (en) * 2019-06-26 2019-10-22 浙江大学 Electronic health record intelligent storage analyzing search system and method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Framing Electronic Medical Records as Polylingual Documents in Query Expansion";Edward W Huang;《AMIA Annu Symp Proc.》;20180416;全文 *
"电子病历检索中基于词权调整的查询重构";王文斌;《计算机应用与软件》;20160430;全文 *

Also Published As

Publication number Publication date
CN111292818A (en) 2020-06-16

Similar Documents

Publication Publication Date Title
CN111292818B (en) Query reconstruction method for electronic medical record description
CN109299239B (en) ES-based electronic medical record retrieval method
CN107341264B (en) Electronic medical record retrieval system and method supporting user-defined entity
He et al. Generating links to background knowledge: a case study using narrative radiology reports
CN112241457A (en) Event detection method for event of affair knowledge graph fused with extension features
US20130060793A1 (en) Extracting information from medical documents
Zhu et al. Using Discharge Summaries to Improve Information Retrieval in Clinical Domain.
Peng et al. A self-attention based deep learning method for lesion attribute detection from CT reports
Plaza et al. Studying the correlation between different word sense disambiguation methods and summarization effectiveness in biomedical texts
Xu et al. Learning to refine expansion terms for biomedical information retrieval using semantic resources
Birkett et al. Towards objective and reproducible study of patient-doctor interaction: automatic text analysis based VR-CoDES annotation of consultation transcripts
Gupta et al. Frequent item-set mining and clustering based ranked biomedical text summarization
Trabelsi et al. A hybrid deep model for learning to rank data tables
Zhou et al. Converting semi-structured clinical medical records into information and knowledge
CN113343680A (en) Structured information extraction method based on multi-type case history texts
Nikiforovskaya et al. Automatic generation of reviews of scientific papers
Soni et al. Patient cohort retrieval using transformer language models
Wolyn et al. Summarization assessment methodology for multiple corpora using queries and classification for functional evaluation
Diaz et al. Towards automatic generation of context-based abstractive discharge summaries for supporting transition of care
Sheng et al. PubMed Author-assigned Keyword Extraction (PubMedAKE) Benchmark
Soualmia et al. Matching health information seekers' queries to medical terms
Silachan et al. Domain ontology health informatics service from text medical data classification
El Kah et al. A review on applied natural language processing to electronic health records
Safari et al. An enhancement on Clinical Data Analytics Language (CliniDAL) by integration of free text concept search
Saba et al. Question-Answering Based Summarization of Electronic Health Records using Retrieval Augmented Generation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant