CN111292818A - Query reconstruction method for electronic medical record description - Google Patents

Query reconstruction method for electronic medical record description

Info

Publication number
CN111292818A
Authority
CN
China
Prior art keywords
query
sub
text
medical record
electronic medical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010051309.9A
Other languages
Chinese (zh)
Other versions
CN111292818B (en)
Inventor
方钰
姚窅
陆明名
黄欣
翟鹏珺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Priority to CN202010051309.9A priority Critical patent/CN111292818B/en
Publication of CN111292818A publication Critical patent/CN111292818A/en
Application granted granted Critical
Publication of CN111292818B publication Critical patent/CN111292818B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Public Health (AREA)
  • Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Epidemiology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Primary Health Care (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Pathology (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A query reconstruction method for electronic medical record descriptions, that is, for the long free-text queries that arise in clinical decision support. In information retrieval, query reconstruction is the process of automatically reformulating the information entered by a user into a new query expression; its goal is to uncover the user's real query intention within large volumes of complex data and thereby improve retrieval effectiveness. In the medical field, medical literature retrieval is an important application of clinical decision support, in which electronic medical record text is used as the query input to obtain the required information from massive collections of medical text. Because electronic medical record descriptions are highly complex, query reconstruction is needed to achieve an effective retrieval result. To this end, the method targets electronic medical record description long texts containing redundant information: using sub-query segmentation and screening together with query quality prediction, it selects the sub-query with the highest predicted quality to replace the original query, predicts the query intention, and reconstructs the query, thereby improving retrieval effectiveness.

Description

Query reconstruction method for electronic medical record description
Technical Field
The invention relates to the field of text retrieval, and in particular to the processing of queries in text retrieval.
Background
Information retrieval is the process of finding the information a user needs in unstructured, large-scale data, and is an effective way to obtain key information from massive data. As early as the last century, medical experts considered using data and models to assist clinical decisions, and the clinical decision support system was proposed for this purpose. A clinical decision support system is a medical information technology application; one of its uses of information retrieval technology is predictive diagnosis, that is, assisting decisions by using the patient description in a medical record as a query to retrieve relevant medical literature. In this way, a clinical decision support system can effectively mine the deep data held in medical practice, improve the efficiency of medical services, and accelerate the informatization of healthcare.
With the development of the medical and health industry and the progress of science and technology, the level of technology and informatization in the medical industry is continuously improving. The diagnosis decision support system, a very active branch of clinical decision support systems, has long been a hot topic of research and application at home and abroad. A diagnosis decision support system is a computer application that supports doctors during diagnostic decision making and can provide clinicians with extensive medical support, helping them reach the most reasonable diagnosis and select the best treatment. A large body of research shows that diagnosis decision support systems can effectively address the limits of an individual clinician's knowledge during disease diagnosis, reduce human error, relatively reduce medical costs, and help guarantee the quality of care.
To better study medical text retrieval technology, the Text REtrieval Conference (TREC) introduced the Clinical Decision Support (CDS) task in 2014. The task provides an electronic medical record description as the input query, and participants search an existing collection of medical literature to return the documents most relevant to the query, thereby giving doctors the information they actually need to assess the patient. The electronic medical records used as queries are stored as free text; the medical record data come from the ICU clinical database MIMIC-III, and the target document collection used as the search library comes from PubMed Central, the U.S. full-text database of biomedical and life sciences literature.
Judging from the existing research results on the TREC CDS task, most document retrieval work focuses on processing the original query statement, with keyword-based query expansion as the main method, and work on processing the long text of an electronic medical record as a query is very rare. Electronic medical record text is highly ambiguous and contains a great deal of redundancy and unclear semantics, which undoubtedly makes its processing difficult in the medical literature retrieval task.
Disclosure of Invention
Judging from the existing research results on the TREC CDS task, most document retrieval work focuses on processing the original query statement, with keyword-based query expansion as the main method, and work on processing the long text of an electronic medical record as a query is very rare. Common queries are several to a dozen query terms long, whereas an electronic medical record description in clinical decision support averages 50-200 query terms, and most commercial and academic search engines do not handle such long queries well, which leaves the user with the task of reducing the original query to a shorter one. On the other hand, a definite query intention allows targeted retrieval and improves retrieval effectiveness.
To address these problems, the invention reconstructs the query statement: an SVM classifier is used to obtain the query intention of the query statement, sub-queries of the electronic medical record are generated and screened, a query quality prediction model is trained to select the optimal sub-query among them, and the optimal sub-query is combined with the query intention to generate the reconstructed query statement.
To achieve this purpose, the technical solution provided by the invention is as follows:
The invention provides a query reconstruction method for electronic medical record descriptions, comprising the following steps:
step 1, preprocessing an electronic medical record text and a medical literature text in a data set;
step 2, training an SVM classifier to predict the query intention of the electronic medical record text;
step 3, acquiring all sub-queries of the electronic medical record text and performing preliminary pre-screening on the sub-queries;
step 4, training a query quality prediction model and selecting the optimal sub-query from the pre-screened sub-queries output by step 3;
step 5, combining the query intention obtained in step 2 with the optimal sub-query output by step 4 to obtain the final reconstructed query.
Advantageous effects
The invention performs query reconstruction on electronic medical record description long texts containing redundant information, including prediction of the query intention and reduction of the original query based on query quality prediction. On the one hand, the invention analyzes the semantics of the original query and trains a classifier to determine the query intention. On the other hand, the invention analyzes the set of all sub-queries of the original query, represents each sub-query by a group of query quality indices, and proposes for the first time an index reflecting query expansion performance; on this basis, a query quality prediction model is trained to obtain the sub-query with the highest query quality, which replaces the original query.
A query reconstruction experiment performed on the TREC CDS dataset shows a marked improvement in performance, demonstrating that query reconstruction improves the retrieval results. The query reconstruction method for electronic medical record long texts is therefore of great significance for addressing the limits of a clinician's knowledge during disease diagnosis, reducing human error, relatively reducing medical costs, and helping guarantee the quality of care.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic flow chart of the query reconstruction method;
FIG. 2 is an example of a TREC query topic;
FIG. 3 shows the result of processing an electronic medical record text with the query reconstruction method.
Detailed Description
The specific implementation process of the invention is shown in fig. 1, and comprises the following 5 steps:
step 1, preprocessing an electronic medical record text and a medical literature text in a data set;
step 2, training an SVM classifier to predict the query intention of the electronic medical record text;
step 3, acquiring all sub-queries of the electronic medical record text and performing preliminary pre-screening on the sub-queries;
step 4, training a query quality prediction model and selecting the optimal sub-query from the pre-screened sub-queries output by step 3;
step 5, combining the query intention obtained in step 2 with the optimal sub-query output by step 4 to obtain the final reconstructed query.
Each step is described in detail below.
Step 1: preprocessing electronic medical record text and medical literature text in a data set
As an example, the dataset used is the TREC CDS Track dataset, which includes electronic medical record texts and a medical literature collection. The electronic medical record texts come from 90 query topics (Topics) defined by the 2014-2016 TREC conferences, with the medical record data derived from the ICU clinical database MIMIC-III; the description field of each Topic is selected as the original query. FIG. 2 shows an example of a query topic (Topic). The medical literature, on the other hand, is a collection of over 730,000 medically relevant documents from PubMed Central, the U.S. full-text database of biomedical and life sciences literature.
In step 1, the electronic medical record texts and the medical literature texts need to be preprocessed, specifically through the following steps:
1.1, extracting plain text
The electronic medical record text is already plain text, so this step is not required for it. The medical literature texts, however, are web page files stored in XML format: the useless CSS and JS code must be removed, and the required plain text, including the title, abstract, keywords and body of each document, is extracted according to the XML tags, so that the preprocessed literature texts have a uniform format.
The plain text of the electronic medical records and the extracted plain text of the medical literature are provided to step 1.2;
1.2, removing stop words
Stop words, which include words carrying no semantic information and words with a very high frequency of use, are removed from the plain text using a prepared stop-word vocabulary.
The result after stop-word removal is provided to step 1.3;
1.3, restoring word forms
Inflected word forms are merged and restored to their root form: English words with the same meaning appear with different tense and form changes, and these are restored to a common base form.
After word-form restoration, the text preprocessing of step 1 is complete; the preprocessed electronic medical record texts are provided to steps 2 and 3, and the preprocessed medical literature texts are provided to step 4.
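For illustration only, the preprocessing of step 1 could be sketched in Python roughly as follows; the XML tag names, the NLTK stop-word list and the WordNet lemmatizer are assumptions standing in for the vocabulary and word-form restoration described above, not part of the method itself.

```python
# Illustrative sketch of step 1 (preprocessing). The XML tag names and the use of
# NLTK's stop-word list and WordNet lemmatizer are assumptions, not part of the method.
import re
from xml.etree import ElementTree

from nltk.corpus import stopwords          # requires the NLTK 'stopwords' corpus
from nltk.stem import WordNetLemmatizer    # requires the NLTK 'wordnet' corpus

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def extract_plain_text(xml_path: str) -> str:
    """Pull title, abstract, keywords and body text out of one literature XML file."""
    root = ElementTree.parse(xml_path).getroot()
    parts = []
    for tag in ("article-title", "abstract", "kwd", "body"):   # assumed tag names
        for node in root.iter(tag):
            parts.append(" ".join(node.itertext()))
    return " ".join(parts)

def preprocess(text: str) -> list[str]:
    """Tokenize, drop stop words, and restore words to their base forms."""
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMATIZER.lemmatize(t) for t in tokens]
```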
Step 2: method for predicting query intention of electronic medical record text by training SVM (support vector machine) three classifiers
And 2, training an SVM classifier to judge the query intention by using the preprocessed electronic medical record text obtained in the step 1 as a training set, and specifically comprising the following steps.
2.1, labeling three classification labels for each electronic medical record text in the training set: if the text content of the electronic medical record belongs to Diagnosis (Diagnosis), marking as 1; if the text content of the electronic medical record belongs to a Treatment scheme (Treatment), the text content is marked as 2; if the text content of the electronic medical record belongs to the diagnosis and detection means (Test), the label is 3. The annotated results are provided to step 2.2.
2.2, training a three-classifier.
The training of the three classifiers uses the existing SVM algorithm, and the features of the electronic medical record text and the three classification labels labeled in the step 2.1 are required to be input during the training. The training of the classifier requires the use of two features of the electronic medical record text: (1) a TF-IDF value; (2) and (4) semantic information.
(1) TF-IDF is a statistical method for assessing how important a word is to one document in a corpus. Term Frequency (TF) is the frequency with which a given word appears in the document. Inverse Document Frequency (IDF) is obtained by dividing the total number of documents by the number of documents containing the word and taking the base-10 logarithm of the quotient. The TF-IDF value is the product of these two values:
TF-IDF(ω) = (n_ω / N) · log10(N_d / N_ω)
where n_ω is the number of occurrences of the word ω in the document, N is the total number of words in the document, N_d is the total number of documents in the corpus, and N_ω is the number of documents in the corpus that contain the word ω.
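As a worked illustration (not part of the original text), the TF-IDF value defined above could be computed as follows; the function name and the example numbers are hypothetical.

```python
# Worked illustration of the TF-IDF formula above; names and numbers are hypothetical.
import math

def tf_idf(word_count_in_doc: int, doc_length: int,
           total_docs: int, docs_containing_word: int) -> float:
    """TF-IDF as defined above: (n_w / N) * log10(N_d / N_w)."""
    tf = word_count_in_doc / doc_length
    idf = math.log10(total_docs / docs_containing_word)
    return tf * idf

# e.g. a word occurring 5 times in a 200-word record, present in 30 of 730,000 documents
score = tf_idf(5, 200, 730_000, 30)   # ~0.025 * 4.39, about 0.11
```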
(2) The semantic information consists of three items: whether a diagnostic result is included (value 0/1), whether the examinations are complete (value 0/1), and the query text length (value 0-200).
The trained three-class classifier is provided to step 2.3.
2.3, the electronic medical record text is input into the trained three-class classifier, and the classification result (i.e. the query intention) is provided to step 5.
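A minimal sketch of step 2, assuming scikit-learn's SVC as the "existing SVM algorithm" and hand-built feature vectors that concatenate TF-IDF features with the three semantic features; the toy data, helper names and the choice of a linear kernel are all illustrative.

```python
# Sketch of step 2: query-intention classification with an SVM (scikit-learn assumed).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

def build_features(texts, has_diagnosis, exams_complete, vectorizer=None):
    """Concatenate TF-IDF features with the three semantic features."""
    if vectorizer is None:
        vectorizer = TfidfVectorizer()
        tfidf = vectorizer.fit_transform(texts).toarray()
    else:
        tfidf = vectorizer.transform(texts).toarray()
    lengths = np.array([[min(len(t.split()), 200)] for t in texts])
    semantic = np.hstack([np.array(has_diagnosis)[:, None],
                          np.array(exams_complete)[:, None], lengths])
    return np.hstack([tfidf, semantic]), vectorizer

# toy training set; labels: 1 = Diagnosis, 2 = Treatment, 3 = Test
train_texts = ["patient presents with fever and cough of unknown cause",
               "started intravenous antibiotics for community acquired pneumonia",
               "ordered chest x ray and complete blood count"]
train_labels = [1, 2, 3]
X_train, vec = build_features(train_texts, has_diagnosis=[0, 1, 0], exams_complete=[0, 1, 1])
clf = SVC(kernel="linear").fit(X_train, train_labels)

X_new, _ = build_features(["elderly man with chest pain and shortness of breath"],
                          has_diagnosis=[0], exams_complete=[0], vectorizer=vec)
intent = clf.predict(X_new)   # predicted query intention, passed on to step 5
```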
Step 3: acquiring all sub-queries of the electronic medical record text and performing preliminary pre-screening
In theory, a query containing n query terms yields an exponential number (2^n - 1) of sub-queries. For example, the query statement "fever cough headache", which contains 3 query terms, can be split into the sub-queries "fever", "cough", "headache", "fever cough", "fever headache", "cough headache" and "fever cough headache". It is impractical to exhaust and rank all possible sub-queries, so the sub-queries must first be pre-screened. Sub-query pre-screening comprises the following steps.
3.1, from all sub-queries of the electronic medical record text, the sub-queries of length 3-10 are selected. The sub-query length is the number of words in the sub-query. Research shows that the query length giving the best retrieval effectiveness is between 3 and 6; considering the long electronic medical record texts targeted by the invention, the maximum length threshold is set to 10. The result is provided to step 3.2.
3.2, the average mutual information of each sub-query obtained in step 3.1 is calculated, and the 30 sub-queries with the highest average mutual information are selected. The mutual information of a pair of words is calculated as follows:
MI(x, y) = log( (n(x, y) · N_c) / (n(x) · n(y)) )
where n(x, y) is the frequency with which the words x and y co-occur within a window of 25 words in a document, counted over the whole corpus, and n(x) and n(y) are the frequencies of the words x and y in the corpus, respectively. N_c is the number of words in the entire corpus. The mutual information of every pair of words in a sub-query is calculated, and the weighted average of these values is taken as the average mutual information of the sub-query.
Step 3 finally yields 30 pre-screened sub-queries, and the result is provided to step 4.
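For illustration, the sub-query generation and pre-screening of step 3 might look roughly like the sketch below; the corpus statistics (pair counts, term counts, corpus size) are assumed to be precomputed, the pairwise formula follows the reconstruction above, and for records of 50-200 terms the exhaustive enumeration shown here would need to be bounded in practice.

```python
# Sketch of step 3: enumerate sub-queries of length 3-10 and keep the 30 with the
# highest average mutual information. Corpus statistics are assumed precomputed.
import math
from itertools import combinations

def sub_queries(terms, min_len=3, max_len=10):
    """All term subsets whose length lies in [min_len, max_len]."""
    for k in range(min_len, min(max_len, len(terms)) + 1):
        yield from combinations(terms, k)

def avg_mutual_information(sub_query, pair_count, term_count, corpus_size):
    """Average pointwise mutual information over all word pairs in the sub-query."""
    scores = []
    for x, y in combinations(sub_query, 2):
        n_xy = pair_count.get((x, y), 0) + pair_count.get((y, x), 0)
        if n_xy == 0 or term_count.get(x, 0) == 0 or term_count.get(y, 0) == 0:
            scores.append(0.0)
        else:
            scores.append(math.log((n_xy * corpus_size) / (term_count[x] * term_count[y])))
    return sum(scores) / len(scores) if scores else 0.0

def prescreen(terms, pair_count, term_count, corpus_size, keep=30):
    """Rank candidate sub-queries by average mutual information and keep the top ones."""
    scored = sorted(((avg_mutual_information(q, pair_count, term_count, corpus_size), q)
                     for q in sub_queries(terms)), reverse=True)
    return [q for _, q in scored[:keep]]
```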
Step 4: training a query quality prediction model and selecting the optimal sub-query from the pre-screened sub-queries
4.1 labeling sub-queries with query quality scores
One round of retrieval is performed for each pre-screened sub-query obtained in step 3; the target document collection is the medical literature text collection preprocessed in step 1. The search engine used is Indri 5.11 from the Lemur open-source project. The retrieval result is compared against the relevance judgments provided by the TREC conference, the average precision of the retrieval is calculated, and this value is recorded as the query quality score of the sub-query. The sub-queries labeled with query quality scores are provided to step 4.2 as the result of this step.
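As an illustration of step 4.1, once a sub-query has been run through the retrieval engine (the Indri run itself is assumed to happen externally), its average precision label can be computed from the ranked result list and the TREC relevance judgments roughly as follows; the document identifiers are made up.

```python
# Sketch of step 4.1: average precision of one ranked result list against TREC judgments.
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Mean of precision@k taken at the rank of each relevant document retrieved."""
    relevant = set(relevant_doc_ids)
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

# quality label used to train the prediction model in step 4.2
label = average_precision(["d7", "d2", "d9", "d4"], {"d2", "d4", "d5"})   # = (1/2 + 2/4) / 3
```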
4.2 training query quality prediction model
The query quality prediction model is trained with the existing SVMRank algorithm; the input consists of indices that characterize sub-query quality together with the query quality scores labeled in step 4.1.
Model training uses the following indices, calculated for each sub-query in the training set: (1) an inverse document frequency (IDF) related index; (2) a simplified query clarity index; (3) a corpus/query similarity (SCQ) index; (4) a query expandability index.
Before introducing these indices, the symbols used in this step are defined. For a query Q, assume that it contains the query terms ω_1, …, ω_n. In the corpus C, n(ω_i) denotes the frequency of the query term ω_i in the corpus; n(ω_i, ω_j) denotes the frequency with which the query terms ω_i and ω_j (i ≠ j) co-occur within a window of 25 words; N_c denotes the total number of words in the corpus; N_ω denotes the number of documents in which the query word ω appears; and N_d denotes the number of all documents in the corpus. P_c(ω) denotes the probability of occurrence of the query word ω in the corpus, P(ω|Q) denotes the probability of occurrence of ω in the query statement Q, and S_ω denotes the set of synonyms of the word ω.
(1) The inverse document frequency correlation index calculation formula is as follows:
IDF(ω) = log(N_d / N_ω)
where N_ω is the number of documents containing the word ω and N_d is the total number of documents in the corpus. For each sub-query, the sum, maximum, standard deviation, arithmetic mean, geometric mean and harmonic mean of the IDF values of its query terms are all used as query quality indicators.
(2) The simplified query clarity index is calculated as follows:
SCS(Q) = Σ_{ω∈Q} P_ml(ω|Q) · log( P_ml(ω|Q) / P_c(ω) )
where P_ml(ω|Q) is the frequency of occurrence of the word ω in the query Q, and P_c(ω) is the frequency with which the word ω appears in the corpus.
(3) The corpus/query similarity characteristic index calculation formula is as follows:
SCQ(ω) = (1 + ln n(ω)) · ln(1 + N_d / N_ω)
As with the inverse document frequency index, the sum, maximum, standard deviation, arithmetic mean, geometric mean and harmonic mean of the SCQ values of the query terms are all used as query quality indicators.
(4) Query expandability index
The invention proposes for the first time an index reflecting query expansion performance, namely the query expandability index. It is calculated as follows:
QE(Q) = Σ_{ω∈Q} Σ_{α∈S_ω} P(α|Q)
where S_ω is the synonym set of the query term ω, and P(α|Q) is the probability of occurrence of the term α in the query model.
The trained query quality prediction model is provided to step 4.3.
4.3, for each pre-screened sub-query obtained in step 3, the four kinds of quality indices of step 4.2 are calculated and input into the query quality prediction model trained in step 4.2 to obtain the query quality score of the sub-query. The sub-query with the highest query quality score among the 30 sub-queries is selected as the optimal sub-query, and the result is provided to step 5.
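Putting steps 4.2 and 4.3 together, the sketch below computes one feature vector per sub-query from the four index families defined above and scores the candidates with a previously trained ranking model; the formulas follow the reconstructions given above (an interpretation of the original image formulas), the SVMRank training itself is assumed to be done with an external tool, and "ranker" is a hypothetical object exposing a score() method (for a linear SVMRank model this would be a dot product with the learned weights).

```python
# Sketch of steps 4.2-4.3: per-sub-query quality features and best-sub-query selection.
# The formulas are an interpretation of the original image formulas; `stats` holds
# assumed precomputed corpus data and `ranker` is a hypothetical trained model.
import math
import statistics

def distribution_stats(values):
    """Sum, max, std, arithmetic / geometric / harmonic means (values floored at a
    tiny positive number so the geometric and harmonic means are defined)."""
    v = [max(x, 1e-9) for x in values]
    return [sum(v), max(v), statistics.pstdev(v), statistics.mean(v),
            statistics.geometric_mean(v), statistics.harmonic_mean(v)]

def quality_features(sub_query, stats):
    """Feature vector: IDF statistics, simplified clarity, SCQ statistics, expandability."""
    idfs = [math.log(stats["total_docs"] / max(stats["docs_with_term"].get(w, 1), 1))
            for w in sub_query]
    p_ml = 1.0 / len(sub_query)                       # P_ml(w|Q) for distinct terms
    scs = sum(p_ml * math.log(p_ml / max(stats["p_corpus"].get(w, 1e-9), 1e-9))
              for w in sub_query)
    scqs = [(1 + math.log(max(stats["term_count"].get(w, 1), 1))) *
            math.log(1 + stats["total_docs"] / max(stats["docs_with_term"].get(w, 1), 1))
            for w in sub_query]
    expand = sum(stats["p_query_model"].get(a, 0.0)
                 for w in sub_query for a in stats["synonyms"].get(w, []))
    return distribution_stats(idfs) + [scs] + distribution_stats(scqs) + [expand]

def best_sub_query(candidates, stats, ranker):
    """Score every pre-screened sub-query and keep the one predicted to be best."""
    return max(candidates, key=lambda q: ranker.score(quality_features(q, stats)))
```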
Step 5: obtaining the final reconstructed query by combining the query intention and the optimal sub-query
The query intention obtained in step 2 and the optimal sub-query obtained in step 4 are combined to obtain the reconstructed query, which serves as the final result.

Claims (1)

1. A query reconstruction method for electronic medical record description, characterized by comprising:
Step 1, preprocessing an electronic medical record text and a medical literature text in a data set;
step 2, training an SVM classifier to predict the query intention of the electronic medical record text;
step 3, acquiring all sub-queries of the electronic medical record text and performing preliminary pre-screening on the sub-queries;
step 4, training a query quality prediction model and selecting the optimal sub-query from the pre-screened sub-queries output by step 3;
step 5, combining the query intention obtained in step 2 with the optimal sub-query output by step 4 to obtain the final reconstructed query;
wherein
Step 1: preprocessing electronic medical record text and medical literature text in a data set
1.1, extracting plain text
The electronic medical record text is already plain text, so this step is not required for it. The medical literature texts, however, are web page files stored in XML format: the useless CSS and JS code must be removed, and the required plain text, including the title, abstract, keywords and body of each document, is extracted according to the XML tags, so that the preprocessed literature texts have a uniform format.
The plain text of the electronic medical records and the extracted plain text of the medical literature are provided to step 1.2;
1.2, removing stop words
Stop words, which include words carrying no semantic information and words with a very high frequency of use, are removed from the plain text using a prepared stop-word vocabulary.
The result after stop-word removal is provided to step 1.3;
1.3, restoring word forms
Inflected word forms are merged and restored to their root form: English words with the same meaning appear with different tense and form changes, and these are restored to a common base form.
After word-form restoration, the text preprocessing of step 1 is complete; the preprocessed electronic medical record texts are provided to steps 2 and 3, and the preprocessed medical literature texts are provided to step 4.
Step 2: method for predicting query intention of electronic medical record text by training SVM (support vector machine) three classifiers
And 2, training an SVM classifier to judge the query intention by using the preprocessed electronic medical record text obtained in the step 1 as a training set, and specifically comprising the following steps.
2.1, labeling three classification labels for each electronic medical record text in the training set: if the text content of the electronic medical record belongs to Diagnosis (Diagnosis), marking as 1; if the text content of the electronic medical record belongs to a Treatment scheme (Treatment), the text content is marked as 2; if the text content of the electronic medical record belongs to the diagnosis and detection means (Test), the label is 3. The annotated results are provided to step 2.2.
2.2, training a three-classifier.
The training of the three classifiers uses the existing SVM algorithm, and the features of the electronic medical record text and the three classification labels labeled in the step 2.1 are required to be input during the training. The training of the classifier requires the use of two features of the electronic medical record text: (1) a TF-IDF value; (2) and (4) semantic information.
(1) TF-IDF is a statistical method for assessing how important a word is to one document in a corpus. Term Frequency (TF) is the frequency with which a given word appears in the document. Inverse Document Frequency (IDF) is obtained by dividing the total number of documents by the number of documents containing the word and taking the base-10 logarithm of the quotient. The TF-IDF value is the product of these two values:
TF-IDF(ω) = (n_ω / N) · log10(N_d / N_ω)
where n_ω is the number of occurrences of the word ω in the document, N is the total number of words in the document, N_d is the total number of documents in the corpus, and N_ω is the number of documents in the corpus that contain the word ω.
(2) The semantic information consists of three items: whether a diagnostic result is included (value 0/1), whether the examinations are complete (value 0/1), and the query text length (value 0-200).
The trained three-class classifier is provided to step 2.3.
2.3, the electronic medical record text is input into the trained three-class classifier, and the classification result (i.e. the query intention) is provided to step 5.
Step 3: acquiring all sub-queries of the electronic medical record text and performing preliminary pre-screening
In theory, a query containing n query terms yields an exponential number (2^n - 1) of sub-queries. For example, the query statement "fever cough headache", which contains 3 query terms, can be split into the sub-queries "fever", "cough", "headache", "fever cough", "fever headache", "cough headache" and "fever cough headache". It is impractical to exhaust and rank all possible sub-queries, so the sub-queries must first be pre-screened. Sub-query pre-screening comprises the following steps.
3.1, from all sub-queries of the electronic medical record text, the sub-queries of length 3-10 are selected. The sub-query length is the number of words in the sub-query. Research shows that the query length giving the best retrieval effectiveness is between 3 and 6; considering the long electronic medical record texts targeted by the invention, the maximum length threshold is set to 10. The result is provided to step 3.2.
3.2, the average mutual information of each sub-query obtained in step 3.1 is calculated, and the 30 sub-queries with the highest average mutual information are selected. The mutual information of a pair of words is calculated as follows:
MI(x, y) = log( (n(x, y) · N_c) / (n(x) · n(y)) )
where n(x, y) is the frequency with which the words x and y co-occur within a window of 25 words in a document, counted over the whole corpus, and n(x) and n(y) are the frequencies of the words x and y in the corpus, respectively. N_c is the number of words in the entire corpus. The mutual information of every pair of words in a sub-query is calculated, and the weighted average of these values is taken as the average mutual information of the sub-query.
Step 3 finally yields 30 pre-screened sub-queries, and the result is provided to step 4.
Step 4: training a query quality prediction model and selecting the optimal sub-query from the pre-screened sub-queries
4.1 labeling sub-queries with query quality scores
One round of retrieval is performed for each pre-screened sub-query obtained in step 3; the target document collection is the medical literature text collection preprocessed in step 1. The search engine used is Indri 5.11 from the Lemur open-source project. The retrieval result is compared against the relevance judgments provided by the TREC conference, the average precision of the retrieval is calculated, and this value is recorded as the query quality score of the sub-query. The sub-queries labeled with query quality scores are provided to step 4.2 as the result of this step.
4.2 training query quality prediction model
The query quality prediction model is trained with the existing SVMRank algorithm; the input consists of indices that characterize sub-query quality together with the query quality scores labeled in step 4.1.
Model training uses the following indices, calculated for each sub-query in the training set: (1) an inverse document frequency (IDF) related index; (2) a simplified query clarity index; (3) a corpus/query similarity (SCQ) index; (4) a query expandability index.
Before introducing these indices, the symbols used in this step are defined. For a query Q, assume that it contains the query terms ω_1, …, ω_n. In the corpus C, n(ω_i) denotes the frequency of the query term ω_i in the corpus; n(ω_i, ω_j) denotes the frequency with which the query terms ω_i and ω_j (i ≠ j) co-occur within a window of 25 words; N_c denotes the total number of words in the corpus; N_ω denotes the number of documents in which the query word ω appears; and N_d denotes the number of all documents in the corpus. P_c(ω) denotes the probability of occurrence of the query word ω in the corpus, P(ω|Q) denotes the probability of occurrence of ω in the query statement Q, and S_ω denotes the set of synonyms of the word ω.
(1) The inverse document frequency correlation index calculation formula is as follows:
IDF(ω) = log(N_d / N_ω)
where N_ω is the number of documents containing the word ω and N_d is the total number of documents in the corpus. For each sub-query, the sum, maximum, standard deviation, arithmetic mean, geometric mean and harmonic mean of the IDF values of its query terms are all used as query quality indicators.
(2) The simplified query clarity index is calculated as follows:
SCS(Q) = Σ_{ω∈Q} P_ml(ω|Q) · log( P_ml(ω|Q) / P_c(ω) )
where P_ml(ω|Q) is the frequency of occurrence of the word ω in the query Q, and P_c(ω) is the frequency with which the word ω appears in the corpus.
(3) The corpus/query similarity characteristic index calculation formula is as follows:
SCQ(ω) = (1 + ln n(ω)) · ln(1 + N_d / N_ω)
As with the inverse document frequency index, the sum, maximum, standard deviation, arithmetic mean, geometric mean and harmonic mean of the SCQ values of the query terms are all used as query quality indicators.
(4) Query expandability index
The invention proposes for the first time an index reflecting query expansion performance, namely the query expandability index. It is calculated as follows:
QE(Q) = Σ_{ω∈Q} Σ_{α∈S_ω} P(α|Q)
where S_ω is the synonym set of the query term ω, and P(α|Q) is the probability of occurrence of the term α in the query model.
The trained query quality prediction model is provided to step 4.3.
4.3, for each pre-screened sub-query obtained in step 3, the four kinds of quality indices of step 4.2 are calculated and input into the query quality prediction model trained in step 4.2 to obtain the query quality score of the sub-query. The sub-query with the highest query quality score among the 30 sub-queries is selected as the optimal sub-query, and the result is provided to step 5.
Step 5: obtaining the final reconstructed query by combining the query intention and the optimal sub-query
The query intention obtained in step 2 and the optimal sub-query obtained in step 4 are combined to obtain the reconstructed query, which serves as the final result.
CN202010051309.9A 2020-01-17 2020-01-17 Query reconstruction method for electronic medical record description Active CN111292818B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010051309.9A CN111292818B (en) 2020-01-17 2020-01-17 Query reconstruction method for electronic medical record description

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010051309.9A CN111292818B (en) 2020-01-17 2020-01-17 Query reconstruction method for electronic medical record description

Publications (2)

Publication Number Publication Date
CN111292818A true CN111292818A (en) 2020-06-16
CN111292818B CN111292818B (en) 2022-04-19

Family

ID=71030814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010051309.9A Active CN111292818B (en) 2020-01-17 2020-01-17 Query reconstruction method for electronic medical record description

Country Status (1)

Country Link
CN (1) CN111292818B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022033073A1 (en) * 2020-08-12 2022-02-17 哈尔滨工业大学 Cognitive service-oriented user intention recognition method and system
CN115376643A (en) * 2022-10-26 2022-11-22 神州医疗科技股份有限公司 Case custom retrieval method and device, electronic equipment and computer readable medium
CN117789907A (en) * 2024-02-28 2024-03-29 山东金卫软件技术有限公司 Intelligent medical data intelligent management method based on multi-source data fusion
CN118016314A (en) * 2024-04-08 2024-05-10 北京大学第三医院(北京大学第三临床医学院) Medical data input optimization method and device and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108597614A (en) * 2018-04-12 2018-09-28 上海熙业信息科技有限公司 A kind of auxiliary diagnosis decision-making technique based on Chinese electronic health record
CN110364234A (en) * 2019-06-26 2019-10-22 浙江大学 Electronic health record intelligent storage analyzing search system and method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108597614A (en) * 2018-04-12 2018-09-28 上海熙业信息科技有限公司 A kind of auxiliary diagnosis decision-making technique based on Chinese electronic health record
CN110364234A (en) * 2019-06-26 2019-10-22 浙江大学 Electronic health record intelligent storage analyzing search system and method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
EDWARD W HUANG: "Framing Electronic Medical Records as Polylingual Documents in Query Expansion", AMIA Annu Symp Proc. *
WANG Wenbin: "Query Reconstruction Based on Term Weight Adjustment in Electronic Medical Record Retrieval", Computer Applications and Software *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022033073A1 (en) * 2020-08-12 2022-02-17 哈尔滨工业大学 Cognitive service-oriented user intention recognition method and system
CN115376643A (en) * 2022-10-26 2022-11-22 神州医疗科技股份有限公司 Case custom retrieval method and device, electronic equipment and computer readable medium
CN117789907A (en) * 2024-02-28 2024-03-29 山东金卫软件技术有限公司 Intelligent medical data intelligent management method based on multi-source data fusion
CN117789907B (en) * 2024-02-28 2024-05-10 山东金卫软件技术有限公司 Intelligent medical data intelligent management method based on multi-source data fusion
CN118016314A (en) * 2024-04-08 2024-05-10 北京大学第三医院(北京大学第三临床医学院) Medical data input optimization method and device and electronic equipment

Also Published As

Publication number Publication date
CN111292818B (en) 2022-04-19

Similar Documents

Publication Publication Date Title
CN111292818B (en) Query reconstruction method for electronic medical record description
CN109299239B (en) ES-based electronic medical record retrieval method
Nadkarni et al. UMLS concept indexing for production databases: a feasibility study
CN103838833A (en) Full-text retrieval system based on semantic analysis of relevant words
US20130060793A1 (en) Extracting information from medical documents
Zhu et al. Using Discharge Summaries to Improve Information Retrieval in Clinical Domain.
CN115983233B (en) Electronic medical record duplicate checking rate estimation method based on data stream matching
Vo et al. Self-training on refined clause patterns for relation extraction
Peng et al. A self-attention based deep learning method for lesion attribute detection from CT reports
Plaza et al. Studying the correlation between different word sense disambiguation methods and summarization effectiveness in biomedical texts
Rizzo et al. ICD code retrieval: Novel approach for assisted disease classification
Xu et al. Learning to refine expansion terms for biomedical information retrieval using semantic resources
Gupta et al. Frequent item-set mining and clustering based ranked biomedical text summarization
Nikiforovskaya et al. Automatic generation of reviews of scientific papers
Trabelsi et al. A hybrid deep model for learning to rank data tables
Zhou et al. Converting semi-structured clinical medical records into information and knowledge
Soni et al. Patient cohort retrieval using transformer language models
Wolyn et al. Summarization assessment methodology for multiple corpora using queries and classification for functional evaluation
Saba et al. Question-Answering Based Summarization of Electronic Health Records using Retrieval Augmented Generation
Diaz et al. Towards automatic generation of context-based abstractive discharge summaries for supporting transition of care
El Kah et al. A review on applied natural language processing to electronic health records
Sheng et al. PubMed Author-assigned Keyword Extraction (PubMedAKE) Benchmark
Soualmia et al. Matching health information seekers' queries to medical terms
Silachan et al. Domain ontology health informatics service from text medical data classification
Safari et al. An enhancement on Clinical Data Analytics Language (CliniDAL) by integration of free text concept search

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant