CN111292818A

CN111292818A - Query reconstruction method for electronic medical record description

Info

Publication number: CN111292818A
Application number: CN202010051309.9A
Authority: CN
Inventors: 方钰; 姚窅; 陆明名; 黄欣; 翟鹏珺
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2020-01-17
Filing date: 2020-01-17
Publication date: 2020-06-16
Anticipated expiration: 2040-01-17
Also published as: CN111292818B

Abstract

A query reconstruction method for electronic medical record descriptions, applied to long electronic medical record text in clinical decision support. Query reconstruction in information retrieval refers to automatically processing the information entered by the user to form a new query expression. In the medical field, medical literature retrieval is an important application of clinical decision support: it uses electronic medical record text as the query input to obtain the needed information from a massive body of medical text. However, electronic medical record descriptions are very complex, and query reconstruction is necessary to obtain effective retrieval results. To this end, the present invention targets long electronic medical record descriptions that contain redundant information. It uses sub-query segmentation and screening together with query quality prediction to select the sub-query with the highest query quality as a replacement for the original query, and it predicts the query intent in order to reconstruct the query, thereby improving retrieval efficiency.

Description

A Query Reconstruction Method for Electronic Medical Record Description

Technical field

The present invention relates to the field of text retrieval, and in particular to the processing of queries in text retrieval.

Background

Information retrieval is the process of finding the information a user needs in large-scale unstructured data, and it is an effective way to obtain key information from massive data. As early as the last century, medical experts considered using data and models to assist clinical decision-making, which led to the clinical decision support system. A clinical decision support system is a medical information technology application that, among other things, uses information retrieval to predict diagnoses: the patient description in a medical record is used as a query to find relevant medical literature that assists decision-making. In this way, a clinical decision support system can effectively mine deep medical data, improve the efficiency of medical services, and accelerate medical informatization.

With the development of medical and health services and the advancement of science and technology, the level of technology and informatization in the medical industry keeps rising. The diagnosis decision support system, a very active application branch of clinical decision support systems, has long been a hotspot of research and application at home and abroad. A diagnosis decision support system is a computer application that provides auxiliary support to doctors during diagnostic decision-making. It can give clinicians extensive medical support, helping them reach the most reasonable diagnosis and choose the best treatment. Numerous studies have shown that diagnosis decision support systems can effectively address the limits of clinicians' knowledge during disease diagnosis, reduce human negligence, relatively lower medical costs, and safeguard the quality of care.

To better study medical text retrieval, the Text Retrieval Conference (TREC) introduced the Clinical Decision Support (CDS) task in 2014. The task gives the description of an electronic medical record as the input query; participants search an existing collection of medical literature and return the documents most relevant to the query, which are provided to doctors to help judge the patient's real needs. The electronic medical records used as queries are stored as free text, with the record data drawn from the ICU clinical database MIMIC-III, while the target document collection comes from PubMed Central, the US full-text database of biomedical and life science literature.

Judging from existing research on the TREC CDS task, most literature retrieval work focuses on processing the original query. The mainstream approach is keyword-based query expansion, and very little work addresses query processing for long electronic medical record text. Electronic medical record text is highly ambiguous, contains a great deal of redundancy, and is semantically unclear, so how to process it is a key difficulty of the medical literature retrieval task.

Summary of the invention

Judging from existing research on the TREC CDS task, most literature retrieval work focuses on processing the original query. The mainstream approach is keyword-based query expansion, and very little work addresses query processing for long electronic medical record text. An ordinary query contains a few to a dozen or so query words, whereas an electronic medical record description in clinical decision-making averages 50-200 query words. Most commercial and academic search engines perform poorly on such long queries, which leaves the task of reducing the original query to a shorter one to the user. In addition, a clear query intent allows targeted retrieval, which improves retrieval efficiency.

To address these problems, the present invention aims to reconstruct the query. It uses an SVM classifier to obtain the query intent, generates the sub-queries of the electronic medical record and screens them, then trains a query quality prediction model to pick the optimal sub-query, and finally combines that sub-query with the query intent to generate the reconstructed query.

To achieve the above object, the technical solution provided by the present invention is as follows:

The present invention provides a query reconstruction method for electronic medical record descriptions, comprising:

Step 1. Preprocess the electronic medical record text and the medical literature text in the data set;

Step 2. Train an SVM classifier to predict the query intent of the electronic medical record text;

Step 3. Obtain all sub-queries of the electronic medical record text and pre-screen them;

Step 4. Train a query quality prediction model and select the optimal sub-query from the sub-queries that pass the pre-screening of step 3;

Step 5. Combine the query intent obtained in step 2 with the optimal sub-query output by step 4 to obtain the final reconstructed query.

Beneficial effects

The present invention performs query reconstruction on long electronic medical record descriptions that contain redundant information, covering both prediction of the query intent and reduction of the original query based on query quality prediction. The invention analyzes the semantics of the original query and trains a classifier to determine the query intent. It also analyzes the set of all sub-queries of the original query, represents each sub-query by a set of query quality indicators, and proposes for the first time an indicator that reflects query expansion performance. On this basis, it trains a query quality prediction model to obtain the sub-query with the highest query quality, which replaces the original query.

Query reconstruction experiments were conducted on the TREC CDS data set, and a significant performance improvement was observed, confirming that query reconstruction improves retrieval results. The query reconstruction method for long electronic medical record text is of great significance for addressing the limits of clinicians' knowledge during disease diagnosis, reducing human negligence, relatively lowering medical costs, and safeguarding the quality of care.

Description of the drawings

The accompanying drawings further illustrate the present invention and form part of the specification. Together with the detailed description below they serve to explain the invention, but they do not limit it. In the drawings:

FIG. 1 is a schematic flowchart of the query reconstruction method;

FIG. 2 is an example of a TREC query topic;

FIG. 3 shows the result of processing electronic medical record text with the query reconstruction method.

Detailed description

The specific implementation of the present invention is shown in FIG. 1 and comprises the following 5 steps:

Each step is described in detail below.

Step 1: Preprocess the electronic medical record text and the medical literature text in the data set

In this embodiment, the data come from the TREC CDS Track data set, which includes electronic medical record text and a medical literature collection. The electronic medical record text comes from the 90 query topics defined by TREC in 2014-2016, with the record data drawn from the ICU clinical database MIMIC-III. The description field of each topic is taken as an original query. FIG. 2 shows an example of a query topic. The medical literature collection consists of more than 730,000 medical documents from PubMed Central, the US database of biomedical and life science literature.

Step 1 preprocesses the electronic medical record text and the medical literature text, and specifically includes the following sub-steps:

1.1. Extract plain text

This sub-step is not needed for the electronic medical record text, which is already plain text. The medical literature, however, is stored as web page files in XML format. The useless CSS and JS code must be removed, and the required plain text (the title, abstract, keywords, and body of each document) is extracted according to the XML tags, so that the preprocessed literature text has a uniform format.

The plain text of the electronic medical records and the plain text extracted from the medical literature are passed to step 1.2;
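The plain-text extraction described above can be sketched with the standard library. This is a minimal sketch, not the patent's implementation: the tag names in `FIELDS` are assumptions modeled on JATS-style article XML and should be checked against the actual files, and `extract_plain_text` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed tag names for title, abstract, keywords, and body.
FIELDS = ("article-title", "abstract", "kwd", "body")

def extract_plain_text(xml_string):
    """Return the text of the selected fields, one field per line."""
    root = ET.fromstring(xml_string)
    parts = []
    for tag in FIELDS:
        for element in root.iter(tag):
            # itertext() yields only text nodes, so nested markup is dropped;
            # split/join collapses runs of whitespace.
            text = " ".join("".join(element.itertext()).split())
            if text:
                parts.append(text)
    return "\n".join(parts)

sample = (
    "<article><front><article-title>Fever in adults</article-title>"
    "<abstract><p>A short <i>abstract</i>.</p></abstract>"
    "<kwd>fever</kwd></front>"
    "<body><p>Body text.</p></body></article>"
)
print(extract_plain_text(sample))
```

Selecting fields by tag name means style and script content is never copied into the output, which is one simple way to obtain the uniform format the step asks for.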

1.2. Remove stop words

A preprocessing word list is used to remove stop words from the plain text, including words that carry no semantic information and words that occur too frequently.

The result after stop-word removal is passed to step 1.3;

1.3. Lemmatization

Inflected forms are reduced to their root. In English, a word with the same meaning takes different forms across tenses, and these variants are reduced to a common root.

Lemmatization completes the text preprocessing of step 1. The preprocessed electronic medical record text is passed to steps 2 and 3, and the preprocessed medical literature text is passed to step 4.
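Stop-word removal and reduction to root forms can be illustrated as follows. The word list and the lemma table here are tiny hypothetical stand-ins chosen for the example; the source does not name the resources it uses, and a real system would rely on a full stop-word list and a proper lemmatizer.

```python
# Illustrative resources only (assumptions, not taken from the source).
STOP_WORDS = {"a", "an", "the", "of", "with", "and", "was", "is", "to", "in"}
LEMMAS = {
    "presented": "present", "presents": "present",
    "coughing": "cough", "coughed": "cough", "fevers": "fever",
}

def preprocess(text):
    """Lowercase, strip punctuation, drop stop words, map words to roots."""
    tokens = [t.strip(".,;:()").lower() for t in text.split()]
    tokens = [t for t in tokens if t and t not in STOP_WORDS]  # stop-word removal
    return [LEMMAS.get(t, t) for t in tokens]                  # root forms

print(preprocess("The patient presented with fevers and coughing."))
```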

Step 2: Train a three-class SVM classifier to predict the query intent of the electronic medical record text

Step 2 uses the preprocessed electronic medical record text from step 1 as the training set to train an SVM classifier that determines the query intent. It includes the following sub-steps.

2.1. Assign one of three class labels to each electronic medical record text in the training set: 1 if the text concerns diagnosis (Diagnosis), 2 if it concerns a treatment plan (Treatment), and 3 if it concerns diagnostic tests (Test). The labeled data are passed to step 2.2.

2.2. Train the three-class classifier.

The classifier is trained with the existing support vector machine (SVM) algorithm. Training takes as input the features of the electronic medical record text and the three class labels assigned in step 2.1. Two kinds of features are used: (1) TF-IDF values; (2) semantic information.

(1) TF-IDF is a statistical measure of how important a word is to one document in a corpus. Term frequency (TF) is the frequency with which a given word appears in the document. Inverse document frequency (IDF) is obtained by dividing the total number of documents by the number of documents containing the word and taking the base-10 logarithm of the quotient. The TF-IDF value is the product of the two:

$$\mathrm{TFIDF}(\omega) = \frac{n_\omega}{N} \times \log_{10}\frac{N_d}{N_\omega}$$

where n_ω is the number of occurrences of word ω in the document, N is the total number of words in the document, N_d is the total number of documents in the corpus, and N_ω is the number of documents in the corpus that contain word ω.
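A direct transcription of this definition, with term frequency as the in-document count divided by the document length and IDF as a base-10 logarithm, might look like the sketch below. The function name and the list-of-token-lists corpus representation are assumptions made for the example.

```python
import math

def tf_idf(word, document, corpus):
    """TF-IDF of `word` in `document` (a token list) within `corpus`
    (a list of token lists)."""
    n_w = document.count(word)                           # occurrences in the document
    n_total = len(document)                              # total words in the document
    n_docs = len(corpus)                                 # documents in the corpus
    n_docs_with_w = sum(1 for d in corpus if word in d)  # documents containing the word
    if n_total == 0 or n_docs_with_w == 0:
        return 0.0
    return (n_w / n_total) * math.log10(n_docs / n_docs_with_w)

# Ten documents; "fever" occurs only in the first one, twice out of four tokens.
corpus = [["fever", "cough", "fever", "pain"]] + [["pain"]] * 9
print(tf_idf("fever", corpus[0], corpus))  # (2/4) * log10(10/1)
```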

(2) Semantic information consists of three items: whether the text contains a diagnosis result (0 or 1), whether it indicates that examinations have been completed (0 or 1), and the length of the query text (0-200).
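The three semantic features can be assembled into a small vector. The source does not say how the two binary flags are derived, so the cue phrases below are purely hypothetical placeholders, as is the choice to clip the length at 200 to stay inside the stated range.

```python
# Hypothetical cue phrases (assumptions; not specified in the source).
DIAGNOSIS_CUES = ("diagnosed with", "diagnosis of")
TEST_DONE_CUES = ("ct scan showed", "x-ray showed", "laboratory results", "mri revealed")

def semantic_features(record_text):
    """Return [has_diagnosis (0/1), tests_done (0/1), length (0-200)]."""
    lowered = record_text.lower()
    has_diagnosis = int(any(cue in lowered for cue in DIAGNOSIS_CUES))
    tests_done = int(any(cue in lowered for cue in TEST_DONE_CUES))
    length = min(len(record_text.split()), 200)  # clipped to the stated range
    return [has_diagnosis, tests_done, length]

print(semantic_features("Patient diagnosed with pneumonia. CT scan showed infiltrates."))
```

In a full pipeline this vector would be concatenated with the TF-IDF features before being passed to the SVM.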

The trained three-class classifier is passed to step 2.3.

2.3. Feed the electronic medical record text into the trained three-class classifier and pass the classification result (the query intent) to step 5.

Step 3: Obtain all sub-queries of the electronic medical record text and pre-screen them

In theory, a query containing n query words has exponentially many sub-queries (2^n − 1 non-empty subsets). For example, the 3-word query "fever cough headache" can be split into the sub-queries "fever", "cough", "headache", "fever cough", "fever headache", "cough headache", and "fever cough headache". Exhaustively ranking all possible sub-queries is impractical, so the sub-queries must first be pre-screened. Pre-screening consists of the following sub-steps.
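The enumeration in this example can be reproduced with `itertools.combinations`. This sketch treats a sub-query as an order-preserving subset of the query words, which matches the seven sub-queries listed; `all_subqueries` is a hypothetical helper name.

```python
from itertools import combinations

def all_subqueries(query):
    """Every non-empty, order-preserving subset of the query words."""
    words = query.split()
    return [
        " ".join(combo)
        for r in range(1, len(words) + 1)
        for combo in combinations(words, r)
    ]

subqueries = all_subqueries("fever cough headache")
print(len(subqueries))  # 2**3 - 1 = 7
print(subqueries)
```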

3.1. From all sub-queries of the electronic medical record text, select those whose length is between 3 and 10, where length is the number of words in the sub-query. Research shows that retrieval works best with query lengths between 3 and 6; because the invention targets long electronic medical record text, the maximum length threshold is set to 10. The result is passed to step 3.2.
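The length restriction can be applied while generating candidates rather than afterwards. The sketch below, with hypothetical function names, counts the candidates for a given query length and yields them lazily.

```python
from itertools import combinations
from math import comb

def candidate_count(n_words, min_len=3, max_len=10):
    """Number of sub-queries whose length lies in [min_len, max_len]."""
    return sum(comb(n_words, r) for r in range(min_len, min(max_len, n_words) + 1))

def length_filtered_subqueries(words, min_len=3, max_len=10):
    """Lazily yield the sub-queries (as tuples) within the length limits."""
    for r in range(min_len, min(max_len, len(words)) + 1):
        for combo in combinations(words, r):
            yield combo

print(candidate_count(5))  # C(5,3) + C(5,4) + C(5,5) = 16
```

Even with the length limit, the candidate set grows very quickly with query length (a 50-word record already has C(50,3) = 19600 sub-queries of length 3 alone), which is why a second, cheaper screening criterion follows.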

3.2. Compute the average mutual information of each sub-query obtained in 3.1 and select the 30 sub-queries with the highest values. The average mutual information of a sub-query is computed as follows:

where n(x, y) is the frequency with which words x and y co-occur within a window of size 25 in the documents of the whole corpus, n(x) and n(y) are the frequencies of words x and y in the corpus, and N_c is the number of words in the whole corpus. The mutual information of every pair of words in a sub-query is computed, and the weighted average of these values is taken as the average mutual information of the sub-query.
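The formula itself is not reproduced in this text. One form consistent with the symbols defined here, assuming the standard pointwise mutual information, is $I(x,y) = \log\frac{n(x,y)/N_c}{(n(x)/N_c)\,(n(y)/N_c)}$. The sketch below uses that assumed form, a natural logarithm, uniform weights for the average (the source says "weighted" without giving the weights), and a score of 0 for pairs that never co-occur; all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def corpus_counts(documents, window=25):
    """Count term occurrences and within-window co-occurrences.
    `documents` is a list of token lists."""
    term_counts, pair_counts, total = Counter(), Counter(), 0
    for tokens in documents:
        total += len(tokens)
        term_counts.update(tokens)
        for i, x in enumerate(tokens):
            # Pair x with each later token inside the same window.
            for y in tokens[i + 1:i + window]:
                if x != y:
                    pair_counts[frozenset((x, y))] += 1
    return term_counts, pair_counts, total

def average_mutual_information(subquery, term_counts, pair_counts, total):
    """Unweighted mean of the pairwise mutual information (assumed form)."""
    scores = []
    for x, y in combinations(subquery, 2):
        n_xy = pair_counts[frozenset((x, y))]
        if n_xy == 0 or term_counts[x] == 0 or term_counts[y] == 0:
            scores.append(0.0)  # assumption: unseen pairs contribute 0
            continue
        p_x, p_y = term_counts[x] / total, term_counts[y] / total
        scores.append(math.log((n_xy / total) / (p_x * p_y)))
    return sum(scores) / len(scores) if scores else 0.0

docs = [["fever", "cough"], ["fever", "cough"], ["pain", "rash"]]
tc, pc, total = corpus_counts(docs)
print(average_mutual_information(["fever", "cough"], tc, pc, total))
```

Keeping the 30 best candidates is then a matter of sorting the candidates by this score in descending order and taking the first 30.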

Step 3 yields the 30 pre-screened sub-queries, which are passed to step 4.

Step 4: Train the query quality prediction model and select the optimal sub-query from the pre-screened sub-queries

4.1. Label the sub-queries with query quality scores

One round of retrieval is run for each pre-screened sub-query obtained in step 3, with the medical literature collection preprocessed in step 1 as the target collection. The search engine is Indri 5.11 from the open-source Lemur project. The retrieval results are compared against the evaluation standard provided by TREC to compute the average precision of the retrieval, which is recorded as the query quality score of that sub-query. The sub-queries labeled with query quality scores are passed to step 4.2.
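Average precision for one sub-query can be computed from its ranked result list and the set of documents judged relevant. This is the standard definition, shown as a sketch; the document identifiers are made up, and the ranking is assumed to contain no duplicates.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Mean of the precision values at each rank where a relevant document
    appears, divided by the total number of relevant documents."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

# Relevant documents are found at ranks 2 and 4: (1/2 + 2/4) / 2
print(average_precision(["d3", "d1", "d7", "d2"], {"d1", "d2"}))
```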

4.2. Train the query quality prediction model

The query quality prediction model is trained with the existing SVMRank algorithm. Training takes as input the indicators that characterize sub-query quality and the query quality scores labeled in step 4.1.
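SVMRank reads training data in the SVM-light style, one line per example of the form `<target> qid:<id> <index>:<value> ...`, with feature indices starting at 1 and examples of the same `qid` ranked against each other. A minimal sketch of serializing the labeled sub-queries follows; treating each original query as one `qid`, and the function name, are assumptions made for illustration.

```python
def to_svmrank_lines(training_data):
    """training_data maps a query id to a list of
    (quality_score, feature_values) pairs, one pair per sub-query."""
    lines = []
    for qid in sorted(training_data):
        for score, features in training_data[qid]:
            feats = " ".join(
                f"{index}:{value:.6g}" for index, value in enumerate(features, start=1)
            )
            lines.append(f"{score:.4f} qid:{qid} {feats}")
    return lines

example = {1: [(0.25, [1.5, 0.0]), (0.1, [2.0, 3.0])]}
for line in to_svmrank_lines(example):
    print(line)
```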

Model training uses the following indicators, computed for each sub-query in the training set: (1) inverse document frequency (IDF) indicators; (2) the simplified query clarity indicator; (3) corpus/query similarity (SCQ) indicators; (4) the query expandability indicator.

Before these indicators are introduced, the symbols used in this step are defined. A query Q is assumed to contain the query words ω_1, …, ω_n. In corpus C, n(ω_i) is the frequency of query word ω_i in the corpus; n(ω_i, ω_j) is the frequency with which query words ω_i and ω_j (i ≠ j) co-occur within a window of 25 words in the corpus; N_c is the total number of words in the corpus; N_ω is the number of documents in which query word ω appears; and N_d is the number of documents in the corpus. P_c(ω) is the probability of query word ω in the corpus, P(ω|Q) is the probability of ω in query Q, and S_ω is the synonym set of word ω.

(1) The inverse document frequency indicator is computed as:

where N_ω is the number of documents containing word ω and N_d is the total number of documents in the corpus. For each sub-query, the sum, maximum, standard deviation, arithmetic mean, geometric mean, and harmonic mean of the IDF values of its query words are computed and used together as query quality indicators.
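The six aggregate statistics are available in Python's `statistics` module. In the sketch below the per-word IDF is assumed to be log10(N_d / N_ω), the same form used for the TF-IDF feature earlier, because the indicator's own formula is not reproduced in this text; the population standard deviation is likewise an assumption.

```python
import math
import statistics

def idf(n_docs, n_docs_with_word):
    # Assumed form: log10(N_d / N_ω).
    return math.log10(n_docs / n_docs_with_word)

def aggregate(values):
    """Sum, max, standard deviation, and three means of per-word values.
    The geometric and harmonic means require positive values."""
    return {
        "sum": sum(values),
        "max": max(values),
        "std": statistics.pstdev(values),  # population std (assumption)
        "arithmetic_mean": statistics.mean(values),
        "geometric_mean": statistics.geometric_mean(values),
        "harmonic_mean": statistics.harmonic_mean(values),
    }

# IDF values of a three-word sub-query in a 1000-document corpus.
idf_values = [idf(1000, 100), idf(1000, 10), idf(1000, 1)]
print(aggregate(idf_values))
```

The same `aggregate` helper applies unchanged to any other per-word indicator that is summarized with these six statistics.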

(2) The simplified query clarity indicator is computed as:

where P_ml(ω|Q) is the frequency of word ω in query Q and P_c(ω) is the frequency of word ω in the corpus.
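The formula is not reproduced in this text. Assuming the commonly used simplified clarity score, $\mathrm{SCS}(Q) = \sum_{\omega \in Q} P_{ml}(\omega|Q)\,\log_2\frac{P_{ml}(\omega|Q)}{P_c(\omega)}$, the two probabilities defined above combine as in the sketch below; skipping words that never occur in the corpus is a further assumption.

```python
import math
from collections import Counter

def simplified_clarity_score(query_terms, corpus_term_counts, corpus_size):
    """Divergence between the query's term distribution and the corpus's
    (assumed formula)."""
    counts = Counter(query_terms)
    score = 0.0
    for term, count in counts.items():
        p_ml = count / len(query_terms)                        # frequency in the query
        p_c = corpus_term_counts.get(term, 0) / corpus_size    # frequency in the corpus
        if p_c > 0:  # assumption: words unseen in the corpus are skipped
            score += p_ml * math.log2(p_ml / p_c)
    return score

counts = {"fever": 250, "cough": 125}
# 0.5 * log2(0.5 / 0.25) + 0.5 * log2(0.5 / 0.125)
print(simplified_clarity_score(["fever", "cough"], counts, 1000))
```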

(3) The corpus/query similarity (SCQ) indicator is computed as:

As with the inverse document frequency indicators, the sum, maximum, standard deviation, arithmetic mean, geometric mean, and harmonic mean of the SCQ values of the query words are computed and used together as query quality indicators.
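The SCQ formula is not reproduced in this text. A commonly cited per-word definition, assumed here rather than confirmed by the source, is $\mathrm{SCQ}(\omega) = (1 + \ln n(\omega)) \cdot \ln\!\left(1 + \frac{N_d}{N_\omega}\right)$, which the following sketch implements with a hypothetical function name.

```python
import math

def scq(term_corpus_freq, n_docs, n_docs_with_term):
    """Per-word corpus/query similarity (assumed formula).
    Words absent from the corpus score 0."""
    if term_corpus_freq == 0 or n_docs_with_term == 0:
        return 0.0
    return (1 + math.log(term_corpus_freq)) * math.log(1 + n_docs / n_docs_with_term)

print(scq(1, 10, 10))  # (1 + ln 1) * ln(1 + 1) = ln 2
```

The per-word values would then be summarized with the same six statistics used for the IDF values.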

(4) Query expandability indicator

The present invention proposes, for the first time, an indicator that reflects query expansion performance: the query expandability indicator. It is computed as:

where S_ω is the synonym set of query word ω and P(α|Q) is the probability of query word α in the query model. Queries with higher expandability can be regarded as having higher query quality, because more relevant documents can be retrieved after they are expanded.

The trained query quality prediction model is passed to step 4.3.

4.3. For each pre-screened sub-query obtained in step 3, compute the 4 indicators from step 4.2 that characterize sub-query quality and feed them into the query quality prediction model trained in step 4.2 to obtain the query quality score of the sub-query. The sub-query with the highest query quality score among the 30 is selected as the optimal sub-query and passed to step 5.
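Selecting the optimal sub-query is an argmax over predicted scores. The sketch below stands in for the trained model with a linear scoring function (a dot product of a weight vector and the indicator vector); the weights and feature values are invented for illustration and do not come from the source.

```python
def best_subquery(candidates, weights):
    """candidates: list of (subquery, feature_vector) pairs.
    weights: a learned weight vector (hypothetical values here)."""
    def score(features):
        return sum(w * x for w, x in zip(weights, features))
    return max(candidates, key=lambda candidate: score(candidate[1]))[0]

candidates = [
    ("fever cough headache", [1.0, 0.0]),
    ("cough headache rash", [0.0, 2.0]),
]
print(best_subquery(candidates, [0.5, 0.4]))  # scores 0.5 vs 0.8
```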

Step 5: Combine the query intent and the optimal sub-query to obtain the final reconstructed query

The query intent obtained in step 2 and the optimal sub-query obtained in step 4 are combined to form the reconstructed query, which is the final result.
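The source does not specify how the intent and the sub-query are joined. One simple possibility, shown purely as an assumption, is to append a word for the predicted intent class (labels 1, 2, 3 for diagnosis, treatment, test) to the sub-query terms.

```python
# Hypothetical mapping from the class labels to query words.
INTENT_TERMS = {1: "diagnosis", 2: "treatment", 3: "test"}

def reconstruct_query(intent_label, best_subquery_terms):
    """Append the intent word to the optimal sub-query (assumed combination)."""
    return " ".join(list(best_subquery_terms) + [INTENT_TERMS[intent_label]])

print(reconstruct_query(2, ["fever", "cough", "headache"]))
```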

Claims

1. A query reconstruction method for electronic medical record descriptions, characterized by comprising:

Step 1, preprocessing the electronic medical record text and the medical literature text in a data set;

Step 2, training an SVM classifier to predict the query intent of the electronic medical record text;

Step 3, obtaining all sub-queries of the electronic medical record text and pre-screening them;

Step 4, training a query quality prediction model, and selecting the optimal sub-query from the sub-queries output by the pre-screening in step 3;

Step 5, combining the query intent obtained in step 2 with the optimal sub-query output in step 4 to obtain the final reconstructed query;

wherein

Step 1: preprocessing electronic medical record text and medical literature text in a data set

1.1, extracting plain text

This sub-step is not required for the electronic medical record text, which is already plain text. The medical literature text is stored as web page files in XML format; the useless CSS and JS code must be removed, and the required plain text, including the title, abstract, keywords, and body of each document, is extracted according to the XML tags, so that the preprocessed literature text has a uniform format.

The plain text of the electronic medical records and the plain text extracted from the medical literature are provided to step 1.2;

1.2, removing stop words

Stop words are removed from the plain text using a preprocessing word list; the stop words include words that carry no semantic information and words that occur too frequently.

The result after stop-word removal is provided to step 1.3;

1.3, lemmatization

Inflected forms are reduced to their root: in English, a word with the same meaning takes different forms across tenses, and these variants are reduced to a common root.

Lemmatization completes the text preprocessing of step 1; the preprocessed electronic medical record text is provided to step 2 and step 3, and the preprocessed medical literature text is provided to step 4.

Step 2: training a three-class SVM (support vector machine) classifier to predict the query intent of the electronic medical record text

Step 2 uses the preprocessed electronic medical record text obtained in step 1 as the training set to train an SVM classifier that determines the query intent, and specifically comprises the following sub-steps.

2.1, assigning one of three class labels to each electronic medical record text in the training set: if the text concerns diagnosis (Diagnosis), it is labeled 1; if the text concerns a treatment plan (Treatment), it is labeled 2; if the text concerns diagnostic tests (Test), it is labeled 3. The labeled results are provided to step 2.2.

2.2, training the three-class classifier.

The three-class classifier is trained with the existing SVM algorithm; the features of the electronic medical record text and the three class labels assigned in step 2.1 are required as input during training. Training the classifier uses two kinds of features of the electronic medical record text: (1) the TF-IDF value; (2) semantic information.

(1) TF-IDF is a statistical measure of how important a word is to one document in a corpus. Term frequency (TF) is the frequency with which a given word appears in the document. Inverse document frequency (IDF) is obtained by dividing the total number of documents by the number of documents containing the word and taking the base-10 logarithm of the quotient. The TF-IDF value is the product of these two values:

$$\mathrm{TFIDF}(\omega) = \frac{n_\omega}{N} \times \log_{10}\frac{N_d}{N_\omega}$$

where n_ω is the number of occurrences of word ω in the document, N is the total number of words in the document, N_d is the total number of documents in the corpus, and N_ω is the number of documents in the corpus that contain word ω.

(2) The semantic information consists of three items: whether a diagnosis result is included (value 0/1), whether examinations have been completed (value 0/1), and the query text length (value 0-200).

The trained three-class classifier is provided to step 2.3.

2.3, inputting the electronic medical record text into the trained three-class classifier, and providing the classification result (namely the query intent) to step 5.

Step 3: obtaining all sub-queries of the electronic medical record text and pre-screening them

In theory, a query containing n query words has exponentially many sub-queries. For example, the 3-word query "fever cough headache" can be split into the sub-queries "fever", "cough", "headache", "fever cough", "fever headache", "cough headache", and "fever cough headache". Exhaustively ranking all possible sub-queries is impractical, so the sub-queries need to be pre-screened first. Sub-query pre-screening comprises the following sub-steps.

3.1, selecting the sub-queries with a length of 3-10 from all sub-queries of the electronic medical record text. The sub-query length is the number of words in the query. Research shows that retrieval works best with query lengths between 3 and 6; because the invention targets long electronic medical record text, the maximum length threshold is set to 10. The result is provided to step 3.2.

3.2, calculating the average mutual information of each sub-query obtained in step 3.1, and selecting the 30 sub-queries with the highest values. The average mutual information of a sub-query is computed as follows:

where n(x, y) is the frequency with which words x and y co-occur within a window of size 25 in the documents of the whole corpus, n(x) and n(y) are the frequencies of words x and y in the corpus, and N_c is the number of words in the whole corpus. The mutual information of every pair of words in a sub-query is computed, and the weighted average of these values is taken as the average mutual information of the sub-query.

Step 3 finally yields 30 pre-screened sub-queries, which are provided to step 4.

Step 4: training a query quality prediction model, and selecting the optimal sub-query from the pre-screened sub-queries

4.1, labeling sub-queries with query quality scores

One round of retrieval is performed for each pre-screened sub-query obtained in step 3, with the medical literature text collection preprocessed in step 1 as the target collection. The search engine is Indri 5.11 from the open-source Lemur project. The retrieval results are compared against the evaluation standard provided by TREC to compute the average precision of the retrieval, which is recorded as the query quality score of the sub-query. The sub-queries labeled with query quality scores are provided to step 4.2.

4.2, training the query quality prediction model

The query quality prediction model is trained with the existing SVMRank algorithm; the indicators that characterize sub-query quality and the query quality scores labeled in step 4.1 are required as input during training.

Model training uses the following indicators, computed for each sub-query in the training set: (1) inverse document frequency (IDF) indicators; (2) the simplified query clarity indicator; (3) corpus/query similarity (SCQ) indicators; (4) the query expandability indicator.

The symbols used in this step are defined before the indicators are introduced. A query Q is assumed to contain the query words ω_1, …, ω_n. In corpus C, n(ω_i) is the frequency of query word ω_i in the corpus; n(ω_i, ω_j) is the frequency with which query words ω_i and ω_j (i ≠ j) co-occur within a window of 25 words in the corpus; N_c is the total number of words in the corpus; N_ω is the number of documents in which query word ω appears; and N_d is the number of documents in the corpus. P_c(ω) is the probability of query word ω in the corpus, P(ω|Q) is the probability of ω in query Q, and S_ω is the synonym set of word ω.

(1) The inverse document frequency indicator is computed as follows:

where N_ω is the number of documents containing word ω and N_d is the total number of documents in the corpus. For each sub-query, the sum, maximum, standard deviation, arithmetic mean, geometric mean, and harmonic mean of the IDF values of its query words are computed and used together as query quality indicators.

(2) The simplified query clarity indicator is computed as follows:

where P_ml(ω|Q) is the frequency of word ω in query Q and P_c(ω) is the frequency of word ω in the corpus.

(3) The corpus/query similarity characteristic index calculation formula is as follows:

As with the inverse document frequency indicators, the sum, maximum, standard deviation, arithmetic mean, geometric mean, and harmonic mean of the SCQ values of the query words are computed and used together as query quality indicators.

(4) Query expandability indicator

The invention proposes, for the first time, an indicator that reflects query expansion performance, namely the query expandability indicator. It is computed as follows:

where S_ω is the synonym set of query word ω and P(α|Q) is the probability of query word α in the query model.

The trained query quality prediction model is provided to step 4.3.

4.3, calculating, for each pre-screened sub-query obtained in step 3, the 4 indicators from step 4.2 that characterize sub-query quality, and inputting them into the query quality prediction model trained in step 4.2 to obtain the query quality score of the sub-query. The sub-query with the highest query quality score among the 30 sub-queries is selected as the optimal sub-query, and the result is provided to step 5.

Step 5: obtaining the final reconstructed query by combining the query intent and the optimal sub-query

The query intent obtained in step 2 and the optimal sub-query obtained in step 4 are combined to obtain the reconstructed query, which is the final result.