CN114691826B - Medical data information retrieval method based on co-occurrence analysis and spectral clustering


Info

Publication number
CN114691826B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210234485.5A
Other languages
Chinese (zh)
Other versions
CN114691826A (en
Inventor
陈宣亦
张子成
章斌
朱志安
杨杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Yunshe Intelligent Technology Co ltd
Original Assignee
Nanjing Yunshe Intelligent Technology Co ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3338 Query expansion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses a medical data information retrieval method based on co-occurrence analysis and spectral clustering, which is characterized by comprising the steps of carrying out query expansion on retrieval task words, classifying the expanded retrieval task words into retrieval words, expansion words and feature words, carrying out first scoring on documents, selecting the documents if the first scoring is greater than a threshold value T, and abandoning the documents if the first scoring is less than the threshold value T; performing secondary scoring and co-occurrence analysis on the selected documents to obtain secondary scores and co-occurrence scores, and calculating the comprehensive scores of the documents through the primary scores, the secondary scores and the co-occurrence scores; forming vectors for describing documents by using a bag-of-words model and a chemical word list, a medical subject word list, an abstract and a keyword list of the documents, clustering selected documents by using a vector distance matrix as the input of spectral clustering, and outputting clustering clusters; and outputting the class with the highest average comprehensive score as a retrieval result, and sorting and outputting the documents in the retrieval result according to the comprehensive score in a descending order.

Description

Medical data information retrieval method based on co-occurrence analysis and spectral clustering
Technical Field
The invention relates to the field of medical data information retrieval, in particular to a medical data information retrieval method based on co-occurrence analysis and spectral clustering.
Background
With the continued development of science and technology, information on the Internet has become ever more abundant and easier to obtain, and the network has become an indispensable part of daily study and life. Medicine has likewise entered the era of big data: basic medical knowledge, such as the symptoms, treatment and prevention of diseases, is easy to find online. Many online medical question-and-answer websites have also emerged, so that online consultation can replace a face-to-face examination between patient and doctor, which saves manpower, material resources and time and largely protects patient privacy. In addition, for routine decision-making work that involves a great deal of repetition, a well-applied computer medical information retrieval system can improve efficiency, save cost and reduce errors. Reasonable use of computer technology can both improve the quality of clinical service and greatly reduce cost. Developing a computer-aided medical information retrieval system is therefore very important.
In practice, every decision a doctor makes matters greatly to the patient, so doctors must keep learning and follow the latest techniques and methods of clinical science. Authoritative literature and the latest research results in the medical field can be consulted comprehensively online, so a medical search model plays a vital role. For medical workers who face a case in which a decision is hard to make, searching the network for related biomedical documents as case references and inspiration is an important way to solve the problem.
Precision medicine is a new medical concept and model that grew out of personalized medicine, driven by rapid progress in genome sequencing and by the combined application of bioinformatics and big data science. In essence, it uses genomics, proteomics, other omics technologies and frontier medical technologies to analyze, identify, verify and apply biomarkers for large sample populations and specific disease types. In this way the causes and therapeutic targets of a disease can be found accurately, and the different states and stages of a disease can be classified precisely, so that personalized and precise treatment can be given for a disease and for a specific patient, improving the effectiveness of diagnosis, treatment and prevention.
The emphasis of precision medicine lies not on "medicine" but on "precision". Since the Human Genome Project, a historic undertaking in genetics, researchers in systems biology have done extensive work around the concept of "genomics". Viewed as data flow across the life sciences, the progression from the central dogma (the flow of information from DNA to protein) to systems biology (the formation of information networks) is a path toward "precision". Compared with personalized medicine, precision medicine pays more attention to the deep characteristics of the disease and the high precision of the drug; it is an advanced medical technology built on a deep understanding of people, diseases and drugs.
The purpose of information retrieval (IR) is to retrieve documents relevant to a given query. The relevance of a document to a query is usually measured by a score given by an IR model, such as the classical BM25 model, and documents are ranked accordingly. Over the past decades, machine learning techniques have been applied to information retrieval with great success.
The earliest machine learning approach was learning to rank, which falls into three main categories: pointwise, pairwise and listwise methods. Common pointwise methods, such as logistic regression, take the feature vector of each document as input and output the relevance of each document. Pairwise methods, such as RankSVM and RankBoost, take the feature vectors of a pair of documents as input and output the relative relevance of the two documents. Listwise methods, such as ListNet, AdaRank and LambdaMART, take a set of documents associated with a query as input and output a ranked list. All learning-to-rank models focus on learning the best way to combine features through training; however, a successful learning-to-rank algorithm relies on effective hand-crafted features. Designing these features is usually time-consuming and laborious, which greatly hinders the further development of learning-to-rank algorithms.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention aims to provide a medical data information retrieval method based on co-occurrence analysis and spectral clustering.
In order to realize the purpose of the invention, the technical scheme adopted by the invention is as follows:
a medical data information retrieval method based on co-occurrence analysis and spectral clustering, the method comprising the steps of:
(1) Carrying out query expansion on the retrieval task vocabulary, classifying the expanded retrieval task vocabulary into retrieval words, expansion words and feature words, scoring the documents for the first time, selecting the documents if the score for the first time is greater than a threshold value T, and abandoning the documents if the score for the first time is less than the threshold value T;
(2) Performing secondary scoring and co-occurrence analysis on the selected documents to obtain secondary scores and co-occurrence scores, and calculating the comprehensive scores of the documents through the primary scores, the secondary scores and the co-occurrence scores;
(3) Forming a vector describing each document with a bag-of-words model from the document's chemical word list, medical subject word list, abstract and keyword list, clustering the selected documents with the vector distance matrix as the input of spectral clustering, and outputting the clusters;
(4) And outputting the class with the highest average comprehensive score as a retrieval result, and sorting and outputting the documents in the retrieval result according to the comprehensive score in a descending order.
Further, in the step (1), the first scoring method scores each document through its chemical word list, medical subject word list, abstract and keyword list to obtain the first score Frist_Score of the document;
wherein the search words, expansion words and feature words are worth 3 points, 2 points and 1 point respectively;
the retrieval task words contained in the chemical word list, medical subject word list, abstract and keyword list of the document are traversed; if a retrieval task word is a search word, 3 points are added to Frist_Score; if it is an expansion word, 2 points are added; if it is a feature word, 1 point is added; the accumulated total is the first score Frist_Score.
Further, in the step (1), the threshold T is the third quartile of the first scores after duplicate values are removed.
Further, in the step (2), the secondary scoring method calculates the average scores of the chemical word list, the medical subject word list and the keyword list;
the retrieval task words contained in the chemical word list of the document are traversed; if a retrieval task word is a search word, 3 points are added to the chemical word list score; if it is an expansion word, 2 points are added; if it is a feature word, 1 point is added; the accumulated total is the chemical word list score, and dividing it by the length of the chemical word list gives the average score;
the secondary score Second_Score is then obtained by averaging the average scores of the chemical word list, the medical subject word list and the keyword list and adding the word-frequency score; the word-frequency score is the number of occurrences of the search words and is at most 4.
Further, in the step (2), the composite score Composite_Score is:
$$\mathrm{Composite\_Score} = \alpha \cdot \mathrm{Frist\_Score} + \beta \cdot \mathrm{Second\_Score} + \gamma$$
wherein Frist_Score is the first score, α is the first-score weight, Second_Score is the secondary score, β is the secondary-score weight, and γ is the co-occurrence score; if the search words co-occur in the document abstract, the co-occurrence score is increased.
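Further, in the step (3), the classification rule of the spectral clustering is as follows:
$$
\mathrm{Category\_Number} =
\begin{cases}
1, & \mathrm{MMS}(\mathrm{two}) - \mathrm{MMS}(\mathrm{one}) < \theta \\
2, & \mathrm{MMS}(\mathrm{two}) - \mathrm{MMS}(\mathrm{one}) > \eta \\
2, & \mathrm{MMS}(\mathrm{two}) - \mathrm{MMS}(\mathrm{one}) \in [\theta, \eta] \text{ and } \mathrm{MMS}(\mathrm{three}) \le \mathrm{MMS}(\mathrm{two}) \\
3, & \mathrm{MMS}(\mathrm{two}) - \mathrm{MMS}(\mathrm{one}) \in [\theta, \eta] \text{ and } \mathrm{MMS}(\mathrm{three}) > \mathrm{MMS}(\mathrm{two})
\end{cases}
$$
wherein Category_Number is the number of classes, θ is the lower limit of the classification threshold, and η is the upper limit of the classification threshold;
MMS(x) denotes the average composite score of the class with the highest average composite score when the documents are divided into x classes:
$$\mathrm{MMS}(x), \quad x \in \{\mathrm{one}, \mathrm{two}, \mathrm{three}\}$$
wherein one, two and three denote the number of classes; MMS(one) is the average composite score of the initial, unclustered sample, and MMS(two) is the average composite score of the higher-scoring group when the documents are divided into 2 classes.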
Further, in the step (3), the classification process of the spectral clustering comprises:
(1) Dividing the selected documents into 2 classes with a spectral clustering algorithm; if the difference between the average composite score of the higher-scoring class and the average composite score of the initial sample is less than θ, the sample is relatively homogeneous, and the results are output directly in descending order of composite score;
(2) If the difference between the average composite score of the higher-scoring class and the average composite score of the initial sample is greater than η, the sample is clearly heterogeneous, and the higher-scoring group of the 2 classes is output;
(3) If the difference between the average composite score of the higher-scoring class and the average composite score of the initial sample lies in [θ, η], the spectral clustering algorithm is applied again to divide the selected documents into 3 classes; wherein θ is the lower limit of the classification threshold and η is the upper limit of the classification threshold;
(4) If the highest average composite score among the 3 classes is higher than the highest average composite score among the 2 classes, the group with the highest average composite score among the 3 classes is output; otherwise, the higher-scoring group of the 2 classes is output.
Further, in the step (1), a MeSH database is selected for query expansion of the retrieval task vocabulary.
Compared with the prior art, the method has the following advantages. The secondary scoring uses average scores, which prevents irrelevant documents from receiving an excessively high first score merely because their text is long. Co-occurrence analysis then further raises the score of documents in which the disease and gene (search words) of the retrieval task co-occur, improving retrieval precision. Finally, the results output after spectral clustering are considerably better than the results before clustering, which shows that clustering markedly improves the separation of relevant from irrelevant documents.
Drawings
FIG. 1 is a scoring rule for retrieving task words;
FIG. 2 is a flow chart of a medical data information retrieval method based on co-occurrence analysis and spectral clustering according to the present invention;
FIG. 3 shows the document ranking after composite scoring, before and after spectral clustering;
FIG. 4 is a partial document keyword comparison before and after clustering;
FIG. 5 is a schematic before and after document clustering.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present application is not limited thereby.
The medical data information retrieval method based on co-occurrence analysis and spectral clustering proceeds as follows. Documents are first scored through the medical subject word list, chemical word list, keyword list and abstract. Average scoring is then used so that irrelevant documents do not receive an excessively high score merely because their text is long. Co-occurrence analysis next raises the score of documents in which the disease and gene (search words) of the retrieval task co-occur. Finally, the documents are clustered by spectral clustering, and the group of documents with the highest score is output as the result of the retrieval model.
Medical Subject Headings (MeSH) is a controlled vocabulary produced by the U.S. National Library of Medicine, used mainly to index, catalog and search biomedical and health-related information and documents. Its importance in literature search shows in two respects: accuracy and specificity. Indexers, when entering information into the retrieval system, and searchers, when using that information, both take the subject headings as standard terms, so that indexing terms and search terms agree and the best retrieval effect is achieved. Medical subject headings are crucial to the retrieval of medical literature, and their comprehensiveness and accuracy have an important influence on the retrieval result.
In the invention, expanded MeSH terms are added to the document scoring model during query expansion; the expansion uses the MeSH database produced by the U.S. National Library of Medicine.
The documents are scored first. Scoring has three parts: the first score, the secondary score and the composite score.
(1) Scoring for the first time;
In the first scoring, diseases and genes that appear in the retrieval task (search words) are worth 3 points, words obtained by query expansion (expansion words) are worth 2 points, and words describing inherent characteristics of the patient such as age and sex (feature words) are worth 1 point.
The rationale is as follows: diseases and genes are directly related to what a document describes, so their weight is largest; the expanded medical subject words are only indirectly related to the document, so their weight is lower; and inherent characteristics such as age and sex appear in a great many documents, so their weight is set lowest in order to highlight the documents related to the query and distinguish them from irrelevant ones.
Taking the first task of 2017 TREC Precision Medicine as an example, a tree-shaped scoring model is established. The retrieval task is shown in Table 1 and the MeSH-expanded medical subject words in Table 2; together they give all retrieval task words, and the scoring rules are shown in FIG. 1.
TABLE 1 2017 TREC Precision Medicine retrieval task 1
disease        gene                  demographic         other
Liposarcoma    CDK4 Amplification    38-year-old male    GERD
TABLE 2 2017 TREC Precision Medicine retrieval task 1 expanded medical subject terms
[Table 2: image in the original publication, not reproduced]
As shown in FIG. 1, all retrieval task words are classified as search words, expansion words or feature words, worth 3, 2 and 1 points respectively. In the first scoring, each document is scored once through its chemical word list, medical subject word list, abstract and keyword list to obtain the document's total score Frist_Score.
ChemicalList denotes the list of chemical words of a document; MeshHeadingList denotes its list of medical subject words; Abstract denotes its abstract; KeyWordList denotes its list of keywords; SearchWordList denotes the list of search words; ExpandWordList denotes the list of expansion words; CharacteristicList denotes the list of feature words. WordList is the union of ChemicalList, MeshHeadingList, KeyWordList and Abstract.
If a word in the document's WordList is a retrieval task word in SearchWordList, 3 points are added to Frist_Score; if it is in ExpandWordList, 2 points are added; if it is in CharacteristicList, 1 point is added. The document's WordList is traversed and compared against all retrieval task words, and the points are accumulated to give the document's total score Frist_Score.
The first scoring method comprises the following steps:
[Algorithm listing for the first scoring method: image in the original publication, not reproduced]
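The first scoring described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patent's own listing: the function name `first_score`, single-token case-insensitive matching, counting every occurrence of a word, and the sample document and word lists are all hypothetical.

```python
# Points per word class, following the 3/2/1 rule in the text.
WEIGHTS = {"search": 3, "expand": 2, "feature": 1}

def first_score(word_list, search_words, expand_words, feature_words):
    """Accumulate Frist_Score over every word in the document's WordList.

    Assumption: each occurrence counts, and matching is case-insensitive
    on single tokens.
    """
    search = {w.lower() for w in search_words}
    expand = {w.lower() for w in expand_words}
    feature = {w.lower() for w in feature_words}
    score = 0
    for word in word_list:
        w = word.lower()
        if w in search:
            score += WEIGHTS["search"]
        elif w in expand:
            score += WEIGHTS["expand"]
        elif w in feature:
            score += WEIGHTS["feature"]
    return score

# Hypothetical document; field names follow the text.
doc = {
    "ChemicalList": ["CDK4"],
    "MeshHeadingList": ["Liposarcoma", "Male"],
    "KeyWordList": ["liposarcoma"],
    "Abstract": "CDK4 amplification in liposarcoma".split(),
}
word_list = [w for field in doc.values() for w in field]
score = first_score(word_list, ["Liposarcoma", "CDK4"], ["Sarcoma"], ["Male"])
print(score)
```

Because every occurrence adds points, a long abstract can inflate this score, which is the effect the secondary scoring is meant to correct.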
(2) Secondary scoring;
After the first scoring, it was observed that some documents receive a high first score only because their chemical word list, medical subject word list or abstract is very long, so that search words occur very often even though the document is irrelevant to the search topic. Such inflated scores interfere with the final retrieval result. Therefore the scores calculated from the chemical word list and the medical subject word list are divided by the lengths of those lists, and the search words appearing in the abstract are scored by word-frequency density; to account for text length, the upper limit of the word-frequency count is set to 4.
The secondary scoring method is as follows:
ChemicalList denotes the list of chemical words of a document, MeshHeadingList its list of medical subject words, Abstract its abstract, KeyWordList its list of keywords, SearchWordList the list of search words, ExpandWordList the list of expansion words, and CharacteristicList the list of feature words.
If a word in ChemicalList is a retrieval task word in SearchWordList, 3 points are added to ChemicalList_Score; if it is in ExpandWordList, 2 points are added; if it is in CharacteristicList, 1 point is added. The document's ChemicalList is traversed and compared against all retrieval task words, and the points are accumulated to give the score of the chemical word list. Dividing this score by the length of the chemical word list gives the average score.
The average scores of the medical subject word list and the keyword list are obtained in the same way.
That is, dividing ChemicalList_Score, MeshHeadingList_Score and KeyWordList_Score by the lengths of ChemicalList, MeshHeadingList and KeyWordList, respectively, yields ChemicalList_Mean_Score, MeshHeadingList_Mean_Score and KeyWordList_Mean_Score.
A counter Len is then defined and initialized to 0; for each of ChemicalList, MeshHeadingList and KeyWordList whose value is greater than 0, Len is increased by 1, so Len is at least 0 and at most 3. The secondary score is Second_Score = (ChemicalList_Mean_Score + MeshHeadingList_Mean_Score + KeyWordList_Mean_Score) / Len, which balances the scores and brings all documents onto the same scale.
The word-frequency score is then added to obtain the final secondary score Second_Score. If the search words in SearchWordList occur more than 4 times, the word-frequency score is 4. Four occurrences are already considered frequent, so 4 points are added in that case and any higher frequency is also counted as 4; this prevents an individual document from receiving an excessively high score, and thereby distorting the final retrieval result, merely because a search word occurs very often.
The secondary scoring method comprises the following steps:
[Algorithm listing for the secondary scoring method: images in the original publication, not reproduced]
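A minimal sketch of the secondary scoring follows, under several assumptions that the text does not settle: Len is taken to count the non-empty lists, the word frequency is counted over the abstract only, and a document with all three lists empty gets a balanced score of 0. The names `list_mean_score`, `second_score` and `task_weights`, and the sample data, are hypothetical.

```python
MAX_TERM_FREQ = 4  # upper limit of the word-frequency score stated in the text
LIST_NAMES = ("ChemicalList", "MeshHeadingList", "KeyWordList")

def list_mean_score(words, task_weights):
    """Weighted hits in one list divided by the list length (0.0 if empty)."""
    if not words:
        return 0.0
    return sum(task_weights.get(w.lower(), 0) for w in words) / len(words)

def second_score(doc, task_weights, search_words):
    """Second_Score = mean of the per-list mean scores + capped word frequency."""
    means = [list_mean_score(doc[name], task_weights) for name in LIST_NAMES]
    # Assumption: Len counts the lists that are non-empty.
    length = sum(1 for name in LIST_NAMES if doc[name])
    balanced = sum(means) / length if length else 0.0
    # Assumption: search-word frequency is counted in the abstract.
    abstract = [w.lower() for w in doc["Abstract"].split()]
    freq = sum(abstract.count(w.lower()) for w in search_words)
    return balanced + min(freq, MAX_TERM_FREQ)

# Hypothetical task words (3 = search, 2 = expansion, 1 = feature) and document.
task_weights = {"liposarcoma": 3, "cdk4": 3, "sarcoma": 2, "male": 1}
doc = {
    "ChemicalList": ["CDK4", "MDM2"],
    "MeshHeadingList": ["Liposarcoma", "Male", "Humans", "Adult"],
    "KeyWordList": [],
    "Abstract": "CDK4 amplification in liposarcoma and CDK4 inhibitors",
}
result = second_score(doc, task_weights, ["Liposarcoma", "CDK4"])
print(result)
```

Dividing by the list length means that a document with two matching terms among two entries scores higher than one with two matches among twenty, which is the balancing effect the text describes.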
(3) Comprehensively scoring;
The composite score is the final ranking score of a document. It raises the score of documents in whose abstract the gene and the disease (search words) co-occur. Co-word analysis uses the co-occurrence of word pairs or noun phrases in a document set to determine the relationships among the topics of the discipline that the document set represents.
Co-word analysis is introduced into the document scoring model: if the disease and the gene co-occur in the abstract of a document, the co-occurrence score γ of the document is increased. Finally, the first score and the secondary score are each multiplied by a weight and added to the co-occurrence score to give the final composite score Composite_Score of the document.
The Composite_Score formula is as follows:
$$\mathrm{Composite\_Score} = \alpha \cdot \mathrm{Frist\_Score} + \beta \cdot \mathrm{Second\_Score} + \gamma$$
where α is the primary scoring weight, β is the secondary scoring weight, and γ is the co-occurrence score.
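As an illustration, the composite score can be sketched as a weighted sum of the two scores plus the co-occurrence score, which is how the description above reads. The weight values, the bonus value and the substring test for co-occurrence are assumptions made for this example; the patent's actual parameter values are in Table 3, which is not reproduced here.

```python
def cooccurrence_score(abstract, disease, gene, bonus):
    """Return the bonus if disease and gene both appear in the abstract, else 0.

    Assumption: co-occurrence is tested by case-insensitive substring match.
    """
    text = abstract.lower()
    return bonus if disease.lower() in text and gene.lower() in text else 0.0

def composite_score(first, second, gamma, alpha, beta):
    """Composite_Score = alpha * Frist_Score + beta * Second_Score + gamma."""
    return alpha * first + beta * second + gamma

# Hypothetical values for alpha, beta and the co-occurrence bonus.
ALPHA, BETA, BONUS = 0.5, 0.5, 2.0
abstract = "CDK4 amplification in liposarcoma and CDK4 inhibitors"
gamma = cooccurrence_score(abstract, "Liposarcoma", "CDK4", BONUS)
total = composite_score(16, 4.25, gamma, ALPHA, BETA)
print(total)
```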
After the composite scores of the documents are obtained, the documents are clustered by spectral clustering, and the group of documents with the highest score is output as the result of the retrieval model.
The bag-of-words model was originally used in the field of information retrieval. For a document, it ignores word order and syntax and considers only whether each word appears in the document. The bag-of-words model is an effective way to turn a document into a word vector, which makes it convenient to calculate the similarity between documents and to cluster them. The screened documents are clustered with the vector distance matrix as the input of spectral clustering.
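A small sketch of how binary bag-of-words vectors and their distance matrix might be built. The text does not say which distance is used, so Euclidean distance is an assumption here, as are the function names and the three toy documents.

```python
import math

def bag_of_words_vectors(docs):
    """Map each document (a list of words) to a 0/1 vector over the vocabulary."""
    vocab = sorted({w.lower() for words in docs for w in words})
    vectors = []
    for words in docs:
        present = {w.lower() for w in words}
        # Only presence matters; word order and counts are ignored.
        vectors.append([1 if term in present else 0 for term in vocab])
    return vocab, vectors

def distance_matrix(vectors):
    """Pairwise Euclidean distances (an assumed choice of metric)."""
    return [[math.dist(a, b) for b in vectors] for a in vectors]

# Hypothetical word lists of three documents.
docs = [
    ["CDK4", "Liposarcoma"],
    ["CDK4", "Liposarcoma", "MDM2"],
    ["GERD", "Esophagitis"],
]
vocab, vectors = bag_of_words_vectors(docs)
dist = distance_matrix(vectors)
print(vocab)
print(dist[0])
```

The first two documents differ in a single term and end up close together, while the third shares no term with them and lies farther away.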
The spectral clustering algorithm is as follows:
Input: n sample points $x = \{x_1, x_2, \ldots, x_n\}$ and the number of clusters $k$;
Output: clusters $A_1, A_2, \ldots, A_k$.
The first step: calculate the $n \times n$ similarity matrix $W$, whose entries are the pairwise similarities $s_{ij}$;
[Formula defining $s_{ij}$: image in the original publication, not reproduced]
The second step: calculate the degree matrix $D$ using the following formula;
$$d_i = \sum_{j=1}^{n} w_{ij}, \qquad D = \mathrm{diag}(d_1, d_2, \ldots, d_n)$$
that is, $d_i$ is the sum of the elements in row $i$ of the similarity matrix $W$, and the $d_i$ form an $n \times n$ diagonal matrix.
The third step: calculate the Laplacian matrix $L = D - W$;
The fourth step: calculate the eigenvalues of $L$, sort them in ascending order, and take the eigenvectors $u_1, u_2, \ldots, u_k$ of the first $k$ eigenvalues;
The fifth step: form the matrix $U = [u_1, u_2, \ldots, u_k]$, $U \in \mathbb{R}^{n \times k}$, from these $k$ column vectors;
The sixth step: let $y_i \in \mathbb{R}^{k}$ be the vector given by row $i$ of $U$, where $i = 1, 2, \ldots, n$;
The seventh step: cluster the new sample points $Y = \{y_1, y_2, \ldots, y_n\}$ into clusters $C_1, C_2, \ldots, C_k$ using the k-means algorithm;
The eighth step: output the clusters $A_1, A_2, \ldots, A_k$, where $A_i = \{\, j \mid y_j \in C_i \,\}$.
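The eight steps above can be sketched with NumPy as follows. Two parts are assumptions rather than the patent's method: the similarity is taken to be a Gaussian kernel of the input distance (the original similarity formula is not available), and the k-means step is a plain Lloyd iteration with a deterministic farthest-point initialization written out here to keep the example self-contained. The function name and the `sigma` parameter are hypothetical.

```python
import numpy as np

def spectral_clustering(dist, k, sigma=1.0, n_iter=50):
    """Unnormalized spectral clustering on a distance matrix (sketch)."""
    dist = np.asarray(dist, dtype=float)
    # Step 1: similarity matrix W (Gaussian kernel is an assumption).
    W = np.exp(-dist ** 2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Steps 2-3: degree matrix D and Laplacian L = D - W.
    D = np.diag(W.sum(axis=1))
    L = D - W
    # Steps 4-6: eigh returns eigenvalues in ascending order for symmetric L;
    # the rows of the first k eigenvectors are the new points y_i.
    _, eigvecs = np.linalg.eigh(L)
    Y = eigvecs[:, :k]
    # Step 7: k-means with farthest-point initialization.
    centers = [Y[0]]
    for _ in range(1, k):
        d2 = np.min([((Y - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Y[int(np.argmax(d2))])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2_all = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(d2_all, axis=1)
        new_centers = np.array([
            Y[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Step 8: labels[j] == i means sample j belongs to cluster A_i.
    return labels

# Two well-separated groups of points on a line (hypothetical data).
points = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
dist = np.abs(points[:, None] - points[None, :])
labels = spectral_clustering(dist, k=2)
print(labels)
```

In practice a library routine for k-means could replace the hand-written loop; it is spelled out here only so that the sketch depends on nothing beyond NumPy.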
The number of classes into which the documents are divided strongly affects the final ranking, so the upper limit of the number of clusters is set to 3. The screened documents are first divided into 2 classes with the spectral clustering algorithm, and the average composite score of the higher-scoring class is compared with the average composite score of the initial sample:
if the difference is small (< θ), the sample is relatively homogeneous, and the results are output directly in descending order of composite score;
if the difference is large (> η), the sample is clearly heterogeneous and 2 classes already separate the better group well, so the higher-scoring group of the 2 classes is output;
if the difference is moderate (in [θ, η]), the spectral clustering algorithm is applied again to divide the documents into 3 classes; if the highest average composite score among the 3 classes is higher than the highest average composite score among the 2 classes, the highest-scoring group of the 3 classes is output, and otherwise the higher-scoring group of the 2 classes is output.
The average composite score of the class with the highest average composite score among the classes is defined as follows:
$$\mathrm{MMS}(x), \quad x \in \{\mathrm{one}, \mathrm{two}, \mathrm{three}\} \qquad (4)$$
where one, two and three denote the number of classes, and MMS(two) is the average composite score of the higher-scoring group when the documents are divided into 2 classes.
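The classification rules are described as follows:
$$
\mathrm{Category\_Number} =
\begin{cases}
1, & \mathrm{MMS}(\mathrm{two}) - \mathrm{MMS}(\mathrm{one}) < \theta \\
2, & \mathrm{MMS}(\mathrm{two}) - \mathrm{MMS}(\mathrm{one}) > \eta \\
2, & \mathrm{MMS}(\mathrm{two}) - \mathrm{MMS}(\mathrm{one}) \in [\theta, \eta] \text{ and } \mathrm{MMS}(\mathrm{three}) \le \mathrm{MMS}(\mathrm{two}) \\
3, & \mathrm{MMS}(\mathrm{two}) - \mathrm{MMS}(\mathrm{one}) \in [\theta, \eta] \text{ and } \mathrm{MMS}(\mathrm{three}) > \mathrm{MMS}(\mathrm{two})
\end{cases}
$$
where Category_Number is the number of classes, MMS(one) is the average composite score of the initial, unclustered sample, θ is the lower limit of the classification threshold, and η is the upper limit of the classification threshold.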
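The decision between 1, 2 and 3 classes described above can be sketched as a small function. It assumes the cluster labels for the 2-class and 3-class splits have already been computed, and it treats the average composite score of the unclustered sample as the baseline; the function names, the sample scores and the threshold values are hypothetical.

```python
from statistics import mean

def best_cluster_mean(scores, labels):
    """Highest mean composite score over the clusters given by labels."""
    groups = {}
    for score, label in zip(scores, labels):
        groups.setdefault(label, []).append(score)
    return max(mean(group) for group in groups.values())

def choose_category_number(scores, labels_two, labels_three, theta, eta):
    """Return 1, 2 or 3 following the threshold rule in the text."""
    baseline = mean(scores)                      # initial sample average
    mms_two = best_cluster_mean(scores, labels_two)
    gap = mms_two - baseline
    if gap < theta:
        return 1                                 # homogeneous: no split
    if gap > eta:
        return 2                                 # clear split into 2 classes
    mms_three = best_cluster_mean(scores, labels_three)
    return 3 if mms_three > mms_two else 2

# Hypothetical composite scores and cluster labels.
scores = [10, 9, 8, 3, 2, 1]
labels_two = [0, 0, 0, 1, 1, 1]
labels_three = [0, 0, 1, 2, 2, 2]
print(choose_category_number(scores, labels_two, labels_three, theta=1, eta=5))
```

Here the baseline is 5.5 and the better of the two classes averages 9, so the gap of 3.5 falls between the thresholds and the 3-class split (best average 9.5) is preferred.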
As shown in fig. 2, the medical data information retrieval method based on co-occurrence analysis and spectral clustering according to the present invention specifically comprises:
First, query expansion is applied and the retrieval task vocabulary is divided into search words, expansion words and feature words. The documents are then given their first score, and a first retrieval result is screened out with the threshold T: a document is selected if its first score is greater than T and discarded if its first score is less than T. The secondary score and the composite score of the selected documents are calculated next, and the documents are ranked by composite score. A vector describing each document is then formed with the bag-of-words model from the document's chemical word list, medical subject word list and keyword list so that the similarity between documents can be calculated, and the screened documents are clustered with the vector distance matrix as the input of spectral clustering. Finally, the class with the highest average composite score is output as the retrieval result, with its documents sorted in descending order of composite score.
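The screening by the threshold T, taken as the third quartile of the distinct first scores, might look like the sketch below. The quartile convention (`method="inclusive"` in Python's `statistics.quantiles`) is an assumption, since the text does not define how the quartile is computed; the function name and the sample scores are hypothetical.

```python
import statistics

def select_by_threshold(first_scores):
    """Keep documents whose first score exceeds T.

    T is the third quartile of the first scores after duplicates are removed.
    At least two distinct scores are required by statistics.quantiles.
    """
    distinct = sorted(set(first_scores.values()))
    # quantiles(..., n=4) returns the three cut points [Q1, Q2, Q3].
    t = statistics.quantiles(distinct, n=4, method="inclusive")[2]
    selected = {pmid: s for pmid, s in first_scores.items() if s > t}
    return t, selected

# Hypothetical first scores keyed by document ID.
first_scores = {"doc1": 2, "doc2": 2, "doc3": 5, "doc4": 9, "doc5": 16, "doc6": 16}
t, selected = select_by_threshold(first_scores)
print(t, sorted(selected))
```

Removing duplicates before taking the quartile keeps a large number of identical low scores from dragging the threshold down.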
All experimental data come from the medical literature of the TREC 2017 Precision Medicine task. Each document is expressed in XML format and has a unique ID number, the PMID. Because XML is semi-structured, MongoDB is used as the document database, and Python is used as the programming language. The algorithm parameter settings are shown in Table 3.
Table 3 Algorithm parameter settings
Figure BDA0003541648430000112
The tasks recommended by TREC 2017 are used. For each task, the 25 documents with the highest scores are taken as the final output, and the top 10 of these are analyzed. Taking retrieval task 4 as an example, FIG. 3 shows the document ranking after comprehensive scoring, before and after spectral clustering.
In FIG. 3, PMID is the unique identification number of a medical document, and the labels 0, 1, 2 and N indicate that the document is irrelevant, partially relevant, relevant, and outside the query scope of the retrieval task, respectively. FIG. 3 shows that the output after spectral clustering is greatly improved compared with the output before clustering, which indicates that relevant documents are separated much more clearly after classification. Partial document keyword pairs before and after clustering are shown in FIG. 4, and the word pairs before and after clustering are shown in FIG. 5.
The method is compared with methods from the literature that were evaluated on the TREC 2017 Precision Medicine task. Three metrics (infNDCG, R-prec and P@10) are selected for comparison, and the experimental results are shown in Table 4.
Table 4 Experimental comparison results
Comparison method       | infNDCG | R-prec  | P@10
cbnuSA                  | 0.3218  | 0.2287  | 0.4614
WSU-IR                  | 0.3853  | 0.2682  | 0.5937
SIBTMlit                | 0.4180  | 0.2690  | 0.5500
INTGR                   | 0.4021  | 0.2739  | 0.6010
UTDHLTFF                | 0.4593  | 0.2987  | 0.6172
BMP                     | 0.4929  | 0.3225  | 0.6323
The method of the invention | 0.5100* | 0.3653* | 0.6467*
Here, infNDCG (inferred normalized discounted cumulative gain) estimates NDCG (normalized discounted cumulative gain) from graded relevance judgments with missing values, using a sampling technique. NDCG is obtained by normalizing the discounted cumulative gain (DCG), a measure in which the relevance gain of each document is discounted according to its position in the ranked list and then accumulated.
P@10 is the proportion of relevant documents among the top 10 results.
R-prec is the precision at rank R, where R is the number of documents relevant to the query.
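The three metrics can be illustrated with a short Python sketch. It assumes graded relevance labels (0, 1 or 2) listed in ranked order and treats any label above 0 as relevant. It computes plain NDCG over the given list only; the sampling-based inference that distinguishes infNDCG is not reproduced, and the function names are hypothetical.

```python
import math


def precision_at_k(relevance, k=10):
    """Proportion of relevant documents among the top k results."""
    return sum(1 for r in relevance[:k] if r > 0) / k


def r_precision(relevance, num_relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    if num_relevant == 0:
        return 0.0
    top_r = relevance[:num_relevant]
    return sum(1 for r in top_r if r > 0) / num_relevant


def dcg(relevance):
    """Discounted cumulative gain: gain at rank i is divided by log2(i + 1)."""
    return sum(r / math.log2(rank + 2) for rank, r in enumerate(relevance))


def ndcg(relevance):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal > 0 else 0.0


# Hypothetical ranked relevance labels for one query
ranked = [2, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(precision_at_k(ranked, k=10), r_precision(ranked, 2), ndcg(ranked))
```

In a full evaluation the ideal ordering would be built from all judged documents for the query rather than only the retrieved list, so the values here are a simplification.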
The present applicant has described and illustrated embodiments of the present invention in detail with reference to the accompanying drawings, but it should be understood by those skilled in the art that the above embodiments are merely preferred embodiments of the present invention, and the detailed description is only for the purpose of helping the reader to better understand the spirit of the present invention, and not for limiting the scope of the present invention, and on the contrary, any improvement or modification made based on the spirit of the present invention should fall within the scope of the present invention.

Claims (4)

1. A medical data information retrieval method based on co-occurrence analysis and spectral clustering, the method comprising the steps of:
(1) Carrying out query expansion on the retrieval task vocabulary, classifying the expanded retrieval task vocabulary into retrieval words, expansion words and feature words, scoring the documents for the first time, selecting the documents if the score for the first time is greater than a threshold value T, and abandoning the documents if the score for the first time is less than the threshold value T;
(2) Performing secondary scoring and co-occurrence analysis on the selected documents to obtain secondary scores and co-occurrence scores, and calculating the comprehensive scores of the documents through the primary scores, the secondary scores and the co-occurrence scores;
the secondary scoring method comprises the steps of calculating the average scores of a chemical word list, a medical subject word list and a keyword list;
traversing the retrieval task vocabulary contained in the document chemical word list, and if the retrieval task vocabulary is a retrieval word, adding 3 points to the chemical word list; if the word is an expansion word, adding 2 points to the chemical word list; if the words are the feature words, adding 1 to the scores of the chemical word list; accumulating to obtain a chemical word list score; dividing the chemical word list score by the length of the chemical word list to obtain an average score of the chemical word list;
the secondary score Second_Score is obtained by averaging the average scores of the chemical word list, the medical subject word list and the keyword list and adding a word frequency score; the word frequency score is the number of occurrences of the retrieval word and is at most 4;
wherein the composite score Composite_Score is:
Figure FDA0003928880630000011
wherein Frist_Score is the primary score, α is the primary score weight, Second_Score is the secondary score, β is the secondary score weight, and γ is the co-occurrence score; if the retrieval words co-occur in the document abstract, the co-occurrence score is increased;
(3) Forming a vector describing each document by applying a bag-of-words model to the chemical word list, medical subject word list, abstract and keyword list of the document, clustering the selected documents by using the vector distance matrix as the input of spectral clustering, and outputting the clusters;
the classification rule of spectral clustering is as follows:
Figure FDA0003928880630000012
wherein Category_Number is the number of classes, theta is the lower limit of the classification threshold, and eta is the upper limit of the classification threshold;
the average composite score MMS(x) of the class with the highest average composite score among the classes is:
MMS(x), x ∈ {One, Two, Three}
wherein One, Two and Three denote the number of classes, and MMS(Two) denotes the average composite score of the higher-scoring class when the documents are divided into 2 classes;
the classification process of spectral clustering comprises the following steps:
(3.1) dividing the selected documents into 2 classes by using a spectral clustering algorithm; if the average comprehensive score of the class with the higher average comprehensive score differs from the average comprehensive score of the initial sample by < theta, indicating that the sample is relatively stable, directly outputting the results in descending order of comprehensive score;
(3.2) if the average comprehensive score of the class with the higher average comprehensive score differs from the average comprehensive score of the initial sample by > eta, indicating that the sample difference is obvious, outputting the class with the higher average comprehensive score among the 2 classes;
(3.3) if the average comprehensive score of the class with the higher average comprehensive score differs from the average comprehensive score of the initial sample by an amount within [theta, eta], further dividing the selected documents into 3 classes by using the spectral clustering algorithm; wherein theta is the lower classification threshold and eta is the upper classification threshold;
(3.4) if the highest average composite score among the 3 classes is higher than the highest average composite score among the 2 classes, outputting the class with the highest average composite score among the 3 classes; otherwise, outputting the class with the higher average composite score among the 2 classes;
(4) And outputting the class with the highest average comprehensive score as a retrieval result, and sorting and outputting the documents in the retrieval result according to the comprehensive score in a descending order.
2. The medical data information retrieval method based on co-occurrence analysis and spectral clustering as claimed in claim 1, wherein in step (1), the first scoring is performed by scoring the document over its chemical word list, medical subject word list, abstract and keyword list to obtain the first score Frist_Score of the document;
wherein retrieval words, expansion words and feature words are worth 3 points, 2 points and 1 point, respectively;
traversing the retrieval task words contained in the chemical word list, the medical subject word list, the abstract and the keyword list of the document; if the word is a retrieval word, adding 3 points to the first score Frist_Score; if the word is an expansion word, adding 2 points to the first score Frist_Score; if the word is a feature word, adding 1 point to the first score Frist_Score; the first score Frist_Score is obtained by accumulation.
3. The medical data information retrieval method based on co-occurrence analysis and spectral clustering according to claim 1, wherein in the step (1), the threshold T is a third quartile after the first score is de-duplicated.
4. The medical data information retrieval method based on co-occurrence analysis and spectral clustering as claimed in claim 1, wherein in the step (1), the MeSH database is selected for the query expansion of the retrieval task vocabulary.
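For illustration only, and without limiting the claims, the point-based scoring of steps (1) and (2) and the threshold of claim 3 may be sketched in Python as follows. The function names, the `term_types` mapping and the sample words are hypothetical, and the quartile is computed with the inclusive interpolation method of Python's `statistics.quantiles`, which is an assumption since the claims do not specify a quartile convention.

```python
from statistics import quantiles

# Points per word type, as stated in the claims
POINTS = {"retrieval": 3, "expansion": 2, "feature": 1}


def list_points(words, term_types):
    """Accumulated points of the task words found in one word list."""
    return sum(POINTS[term_types[w]] for w in words if w in term_types)


def first_score(fields, term_types):
    """First score: points accumulated over all fields of a document."""
    return sum(list_points(words, term_types) for words in fields)


def list_average(words, term_types):
    """List score divided by the list length (0.0 for an empty list)."""
    return list_points(words, term_types) / len(words) if words else 0.0


def second_score(chemical, mesh, keywords, term_types, retrieval_word_frequency):
    """Mean of the three list averages plus a word frequency score capped at 4."""
    averages = [list_average(ws, term_types) for ws in (chemical, mesh, keywords)]
    return sum(averages) / 3 + min(retrieval_word_frequency, 4)


def threshold_t(first_scores):
    """Third quartile of the de-duplicated first scores."""
    unique = sorted(set(first_scores))
    if not unique:
        raise ValueError("no scores given")
    if len(unique) < 2:
        return unique[0]
    return quantiles(unique, n=4, method="inclusive")[2]


# Hypothetical task vocabulary
term_types = {"egfr": "retrieval", "erbb1": "expansion", "mutation": "feature"}
print(first_score([["egfr", "other"], ["erbb1", "mutation"]], term_types))
```

Documents whose first score exceeds `threshold_t(...)` would then be kept for secondary scoring.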
CN202210234485.5A 2022-03-10 2022-03-10 Medical data information retrieval method based on co-occurrence analysis and spectral clustering Active CN114691826B (en)

Publications (2)

Publication Number Publication Date
CN114691826A CN114691826A (en) 2022-07-01
CN114691826B true CN114691826B (en) 2022-12-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant