CN112270178A

CN112270178A - Medical literature cluster theme determination method and device, electronic equipment and storage medium

Info

Publication number: CN112270178A
Application number: CN202011152154.4A
Authority: CN
Inventors: 柴玲
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2020-10-23
Filing date: 2020-10-23
Publication date: 2021-01-26
Anticipated expiration: 2040-10-23
Also published as: CN112270178B; WO2021189920A1

Abstract

The application relates to the technical field of medical science and technology, and particularly discloses a method and a device for determining a topic of a medical literature cluster, electronic equipment and a storage medium. The method comprises the following steps: clustering a plurality of medical documents to obtain at least one medical document cluster; determining a target medical document in each of the at least one medical document cluster; determining a candidate phrase set corresponding to each medical document cluster; and determining a theme corresponding to each medical document cluster according to the target medical document in each medical document cluster and the candidate phrase set corresponding to each medical document cluster.

Description

Medical literature cluster theme determination method and device, electronic equipment and storage medium

Technical Field

The application relates to the technical field of text recognition, in particular to a method and a device for determining a topic of a medical literature cluster, electronic equipment and a storage medium.

Background

The PUBMED public medical database contains a large number of medical documents. These documents often reflect the development trend of the research directions in a given medical field, and researchers in that field can improve the efficiency and precision of their decision making by reading them.

To improve the efficiency of searching and reading medical documents, natural language processing techniques can be used to mine the correlations among medical documents, and the documents can then be clustered based on these correlations to obtain a plurality of medical document clusters. A large number of medical documents is thereby divided into clusters, and a reader can find the desired cluster among them according to the topic of each medical document cluster.

At present, because medical documents are highly specialized, experts are often required to manually label a topic for each medical document cluster after clustering, so labeling the topics of medical documents is costly and inefficient.

Disclosure of Invention

The embodiments of the application provide a method and a device for determining a topic of a medical literature cluster, electronic equipment and a storage medium, which improve the efficiency and precision of labeling the topics of medical literature clusters.

In a first aspect, an embodiment of the present application provides a method for determining a topic of a medical document cluster, including:

clustering a plurality of medical documents to obtain at least one medical document cluster;

determining a target medical document in each of the at least one medical document cluster;

determining a candidate phrase set corresponding to each medical document cluster;

and determining a theme corresponding to each medical document cluster according to the target medical document in each medical document cluster and the candidate phrase set corresponding to each medical document cluster.

In a second aspect, an embodiment of the present application provides an apparatus for determining a topic of a medical document cluster, including:

an acquisition unit for acquiring a plurality of medical documents;

the processing unit is used for clustering the plurality of medical documents to obtain at least one medical document cluster;

the processing unit is further configured to determine a target medical document in each of the at least one medical document cluster;

the processing unit is further configured to determine a candidate phrase set corresponding to each medical document cluster;

the processing unit is further configured to determine a topic corresponding to each medical document cluster according to the target medical document in each medical document cluster and the candidate phrase set corresponding to each medical document cluster.

In a third aspect, an embodiment of the present application provides an electronic device, including: a processor coupled to a memory, the memory configured to store a computer program, the processor configured to execute the computer program stored in the memory to cause the electronic device to perform the method of the first aspect.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, which stores a computer program, where the computer program makes a computer execute the method according to the first aspect.

In a fifth aspect, embodiments of the present application provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to perform the method according to the first aspect.

The embodiment of the application has the following beneficial effects:

it can be seen that, in the embodiment of the application, a target medical document and a candidate phrase set are determined from each document cluster, and then a topic corresponding to each medical document cluster is determined according to the target medical document and the candidate phrase set corresponding to each document cluster, so that the topic of the medical document cluster does not need to be manually marked, and the marking efficiency and the marking accuracy of the topic of the medical document cluster are improved.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.

Fig. 1 is a schematic flowchart of a method for determining a topic of a medical literature cluster according to an embodiment of the present application;

Fig. 2 is a schematic diagram of a directed graph provided in an embodiment of the present application;

Fig. 3 is a schematic flowchart of a process for determining a score of a medical document according to an embodiment of the present application;

Fig. 4 is a block diagram of functional units of a topic determination apparatus for medical document clusters provided in an embodiment of the present application;

Fig. 5 is a schematic structural diagram of a topic determination apparatus for a medical document cluster according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.

Reference herein to "an embodiment" means that a particular feature, result, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

Referring to fig. 1, fig. 1 is a schematic flowchart of a method for determining a topic of a medical literature cluster according to an embodiment of the present application. The method comprises the following steps:

101: clustering a plurality of medical documents to obtain at least one medical document cluster.

The medical documents may be medical documents related to a disease in a PUBMED database, for example, the medical documents may be medical documents related to lung cancer, stomach cancer and tumor.

For example, the medical documents may be clustered according to the similarity between their topics to obtain the at least one medical document cluster, i.e., medical documents with similar topics are grouped into the same document cluster. For example, the plurality of medical documents may be clustered according to the co-citations among them and the semantic similarity between their topics. The clustering algorithm may be the K-means clustering algorithm, the expectation-maximization (EM) clustering algorithm, the hierarchical agglomerative clustering (HAC) algorithm, etc.
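The clustering step can be sketched with a small K-means routine, one of the algorithms named above. This is only an illustration under stated assumptions: the document vectors are hypothetical 2-D points standing in for whatever co-citation and topic features are actually used, and the farthest-point initialisation is a choice made here to keep the example deterministic, not something the text specifies.

```python
import math


def kmeans(vectors, k, iterations=20):
    """Cluster equal-length vectors into k groups; returns one label per vector."""
    # Assumed initialisation: start from the first vector, then repeatedly
    # add the vector farthest from the centroids chosen so far.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors, key=lambda v: min(math.dist(v, c) for c in centroids)
        )
        centroids.append(list(farthest))

    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical document vectors forming two obvious groups.
doc_vectors = [(0.0, 0.1), (0.1, 0.0), (0.9, 1.0), (1.0, 0.9)]
labels = kmeans(doc_vectors, k=2)
```

Documents that share a label form one medical document cluster.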

102: determining a target medical document in each of the at least one medical document cluster.

Illustratively, the citation relationships among the medical documents in each medical document cluster are acquired, and the score of each medical document in the cluster is determined according to these citation relationships. The score of a medical document expresses its degree of importance, i.e., its quality. The target documents in each medical document cluster are then determined in descending order of score. Illustratively, a preset proportion of the documents in each medical document cluster can be selected as the target documents in descending order of score. For example, if a medical document cluster contains 100 medical documents and the preset proportion is 10%, the ten highest-scoring medical documents among the 100 are selected as the target medical documents of that cluster.
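The selection by preset proportion can be sketched as follows. The function name, the rounding rule and the minimum of one document are assumptions made for illustration; the text only states that a preset proportion is taken in descending order of score.

```python
def select_target_documents(scores, proportion):
    """Return the ids of the top `proportion` of documents, highest score first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Assumption: round to the nearest whole document and keep at least one.
    count = max(1, round(len(ranked) * proportion))
    return ranked[:count]


# Hypothetical scores for a cluster of 100 documents.
scores = {f"doc{i}": float(i) for i in range(100)}
targets = select_target_documents(scores, 0.10)
```

With 100 documents and a proportion of 10%, ten documents are returned, matching the example in the text.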

Specifically, a directed graph corresponding to each medical document cluster is determined according to the citation relationships among the medical documents in the cluster, and the score of each node in the directed graph is determined according to the directed graph and the PageRank algorithm. This yields the score of each medical document in the cluster; that is, the score of a medical document is determined according to the paths between it and the other medical documents. In addition, an adjacency matrix corresponding to the medical document cluster can be determined from the directed graph, and the score of each medical document in the cluster can be determined according to the adjacency matrix.

For example, suppose the medical document cluster includes medical document A, medical document B and medical document C, where medical document B cites medical document A and medical document C cites medical document B; a directed graph as shown in fig. 2 can then be established. According to the PageRank algorithm and the directed graph, the scores corresponding to medical document A, medical document B and medical document C can be determined respectively. In the case where the score of each medical document is determined by the adjacency matrix, the score corresponding to medical document A is the sum of the score between medical documents A and B and the score between medical documents A and C. The score corresponding to medical document A can be represented by formula (1):

$S = 1 \cdot \gamma + 1 \cdot \gamma^{2}$    formula (1);

where $S$ is the score corresponding to medical document A, $1 \cdot \gamma$ represents the score between medical document A and medical document B, $1 \cdot \gamma^{2}$ represents the score between medical document A and medical document C, and $\gamma$ is a preset hyperparameter with $0 < \gamma < 1$.
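The path-based score of formula (1) can be reproduced with a short sketch. It assumes that each document reaching the target through a chain of citations contributes γ raised to the length of the shortest such chain; the text gives only the three-document example, so this generalisation and the function name are assumptions.

```python
from collections import deque


def path_decay_score(cites, target, gamma):
    """Sum gamma**d over every document that reaches `target` in d citation hops."""
    # Reverse the citation edges so we can walk from the cited document
    # back to the documents that cite it.
    cited_by = {}
    for source, cited_docs in cites.items():
        for cited in cited_docs:
            cited_by.setdefault(cited, []).append(source)

    # Breadth-first search gives the shortest hop count to each citing document.
    distance = {target: 0}
    queue = deque([target])
    while queue:
        doc = queue.popleft()
        for citer in cited_by.get(doc, []):
            if citer not in distance:
                distance[citer] = distance[doc] + 1
                queue.append(citer)

    return sum(gamma ** d for doc, d in distance.items() if doc != target)


# B cites A and C cites B, as in the example of fig. 2.
cites = {"A": [], "B": ["A"], "C": ["B"]}
score_a = path_decay_score(cites, "A", gamma=0.5)  # gamma + gamma**2
```

For γ = 0.5 the score of document A is 0.5 + 0.25, which agrees with formula (1).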

Illustratively, the score of each medical document may also be determined by combining the pagerank algorithm with the adjacency matrix, and this way of combining the two to determine the score of each medical document is described in detail below.

103: determining a set of candidate phrases corresponding to each of the medical document clusters.

Illustratively, the phrases corresponding to the medical documents in each medical document cluster are determined according to the titles and abstracts of those documents; that is, keywords are extracted from the titles and abstracts to obtain the phrases corresponding to each medical document. For example, the phrases of each medical document can be annotated by the Stanford NLP language processing toolkit. The phrases corresponding to all medical documents in a cluster are then combined into a first phrase set, and the first phrase set is screened to obtain the candidate phrase set corresponding to the cluster.

Illustratively, medical documents often contain abbreviations, so the abbreviated phrases in the first phrase set may be mapped to their full names to obtain a second phrase set. For example, the abbreviated phrases appearing in the first phrase set can be detected by the abbreviation detection algorithm in the scispacy toolkit and mapped to full names, e.g., "NSCLC" can be mapped to "Non-small cell lung cancer".

Further, the phrases in the second phrase set are cleaned; for example, phrases in the second phrase set that contain only one word may be deleted to obtain a third phrase set. A phrase consisting of a single word is likely to be general vocabulary in the medical field that is meaningless for topic determination, because such a word does not reflect the intrinsic characteristics of any particular medical document. In addition, the semantics carried by a one-word phrase are limited and can hardly express the characteristics of a medical document, so such phrases need to be removed from the second phrase set.

Further, phrases with the same semantics in the third phrase set are determined and replaced with a standardized phrase to obtain a fourth phrase set, and the fourth phrase set is used as the candidate phrase set corresponding to each medical document cluster. That is, phrases with the same semantics are replaced by the one standardized phrase corresponding to that semantics. For example, the phrase "lung cancer survival rate" and the phrase "survival rate of lung cancer" have the same meaning. If the standardized phrase corresponding to this meaning is "lung cancer survival rate", both phrases are replaced with "lung cancer survival rate", so that one standardized phrase replaces the original two. The standardized phrase corresponding to each semantics can be set manually in advance.

In practical applications, of course, when the phrases with the same semantic meaning appear in the third phrase set, one of the phrases with the same semantic meaning may be randomly reserved, and the other phrases may be deleted from the third phrase set to obtain the fourth phrase set, so that each semantic meaning in the fourth phrase set may only correspond to one phrase.

It can be seen that the phrases with the same semantics are replaced or deleted, so that the phrases with the same semantics can be prevented from appearing in the candidate phrase set, the situation that the topics of the medical document cluster are represented by the phrases with the same semantics is avoided, and the semantic richness of the topics of the medical document cluster is improved.
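The screening of the first phrase set can be sketched as a three-stage filter. The abbreviation table and the table of standardized phrases are hypothetical stand-ins for the abbreviation detector and the manually prepared mapping mentioned above, and the one-word phrase "cancer" is an invented example.

```python
def build_candidate_phrases(first_set, abbreviations, standard_forms):
    """Apply abbreviation expansion, one-word removal and normalisation in turn."""
    # Second phrase set: abbreviations mapped to their full names.
    second_set = [abbreviations.get(phrase, phrase) for phrase in first_set]
    # Third phrase set: phrases made of a single word are dropped.
    third_set = [phrase for phrase in second_set if len(phrase.split()) > 1]
    # Fourth phrase set: phrases with the same meaning share one standard form.
    fourth_set = [standard_forms.get(phrase.lower(), phrase) for phrase in third_set]
    # Remove duplicates while keeping the original order.
    return list(dict.fromkeys(fourth_set))


# Hypothetical lookup tables for illustration only.
abbreviations = {"NSCLC": "Non-small cell lung cancer"}
standard_forms = {"survival rate of lung cancer": "lung cancer survival rate"}

first_set = [
    "NSCLC",
    "cancer",
    "lung cancer survival rate",
    "survival rate of lung cancer",
]
candidates = build_candidate_phrases(first_set, abbreviations, standard_forms)
```

After screening, each meaning is represented by a single phrase in the candidate phrase set.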

104: determining a topic corresponding to each medical document cluster according to the target medical document in each medical document cluster and the candidate phrase set corresponding to each medical document cluster.

Illustratively, word embedding may be performed on the title of the target medical document in each medical document cluster to obtain a first feature vector corresponding to each medical document cluster. The word embedding can be implemented by a trained BioBERT model. The BioBERT model is trained with medical documents from the medical field as the training corpus, so it processes medical-domain language more accurately and can accurately extract the semantics of medical documents. The BioBERT model can be trained in a supervised manner, which is not described in detail here.

It should be understood that, in the case that the number of the target medical documents is one, the feature vector obtained by word embedding the title of the target medical documents is used as the first feature vector; under the condition that the number of the target medical documents is multiple, word embedding can be carried out on the titles of each piece of target medical documents to obtain the feature vector corresponding to each piece of target medical documents, and then the first feature vector is obtained after the feature vectors corresponding to the pieces of target medical documents are averaged according to positions.

For example, if the target medical documents include medical document A and medical document B, whose feature vectors are [0.1, 0.3, 0.5, 0.7] and [0.3, 0.3, 0.7, 0.9] respectively, position-wise averaging gives a first feature vector of [0.2, 0.3, 0.6, 0.8].
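The position-wise averaging in this example can be checked with a few lines of code. The function name is chosen here for illustration; the input vectors are the ones given in the text.

```python
def average_vectors(vectors):
    """Average equal-length feature vectors position by position."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Feature vectors of the two target documents from the example.
vector_a = [0.1, 0.3, 0.5, 0.7]
vector_b = [0.3, 0.3, 0.7, 0.9]
first_feature_vector = average_vectors([vector_a, vector_b])
```

The result is approximately [0.2, 0.3, 0.6, 0.8], up to floating-point rounding. With a single target document the function returns that document's vector unchanged.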

Further, word embedding is performed on each phrase in the candidate phrase set to obtain a second feature vector of each phrase; this can also be implemented by the BioBERT model and is not described again. Then, word embedding is performed on each word of each phrase in the candidate phrase set to obtain a third feature vector corresponding to each word, and a fourth feature vector corresponding to each phrase is determined from the third feature vectors of its words: the third feature vectors of the words in a phrase are averaged position-wise, and the resulting vector is used as the fourth feature vector of the phrase. For example, the four words in the phrase "lung cancer survival rate" are each word-embedded to obtain four feature vectors, and the four feature vectors are averaged position-wise to obtain the fourth feature vector corresponding to the phrase.

Further, the term frequency-inverse document frequency (TF-IDF) of each phrase in the candidate phrase set is determined. The TF-IDF of a phrase is the product of its term frequency (TF) and its inverse document frequency (IDF). Illustratively, for each phrase in the candidate phrase set corresponding to a medical document cluster, the number of times the phrase appears in the medical document cluster is obtained, and the ratio of this number to the total number of medical documents in the cluster is used as the TF of the phrase. Thus, the TF of each phrase in the candidate phrase set can be represented by formula (2):

$TF_{phr} = \dfrac{D_{contain\text{-}phr}}{D_{cluster}}$    formula (2);

where $TF_{phr}$ is the term frequency of the phrase, $D_{contain\text{-}phr}$ is the number of occurrences of the phrase in the medical document cluster, and $D_{cluster}$ is the total number of medical documents in the medical document cluster.
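The term frequency described here, occurrences of the phrase in the cluster divided by the number of documents in the cluster, can be sketched as below. Counting case-insensitive, non-overlapping substring matches is an assumption; the text does not say how occurrences are matched, and the sample documents are invented.

```python
def term_frequency(phrase, documents):
    """Occurrences of `phrase` across the cluster divided by the cluster size."""
    needle = phrase.lower()
    # Assumption: an occurrence is a case-insensitive substring match.
    occurrences = sum(text.lower().count(needle) for text in documents)
    return occurrences / len(documents)


# Hypothetical cluster of four documents (title plus abstract as one string).
cluster_documents = [
    "Lung cancer survival rate in early-stage patients",
    "A survey of imaging methods",
    "Lung cancer survival rate trends; lung cancer survival rate by region",
    "Treatment guidelines",
]
tf = term_frequency("lung cancer survival rate", cluster_documents)
```

Here the phrase occurs three times in a cluster of four documents, so its TF is 0.75. The TF-IDF of the phrase is then this value multiplied by the phrase's IDF.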

Illustratively, the IDF of each phrase in the set of candidate phrases may be represented by equation (3):

Finally, the topic corresponding to each medical document cluster is determined according to the first feature vector corresponding to the medical document cluster, the second feature vector and the fourth feature vector corresponding to each phrase in the candidate phrase set, and the TF-IDF of each phrase in the candidate phrase set.

Illustratively, determining a first similarity between a first feature vector corresponding to each medical document cluster and a second feature vector corresponding to each phrase in the candidate phrase set; determining a second similarity between the first feature vector corresponding to each medical document cluster and a fourth feature vector corresponding to each phrase in the candidate phrase set; finally, a third similarity between each medical document cluster and each phrase in the candidate phrase set is determined according to the first similarity, the second similarity and the TF-IDF corresponding to each phrase. For example, the first similarity, the second similarity, and the TF-IDF may be weighted to obtain the third similarity.

Illustratively, the similarity may be cosine similarity between vectors. Therefore, the third similarity can be expressed by equation (4):

$sim(phr, cluster) = \beta \cdot cos\_sim(vec_1, vec_2) + (1-\beta) \cdot cos\_sim(vec_1, vec_4) + (1-\beta) \cdot \mathrm{TF\text{-}IDF}$    formula (4);

where $sim(phr, cluster)$ is the third similarity between the medical document cluster and each phrase, $cos\_sim$ is the cosine similarity operation, $vec_1$ is the first feature vector corresponding to the medical document cluster, $vec_2$ is the second feature vector corresponding to each phrase in the candidate phrase set, $vec_4$ is the fourth feature vector corresponding to each phrase in the candidate phrase set, and $\beta$ is a preset parameter with $0 \le \beta \le 1$.
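Formula (4) can be evaluated with a short sketch. The weighting follows the formula as printed; the toy vectors, the TF-IDF value and the choice of β are invented for illustration, and returning 0 for a zero-length vector is an assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0  # assumption: zero vectors score 0


def third_similarity(vec1, vec2, vec4, tf_idf, beta):
    """Weighted combination of the first similarity, second similarity and TF-IDF."""
    first_sim = cosine_similarity(vec1, vec2)   # cluster vs. phrase embedding
    second_sim = cosine_similarity(vec1, vec4)  # cluster vs. averaged word embeddings
    return beta * first_sim + (1 - beta) * second_sim + (1 - beta) * tf_idf


# Toy inputs: the phrase embedding matches the cluster vector exactly,
# while the averaged word embedding is orthogonal to it.
sim = third_similarity(
    vec1=[1.0, 0.0], vec2=[1.0, 0.0], vec4=[0.0, 1.0], tf_idf=0.5, beta=0.6
)
```

With these inputs the result is 0.6 * 1 + 0.4 * 0 + 0.4 * 0.5 = 0.8.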

Then, a fourth similarity between any two phrases in the candidate phrase set is determined according to the second feature vector of each phrase in the candidate phrase set. For example, the fourth similarity may also be a cosine similarity, so the fourth similarity may be represented by formula (5):

$sim(phr_1, phr_2) = cos\_sim(vec_{21}, vec_{22})$    formula (5);

where $phr_1$ is one phrase in the candidate phrase set, $phr_2$ is another phrase in the candidate phrase set, $sim(phr_1, phr_2)$ is the fourth similarity between the two phrases, $vec_{21}$ is the second feature vector corresponding to $phr_1$, and $vec_{22}$ is the second feature vector corresponding to $phr_2$.

And finally, determining the corresponding theme of each medical document cluster according to the third similarity between each medical document cluster and each phrase in the candidate phrase set and the fourth similarity between any two phrases in the candidate phrase set.

Exemplarily, the phrase with the largest third similarity in the candidate phrase set is taken as a target phrase and moved from the candidate phrase set to the target phrase set. Then, a maximal marginal relevance (MMR) score is determined for each of the remaining phrases in the candidate phrase set according to the third similarity between that phrase and the medical document cluster and the fourth similarity between that phrase and each target phrase in the target phrase set. For example, for each remaining phrase, a fifth similarity is obtained for each target phrase in the target phrase set from the third similarity between the remaining phrase and the medical document cluster and the fourth similarity between the remaining phrase and that target phrase, and the largest fifth similarity is taken as the MMR score of the remaining phrase. The phrase with the largest MMR score among the remaining phrases is then moved from the candidate phrase set to the target phrase set. Finally, the MMR scores of the phrases still remaining in the candidate phrase set are determined again and the phrase with the largest MMR score is moved to the target phrase set; this is iterated until the number of target phrases in the target phrase set reaches a preset number, at which point iteration stops and the target phrases in the target phrase set are taken as the topics of the medical document cluster.

Illustratively, the MMR score for each of the remaining phrases can be represented by equation (6):

where PHR denotes the candidate phrase set corresponding to each medical document cluster, K is the target phrase set, $phr_i \in PHR \setminus K$ denotes the i-th phrase that belongs to the candidate phrase set but not to the target phrase set, i.e., the i-th of the remaining phrases, $MMR_i$ is the MMR score of the i-th phrase, $phr_j \in K$ denotes the j-th phrase in the target phrase set, $sim(phr_i, cluster)$ is the third similarity between the i-th phrase and the medical document cluster, $sim(phr_i, phr_j)$ is the fourth similarity between the i-th phrase and the j-th phrase, argmax denotes taking the maximum value, i.e., after traversing the target phrases in the target phrase set, the maximum value is taken as the MMR score of the i-th phrase, and $\alpha$ is a preset parameter. Finally, after traversing each of the remaining phrases, the MMR score of each remaining phrase is obtained.

For example, suppose the candidate phrase set of a medical document cluster includes phrase A, phrase B, phrase C, phrase D and phrase E, and the third similarity between phrase A and the medical document cluster is the largest. Phrase A is first taken as a target phrase and moved from the candidate phrase set to the target phrase set, so that the remaining phrases of the candidate phrase set are phrase B, phrase C, phrase D and phrase E. The MMR score of each remaining phrase is then calculated by substituting the third similarity between that phrase and the medical document cluster and the fourth similarity between that phrase and phrase A into formula (6), which gives the MMR scores of phrase B, phrase C, phrase D and phrase E. If the MMR score of phrase B is the largest, phrase B is moved from the candidate phrase set to the target phrase set, and the remaining phrases are phrase C, phrase D and phrase E. Next, for each remaining phrase, the third similarity between the phrase and the medical document cluster and the fourth similarity between the phrase and phrase A are substituted into formula (6) to obtain the similarity corresponding to phrase A; the third similarity between the phrase and the medical document cluster and the fourth similarity between the phrase and phrase B are substituted into formula (6) to obtain the similarity corresponding to phrase B; and the larger of the two similarities is taken as the MMR score of the phrase. Determining the MMR score of each remaining phrase in turn gives the MMR scores of phrase C, phrase D and phrase E. Assuming the MMR score of phrase C is the largest, phrase C is moved from the candidate phrase set to the target phrase set. If the preset number is three, the target phrase set now contains three phrases, so iteration stops and phrase A, phrase B and phrase C are taken as the topics of the medical document cluster.
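The iterative selection walked through above can be sketched as a greedy loop. Formula (6) is not reproduced in this text, so the per-target term used here, α·sim(phr_i, cluster) − (1 − α)·sim(phr_i, phr_j), is an assumption borrowed from the usual MMR form; taking the maximum of that term over the target phrases follows the description. The similarity values and α are invented.

```python
def select_topic_phrases(cluster_sim, phrase_sim, count, alpha=0.5):
    """Greedily move phrases into the target set by MMR score.

    cluster_sim: dict mapping phrase -> similarity to the cluster
    phrase_sim:  function (p, q) -> similarity between two phrases
    """
    remaining = dict(cluster_sim)
    # Seed the target set with the phrase most similar to the cluster.
    first = max(remaining, key=remaining.get)
    selected = [first]
    del remaining[first]

    while remaining and len(selected) < count:
        def mmr(phrase):
            # ASSUMED form of the per-target term; the maximum over the
            # already selected phrases is used as the MMR score.
            return max(
                alpha * remaining[phrase] - (1 - alpha) * phrase_sim(phrase, target)
                for target in selected
            )

        best = max(remaining, key=mmr)
        selected.append(best)
        del remaining[best]
    return selected


# Hypothetical similarities for phrases A to E.
cluster_sim = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.3, "E": 0.2}
pair_sim = {frozenset("AB"): 0.2, frozenset("AC"): 0.4, frozenset("BC"): 0.1}


def phrase_sim(p, q):
    return pair_sim.get(frozenset((p, q)), 0.1)


topics = select_topic_phrases(cluster_sim, phrase_sim, count=3)
```

With these values the loop selects phrase A, then phrase B, then phrase C, mirroring the worked example.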

It can be seen that, in calculating the similarity between each phrase and the first feature vector of the medical document cluster (i.e., the topic feature characterizing the cluster), the first similarity and the second similarity between each phrase and the cluster are calculated using the second feature vector of the phrase (i.e., the semantic feature of the phrase) and the fourth feature vector (i.e., the part-of-speech feature of the words), respectively. Each phrase is segmented, and the second similarity between the phrase and the first feature vector of the cluster is determined at word granularity. This mainly addresses the following problem: some phrases are long and unrelated to the topic of the medical document cluster, but because they are long they may contain words that occur frequently and are merely general terms in the medical field. When semantic features are extracted from such a long phrase, these high-frequency words can make its semantic features appear related to the topic of the cluster, so that the long phrase is mistakenly taken as a topic and the precision of the extracted topics is low. By segmenting each phrase and starting from each word without considering its context, words that appear frequently but are unrelated to the topic are treated as general words, so the second similarity obtained is small and the weighted third similarity is also small. The phrase is then not taken as a topic of the medical document cluster, and the finally extracted topics are more accurate.

In an embodiment of the present application, the subject determination method of the medical literature cluster of the present application can also be applied to the field of smart medical technology. For example, by the method for determining the topics of the medical document clusters, the topic of each medical document cluster can be quickly and accurately marked, so that a doctor can accurately inquire the medical document cluster which the doctor wants to acquire, relevant document references are provided for the doctor, the diagnosis efficiency of the doctor is improved, and the development of medical science and technology is further promoted.

Referring to fig. 3, fig. 3 is a schematic flow chart illustrating a process of determining a score of each medical document according to an embodiment of the present application. The method comprises the following steps:

301: obtain the citation relationships among the plurality of medical documents in each medical document cluster.

302: determine the directed graph corresponding to the plurality of medical documents according to their citation relationships.

303: determine a score for each of the plurality of medical documents according to the directed graph corresponding to the plurality of medical documents and the publication time of each of the plurality of medical documents.

Illustratively, a first score for each of the plurality of medical documents is determined according to the directed graph corresponding to the plurality of medical documents and the PageRank algorithm.

Specifically, similar to the method for determining the importance of a web page, the transition matrix corresponding to the plurality of medical documents is determined according to the directed graph (the citation relationships among the medical documents are analogous to the link relationships among web pages). The initial probability of each medical document is then determined according to the number of medical documents; that is, the initial probability of each medical document is 1/N, where N is the number of the plurality of medical documents. Finally, multiple iterations are performed according to the initial probability, the transition matrix and a preset hyperparameter to obtain the first score of each medical document, where the first score also reflects the quality of each medical document.
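The iteration described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `pagerank_scores`, the damping value 0.85 used for the preset hyperparameter, the fixed iteration count, and the uniform handling of documents that cite nothing are all assumptions.

```python
import numpy as np

def pagerank_scores(citations, n_docs, damping=0.85, n_iter=100):
    """First scores via PageRank power iteration.

    citations: iterable of (citing, cited) index pairs, i.e. directed edges.
    damping:   the preset hyperparameter (0.85 is an assumed, conventional value).
    """
    # Column-stochastic transition matrix: column j spreads document j's
    # probability evenly over the documents it cites.
    transition = np.zeros((n_docs, n_docs))
    for citing, cited in citations:
        transition[cited, citing] = 1.0
    out_degree = transition.sum(axis=0)
    for j in range(n_docs):
        if out_degree[j] > 0:
            transition[:, j] /= out_degree[j]
        else:
            # Assumption: a document citing nothing spreads its mass uniformly.
            transition[:, j] = 1.0 / n_docs

    scores = np.full(n_docs, 1.0 / n_docs)  # initial probability 1/N
    for _ in range(n_iter):
        scores = (1.0 - damping) / n_docs + damping * transition @ scores
    return scores

# Toy cluster: documents 1 and 2 both cite document 0; document 2 also cites 1.
first_scores = pagerank_scores([(1, 0), (2, 0), (2, 1)], n_docs=3)
```

In this toy graph the most-cited document 0 receives the highest first score, and the scores remain a probability distribution over the documents.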

Illustratively, the first scores of the plurality of medical documents are normalized to obtain a second score corresponding to each of the plurality of medical documents; the score corresponding to medical document i is then obtained according to the directed graph and the second score corresponding to each medical document.

Illustratively, according to the directed graph and the second score of each medical document, a third score for medical document i is determined for each of the other medical documents in the plurality of medical documents; the third scores given to medical document i by the other medical documents are then summed together with the second score of medical document i to obtain the score corresponding to medical document i.

Specifically, according to the directed graph, the medical documents that cite medical document i and the medical documents that do not cite medical document i (i.e., isolated nodes in the directed graph) are determined among the other medical documents. The medical documents that cite medical document i include those that cite it directly and those that cite it indirectly; for example, as shown in fig. 2, medical document B directly cites medical document A and medical document C indirectly cites medical document A. A third score of medical document j for medical document i is determined according to the second score and publication time of medical document j, the second score of medical document i and the preset time node, where medical document j is any one of the medical documents that cite medical document i, j ranges from 1 to M, and M is the number of medical documents that cite medical document i. The third score of a medical document that does not cite medical document i is determined to be 0.

For example, in the case where medical document j directly cites medical document i, a first mean value of the second scores of medical document j and medical document i and a first time difference between the publication time of medical document j and the publication time of medical document i may be determined; the third score of medical document j for medical document i is then determined according to the first mean value and the first time difference.

Illustratively, the case where medical document j indirectly cites medical document i is explained using three medical documents. For example, if medical document j directly cites medical document k (and does not directly cite medical document i), and medical document k directly cites medical document i, then the third score of medical document j for medical document k and the third score of medical document k for medical document i can be determined, and the product of these two third scores is used as the third score of medical document j for medical document i. Specifically, a second mean value of the second scores of medical document j and medical document k and a second time difference between the publication time of medical document j and the publication time of medical document k can be determined, and the third score of medical document j for medical document k is determined according to the second mean value and the second time difference. Likewise, a third mean value of the second scores of medical document k and medical document i and a third time difference between the publication time of medical document k and the publication time of medical document i are determined, and the third score of medical document k for medical document i is determined according to the third mean value and the third time difference.

Illustratively, the third score of the medical document j to the medical document i can be represented by formula (7):

where Pr(i, j) is the third score of medical document j for medical document i, Pr(i) is the second score of medical document i, Pr(j) is the second score of medical document j, T_j is the publication time of medical document j, and T_i is the publication time of medical document i; the "other cases" branch of the formula covers the situation in which medical document j does not cite medical document i.
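The direct and indirect cases can be sketched as below. The exact form of formula (7) is not reproduced in this text, so the `combine` function (the mean damped by the absolute time difference) is a purely hypothetical stand-in, as are the function names and the toy scores; only the structure (mean and time difference per hop, product along an indirect chain, 0 when there is no citation) follows the description.

```python
from math import prod

def combine(mean_score, time_diff):
    # Hypothetical stand-in for formula (7): damp the mean by the gap in years.
    return mean_score / (1.0 + abs(time_diff))

def direct_third_score(j, i, second, pub_time):
    """Third score of document j for document i when j cites i directly."""
    mean_score = (second[j] + second[i]) / 2.0
    return combine(mean_score, pub_time[j] - pub_time[i])

def third_score(path, second, pub_time):
    """Third score along a citation path [j, ..., i]; each hop is a direct citation.

    A path with two entries is a direct citation; longer paths multiply the
    per-hop scores. An empty path (j does not cite i) yields 0.
    """
    if len(path) < 2:
        return 0.0
    return prod(direct_third_score(a, b, second, pub_time)
                for a, b in zip(path, path[1:]))

second = {"i": 0.6, "k": 0.4, "j": 0.2}       # normalized (second) scores
pub_time = {"i": 2015, "k": 2017, "j": 2018}  # publication years
direct = third_score(["k", "i"], second, pub_time)         # k cites i
indirect = third_score(["j", "k", "i"], second, pub_time)  # j cites k, k cites i
```

Because each per-hop score in this sketch is below 1, an indirect citation contributes less than a direct one, which matches the intuition behind multiplying the hop scores.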

Illustratively, the score of the medical document i can be represented by formula (8):

where Pr_i^H is the score of medical document i, Pr(i, j) is the third score of medical document j for medical document i, and Pr_i^2 is the second score of medical document i. The second score of each medical document is added at the end mainly because an isolated medical document still has a certain influence of its own; this avoids setting the score of such a medical document to 0 and makes the score of each medical document more convincing.
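Read together with the summation described earlier, the score of medical document i is the sum of the third scores it receives plus its own second score. A minimal sketch follows; the sum-to-one normalization, the function names and the third-score values are assumptions made for illustration.

```python
def normalize(first_scores):
    """Second scores: first scores rescaled to sum to 1 (assumed normalization)."""
    total = sum(first_scores.values())
    return {doc: s / total for doc, s in first_scores.items()}

def final_score(i, second, third):
    """Score of document i: third scores it receives plus its own second score.

    third maps (j, i) to the third score of document j for document i;
    missing pairs count as 0, as for documents that do not cite i.
    """
    received = sum(v for (j, target), v in third.items() if target == i and j != i)
    return received + second[i]

second = normalize({"a": 3.0, "b": 1.0, "c": 1.0})
third = {("b", "a"): 0.2, ("c", "a"): 0.1}  # illustrative values only
scores = {doc: final_score(doc, second, third) for doc in second}
```

Documents b and c receive no citations here, yet their scores equal their second scores rather than 0, which is the effect the superposition is meant to achieve.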

Referring to fig. 4, fig. 4 is a block diagram of the functional units of a topic determination apparatus for medical document clusters according to an embodiment of the present application. The topic determination apparatus 400 for medical document clusters comprises an acquisition unit 401 and a processing unit 402, wherein:

an acquisition unit 401 for acquiring a plurality of medical documents;

the processing unit 402 is configured to cluster the plurality of medical documents to obtain at least one medical document cluster;

a processing unit 402, further configured to determine a target medical document in each of the at least one medical document cluster;

a processing unit 402, further configured to determine a candidate phrase set corresponding to each medical document cluster;

the processing unit 402 is further configured to determine a topic corresponding to each medical document cluster according to the target medical document in each medical document cluster and the candidate phrase set corresponding to each medical document cluster.

In some possible embodiments, in determining the target medical document in each of the at least one medical document cluster, the processing unit 402 is specifically configured to:

acquiring the citation relation among the medical documents in each medical document cluster;

determining the scores of the medical documents in each medical document cluster according to the citation relation among the medical documents in each medical document cluster, wherein the scores of the medical documents are used for representing the importance degree of the medical documents;

and determining the target medical document in each medical document cluster in descending order of score.
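Selecting the target medical documents from the scores can be sketched as a simple ranking. The number of target documents per cluster is not fixed by the text, so `top_k` and the function name are assumptions.

```python
def select_target_documents(scores, top_k=1):
    """Return the top_k document ids ordered by descending score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]

targets = select_target_documents({"doc_a": 0.9, "doc_b": 0.2, "doc_c": 0.4}, top_k=2)
```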

In some possible embodiments, in determining the candidate phrase set corresponding to each medical document cluster, the processing unit 402 is specifically configured to:

determining phrases corresponding to the medical documents in each medical document cluster according to the titles and the abstracts of the medical documents in each medical document cluster;

forming a first phrase set by phrases corresponding to each medical document in each medical document cluster;

and screening the phrases in the first phrase set to obtain a candidate phrase set corresponding to each medical document cluster.

In some possible embodiments, in the aspect of filtering the phrases in the first phrase set to obtain the candidate phrase set corresponding to each medical document cluster, the processing unit 402 is specifically configured to:

mapping the abbreviated phrases in the first phrase set into full names to obtain a second phrase set;

deleting the phrases which only contain one word in the second phrase set to obtain a third phrase set;

determining phrases with the same semantic meaning in the third phrase set, and replacing the phrases with the same semantic meaning in the third phrase set with standardized phrases to obtain a fourth phrase set;

and taking the fourth phrase set as a candidate phrase set corresponding to each medical document cluster.
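The screening steps above can be sketched as a small pipeline. The lookup tables `abbreviations` and `synonyms` are hypothetical inputs (the text does not say how abbreviations or equivalent phrases are detected), and removing duplicates after standardization is an assumption.

```python
def screen_phrases(first_set, abbreviations, synonyms):
    """Turn a first phrase set into a candidate phrase set.

    abbreviations: maps an abbreviation to its full name (hypothetical table).
    synonyms:      maps a phrase to its standardized form (hypothetical table).
    """
    # Second set: expand abbreviations to full names.
    second_set = [abbreviations.get(p, p) for p in first_set]
    # Third set: drop phrases consisting of a single word.
    third_set = [p for p in second_set if len(p.split()) > 1]
    # Fourth set: replace equivalent phrases with one standardized phrase,
    # keeping first-seen order and removing duplicates (an assumption).
    fourth_set = list(dict.fromkeys(synonyms.get(p, p) for p in third_set))
    return fourth_set

candidates = screen_phrases(
    ["NSCLC", "lung", "non-small cell lung carcinoma", "immune checkpoint inhibitor"],
    abbreviations={"NSCLC": "non-small cell lung cancer"},
    synonyms={"non-small cell lung carcinoma": "non-small cell lung cancer"},
)
```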

In some possible embodiments, in terms of determining the topic corresponding to each medical document cluster according to the target medical document in each medical document cluster and the candidate phrase set corresponding to each medical document cluster, the processing unit 402 is specifically configured to:

performing word embedding on the title of the target medical document in each medical document cluster to obtain a first feature vector corresponding to each medical document cluster;

performing word embedding on each phrase in the candidate phrase set to obtain a second feature vector corresponding to each phrase in the candidate phrase set;

performing word embedding on each word in each phrase in the candidate phrase set to obtain a third feature vector corresponding to each word;

determining a fourth feature vector corresponding to each phrase in the candidate phrase set according to the third feature vector corresponding to each word;

determining a term frequency-inverse document frequency (TF-IDF) value for each phrase in the candidate phrase set;

and determining a topic corresponding to each medical document cluster according to the first feature vector corresponding to each medical document cluster, the second feature vector corresponding to each phrase in the candidate phrase set, the fourth feature vector corresponding to each phrase in the candidate phrase set and the TF-IDF of each phrase in the candidate phrase set.
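As an illustration of the TF-IDF step only, a phrase-level TF-IDF could be computed as below. The text does not specify the TF or IDF variant, so the raw in-cluster count and the smoothed IDF `log((1 + N) / (1 + df)) + 1` are assumptions, as are the function name and the toy documents.

```python
import math

def phrase_tf_idf(phrase, cluster_docs, all_docs):
    """TF-IDF of a phrase for one cluster.

    cluster_docs / all_docs: lists of lower-cased texts (title plus abstract).
    TF counts occurrences inside the cluster; IDF uses the smoothed form
    log((1 + N) / (1 + df)) + 1, one of several common conventions.
    """
    tf = sum(doc.count(phrase) for doc in cluster_docs)
    df = sum(1 for doc in all_docs if phrase in doc)
    idf = math.log((1 + len(all_docs)) / (1 + df)) + 1.0
    return tf * idf

cluster_docs = ["gene therapy for retinal disease", "retinal disease outcomes"]
all_docs = cluster_docs + ["statin use in heart failure"]
score = phrase_tf_idf("retinal disease", cluster_docs, all_docs)
```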

In some possible embodiments, in terms of determining a topic corresponding to each medical document cluster according to the first feature vector corresponding to each medical document cluster, the second feature vector corresponding to each phrase in the candidate phrase set, the fourth feature vector corresponding to each phrase in the candidate phrase set, and the TF-IDF of each phrase in the candidate phrase set, the processing unit 402 is specifically configured to:

determining a first similarity between a first feature vector corresponding to each medical document cluster and a second feature vector corresponding to each phrase in the candidate phrase set;

determining a second similarity between the first feature vector corresponding to each medical document cluster and a fourth feature vector corresponding to each phrase in the candidate phrase set;

determining a third similarity between each of the medical document clusters and each of the phrases in the candidate phrase set based on the first and second similarities and the TF-IDF value;

determining a fourth similarity between any two phrases in the candidate phrase set according to the second feature vector of each phrase in the candidate phrase set;

and determining a topic corresponding to each medical document cluster according to the third similarity between each medical document cluster and each phrase in the candidate phrase set and the fourth similarity between any two phrases.
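The combination of the first similarity, the second similarity and TF-IDF into a third similarity can be sketched as follows. Cosine similarity, the weight `alpha` and the multiplication by TF-IDF are assumptions; the steps above only state that the three quantities are combined.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def third_similarity(cluster_vec, phrase_vec, word_level_vec, tf_idf, alpha=0.5):
    """Combine phrase-level and word-level similarity with TF-IDF.

    The weighting (alpha) and the multiplication by TF-IDF are assumptions.
    """
    first = cosine(cluster_vec, phrase_vec)       # first similarity
    second = cosine(cluster_vec, word_level_vec)  # second similarity
    return (alpha * first + (1.0 - alpha) * second) * tf_idf

cluster_vec = np.array([1.0, 0.0])
sim = third_similarity(cluster_vec, np.array([1.0, 0.0]), np.array([0.0, 1.0]), tf_idf=2.0)
```

In this toy case the phrase-level vector matches the cluster exactly while the word-level vector is orthogonal to it, so the word-level term halves the combined similarity before the TF-IDF weighting.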

In some possible embodiments, in terms of determining the corresponding topic of each medical document cluster according to the third similarity between each medical document cluster and each phrase in the candidate phrase set and the fourth similarity between any two phrases, the processing unit 402 is specifically configured to:

selecting a phrase with the maximum third similarity from the candidate phrase set as a target phrase, and moving the target phrase from the candidate phrase set to the target phrase set;

determining a maximal marginal relevance (MMR) score corresponding to each of the remaining phrases in the candidate phrase set according to the third similarity between each remaining phrase and each medical document cluster and the fourth similarity between each remaining phrase and each target phrase in the target phrase set;

moving the phrase with the largest MMR score in the remaining phrases from the candidate phrase set to the target phrase set;

repeatedly executing the operations of determining the MMR score corresponding to each phrase in the remaining phrases of the candidate phrase set and moving the phrase with the maximum MMR score to the target phrase set until the number of the target phrases in the target phrase set reaches a preset number;

and taking the target phrases in the target phrase set reaching a preset number as the subject of each medical document cluster.
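The greedy selection above can be sketched as follows. The trade-off weight `lam`, the use of the maximum similarity to already-selected phrases as the redundancy term, and the toy similarity values are assumptions; `relevance` stands for the third similarity and `pairwise` for the similarity between two phrases.

```python
def mmr_select(relevance, pairwise, n_topics, lam=0.7):
    """Greedy maximal marginal relevance selection.

    relevance: phrase -> third similarity to the cluster.
    pairwise:  frozenset({p, q}) -> similarity between phrases p and q.
    lam:       trade-off between relevance and redundancy (assumed value).
    """
    remaining = dict(relevance)
    # Seed the target set with the phrase most similar to the cluster.
    first = max(remaining, key=remaining.get)
    selected = [first]
    del remaining[first]
    while remaining and len(selected) < n_topics:
        def mmr(p):
            redundancy = max(pairwise.get(frozenset((p, s)), 0.0) for s in selected)
            return lam * remaining[p] - (1.0 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        del remaining[best]
    return selected

relevance = {"lung cancer": 0.9, "lung carcinoma": 0.85, "immune therapy": 0.6}
pairwise = {
    frozenset(("lung cancer", "lung carcinoma")): 0.95,
    frozenset(("lung cancer", "immune therapy")): 0.10,
    frozenset(("lung carcinoma", "immune therapy")): 0.10,
}
topics = mmr_select(relevance, pairwise, n_topics=2)
```

Here "lung carcinoma" is more relevant than "immune therapy" but nearly duplicates the first pick, so the redundancy penalty lets the less similar phrase be chosen second.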

Referring to fig. 5, fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 5, the electronic device 500 includes a transceiver 501, a processor 502, and a memory 503, which are connected to each other by a bus 504. The memory 503 is used to store computer programs and data, and the data stored in the memory 503 may be transmitted to the processor 502.

The processor 502 is configured to read the computer program in the memory 503 to perform the following operations:

In some possible embodiments, in determining the target medical document in each of the at least one medical document cluster, the processor 502 is specifically configured to:

In some possible embodiments, in determining the candidate phrase set corresponding to each medical document cluster, the processor 502 is specifically configured to perform the following operations:

In some possible embodiments, in the aspect of filtering the phrases in the first phrase set to obtain the candidate phrase set corresponding to each medical document cluster, the processor 502 is specifically configured to perform the following operations:

In some possible embodiments, in determining the topic corresponding to each medical document cluster according to the target medical document in each medical document cluster and the candidate phrase set corresponding to each medical document cluster, the processor 502 is specifically configured to perform the following operations:

In some possible embodiments, in determining the topic corresponding to each medical document cluster according to the first feature vector corresponding to each medical document cluster, the second feature vector corresponding to each phrase in the candidate phrase set, the fourth feature vector corresponding to each phrase in the candidate phrase set, and the TF-IDF of each phrase in the candidate phrase set, the processor 502 is specifically configured to:

In some possible embodiments, in terms of determining the corresponding topic of each medical document cluster according to the third similarity between each medical document cluster and each phrase in the candidate phrase set and the fourth similarity between any two phrases, the processor 502 is specifically configured to:

Specifically, the transceiver 501 may be the acquisition unit 401 of the topic determination apparatus 400 for medical document clusters in the embodiment shown in fig. 4, and the processor 502 may be the processing unit 402 of the topic determination apparatus 400 for medical document clusters in the embodiment shown in fig. 4.

It should be understood that the topic determination apparatus for medical document clusters in the present application may include a smartphone (e.g., an Android phone, an iOS phone, a Windows Phone device, etc.), a tablet computer, a palmtop computer, a notebook computer, a mobile Internet device (MID), a wearable device, or the like. The devices listed above are merely examples and are not exhaustive; the topic determination apparatus includes but is not limited to them. In practical applications, the topic determination apparatus for medical document clusters may further include an intelligent vehicle-mounted terminal, computer equipment, and the like.

Embodiments of the present application also provide a computer storage medium, which stores a computer program, where the computer program is executed by a processor to implement part or all of the steps of any one of the medical literature cluster topic determination methods as described in the above method embodiments.

Embodiments of the present application also provide a computer program product comprising a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any one of the medical literature cluster topic determination methods as set forth in the above method embodiments.

It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules referred to are not necessarily required in this application.

In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical or in another form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.

The integrated unit, if implemented in the form of a software program module and sold or used as a stand-alone product, may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present application in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and various other media capable of storing program code.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the associated hardware. The program may be stored in a computer-readable memory, which may include a flash memory disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.

The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims

1. A method for topic determination for a medical document cluster, comprising:

2. The method of claim 1, wherein the determining the target medical document in each of the at least one medical document cluster comprises:

3. The method of claim 1 or 2, wherein said determining a set of candidate phrases corresponding to each of said medical document clusters comprises:

4. The method of claim 3, wherein the filtering the phrases in the first phrase set to obtain the candidate phrase set corresponding to each medical document cluster comprises:

5. The method according to any one of claims 1-4, wherein the determining the topic corresponding to each medical document cluster according to the target medical document in each medical document cluster and the candidate phrase set corresponding to each medical document cluster comprises:

6. The method of claim 5, wherein determining the topic corresponding to each medical document cluster according to the first feature vector corresponding to each medical document cluster, the second feature vector corresponding to each phrase in the candidate phrase set, the fourth feature vector corresponding to each phrase in the candidate phrase set, and the TF-IDF of each phrase in the candidate phrase set comprises:

7. The method of claim 6, wherein determining the topic corresponding to each medical document cluster according to a third similarity between each medical document cluster and each phrase in the candidate phrase set and a fourth similarity between any two phrases comprises:

8. A topic determination apparatus for a medical document cluster, comprising:

an acquisition unit for acquiring a plurality of medical documents;

9. An electronic device comprising a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the programs comprising instructions for performing the steps of the method of any of claims 1-7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which is executed by a processor to implement the method according to any one of claims 1-7.