CN109493978B - Disease research hotspot mining method and device, storage medium and electronic equipment - Google Patents

Disease research hotspot mining method and device, storage medium and electronic equipment

Info

Publication number
CN109493978B
Authority
CN
China
Prior art keywords
candidate
word
disease
researched
hot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811338754.2A
Other languages
Chinese (zh)
Other versions
CN109493978A (en)
Inventor
李林峰
张春宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yiyiyun Technology Co ltd
Original Assignee
Beijing Yiyiyun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yiyiyun Technology Co ltd
Priority to CN201811338754.2A
Publication of CN109493978A
Application granted
Publication of CN109493978B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The disclosure relates to the technical field of computers, and in particular relates to a disease research hotspot mining method and device, a storage medium and electronic equipment. The method comprises the following steps: respectively calculating the co-occurrence factors of the disease to be researched and each candidate hot word according to the quantity of the medical literature in which the disease to be researched and the candidate hot word simultaneously appear; respectively calculating specificity factors of the candidate hot words according to the number of diseases which appear simultaneously with the candidate hot words; calculating a hot spot trend factor of each candidate hot spot word according to the change rate of each candidate hot spot word; and determining a target hot word of the disease to be researched in the plurality of candidate hot words according to the co-occurrence factor of the disease to be researched and each candidate hot word, the specificity factor of each candidate hot word and the hot trend factor. The method and the device greatly increase the accuracy rate of determining the target hot words.

Description

Disease research hotspot mining method and device, storage medium and electronic equipment
Technical Field
The disclosure relates to the technical field of computers, and in particular relates to a disease research hotspot mining method and device, a storage medium and electronic equipment.
Background
In general, medical workers have little diagnosis and treatment experience with new diseases, difficult and complicated diseases, and variants of known diseases. When preparing a treatment plan for such a disease, they therefore need to rely on the research results for that disease in order to arrive at a scientific and reasonable plan, and disease research accordingly plays an increasingly important role in the field of medical technology.
However, in the course of disease research, how to accurately determine target hot words, so that they give researchers an accurate research direction and thereby save research cost and shorten research time, has become one of the important topics in disease research. At present, a plurality of candidate hot words corresponding to a disease are obtained, and the candidate hot word with the highest occurrence frequency is determined as the target hot word of the disease.
Clearly, this approach determines the target hot word of the disease only from the occurrence frequency of the candidate hot words. Because only a single factor is considered, the accuracy of the resulting target hot word is reduced, so a wrong research direction may be given to researchers, which increases research cost and prolongs research time.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure aims to provide a disease research hotspot mining method and apparatus, a storage medium, and an electronic device that overcome, at least to a certain extent, the problem that considering only a single factor when determining a target hot word reduces the accuracy of the result, which may in turn give researchers a wrong research direction, increase research cost, and prolong research time.
According to an aspect of the present disclosure, there is provided a disease research hotspot mining method, including:
acquiring a disease to be researched and a plurality of candidate hot words corresponding to the disease to be researched;
respectively calculating the co-occurrence factors of the disease to be researched and each candidate hot word according to the quantity of the medical literature in which the disease to be researched and the candidate hot word simultaneously appear;
respectively calculating specificity factors of the candidate hot words according to the number of diseases which appear simultaneously with the candidate hot words;
calculating a hot spot trend factor of each candidate hot spot word according to the change rate of each candidate hot spot word;
and determining a target hot word of the disease to be researched in the plurality of candidate hot words according to the co-occurrence factor of the disease to be researched and each candidate hot word, the specificity factor of each candidate hot word and the hot trend factor.
In an exemplary embodiment of the disclosure, the calculating the co-occurrence factors of the disease to be researched and each candidate hot word according to the number of medical documents in which the disease to be researched and the candidate hot word occur simultaneously includes:
calculating co-occurrence factors of the disease to be researched and each candidate hot word in each publication year according to the number of medical documents in which the disease to be researched and the candidate hot word simultaneously appear in each publication year and the year coefficient of each publication year;
and calculating the co-occurrence factors of the disease to be researched and each candidate hot word according to the co-occurrence factors of the disease to be researched and each candidate hot word in each publication year.
In an exemplary embodiment of the disclosure, the calculating specificity factors of the candidate hot words according to the number of diseases occurring simultaneously with the candidate hot words respectively includes:
respectively acquiring the number of diseases which appear simultaneously with each candidate hot word;
respectively taking the logarithm of the number of diseases which simultaneously appear with each candidate hot word to obtain the specificity factor of each candidate hot word.
In an exemplary embodiment of the present disclosure, the calculating a hot spot trend factor of each candidate hot spot word according to a change rate of each candidate hot spot word includes:
calculating the change rate of each candidate hot word in each publication year according to the quantity of the medical documents including each candidate hot word in each publication year and the quantity of the medical documents including each candidate hot word in the previous year of each publication year;
and calculating the hot spot trend factor of each candidate hot spot word according to the change rate of each candidate hot spot word of each publication year.
In an exemplary embodiment of the disclosure, the calculating a change rate of each candidate hot word in each publication year according to the number of medical documents including each candidate hot word in each publication year and the number of medical documents including each candidate hot word in a previous year in each publication year respectively includes:
calculating the change rate of each candidate hot word in each publication year according to the difference between the quantity of the medical documents including each candidate hot word in each publication year and the quantity of the medical documents including each candidate hot word in the previous year of each publication year and the year coefficient of each publication year.
In an exemplary embodiment of the disclosure, the obtaining a disease to be studied and a plurality of candidate hot words corresponding to the disease to be studied includes:
and acquiring the disease to be researched, and determining a plurality of keywords which appear simultaneously with the disease to be researched in each medical document as a plurality of candidate hot words.
In an exemplary embodiment of the disclosure, the determining, according to the co-occurrence factor of the disease to be researched and each of the candidate hot words, the specificity factor of each of the candidate hot words, and the hot trend factor, a target hot word of the disease to be researched in the plurality of candidate hot words includes:
calculating the hot score of each candidate hot word according to the co-occurrence factor of the disease to be researched and each candidate hot word, the specificity factor of each candidate hot word and the hot trend factor;
and determining a target hot word of the disease to be researched in a plurality of candidate hot words according to the hot score of each candidate hot word.
In an exemplary embodiment of the disclosure, the calculating a hot score of each candidate hot word according to the co-occurrence factor of the disease to be researched and each candidate hot word, the specificity factor of each candidate hot word, and the hot trend factor includes:
calculating the hot score of each candidate hot word through the following formula:
$\mathrm{goal}(i) = 0.5 \cdot A_i \cdot B_i + 0.5 \cdot C_i$
where $\mathrm{goal}(i)$ is the hot score of the i-th candidate hot word, $A_i$ is the co-occurrence factor of the disease to be researched and the i-th candidate hot word, $B_i$ is the specificity factor of the i-th candidate hot word, and $C_i$ is the hot spot trend factor of the i-th candidate hot word.
In an exemplary embodiment of the disclosure, the determining, according to the popularity score of each candidate hot word, a target hot word of the disease to be researched from among a plurality of candidate hot words includes:
and sequencing the candidate hot words according to the sequence of the hot scores from high to low, and determining the candidate hot word ranked at the first position as a target hot word.
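The scoring and ranking described above can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function names `hot_score` and `target_hot_word` and the numeric factor values are hypothetical, and the three factors are assumed to have been computed beforehand.

```python
from typing import Dict, Tuple


def hot_score(a: float, b: float, c: float) -> float:
    """goal(i) = 0.5 * A_i * B_i + 0.5 * C_i, as given in the text."""
    return 0.5 * a * b + 0.5 * c


def target_hot_word(factors: Dict[str, Tuple[float, float, float]]) -> str:
    """Rank candidates by hot score, highest first, and return the top one.

    `factors` maps each candidate hot word to (A_i, B_i, C_i).
    """
    ranked = sorted(factors, key=lambda w: hot_score(*factors[w]), reverse=True)
    return ranked[0]


# Illustrative values only (not taken from the patent).
example_factors = {
    "surgery": (40.0, 0.1, 0.2),  # frequent, but not specific to one disease
    "EGFR": (12.0, 1.5, 0.6),     # less frequent, more specific, trending
}
best = target_hot_word(example_factors)
```

With these made-up inputs the less frequent but more specific candidate ranks first, which is the behaviour that the combination of factors is meant to produce.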
According to an aspect of the present disclosure, there is provided a disease research hotspot excavating device, comprising:
the acquisition module is used for acquiring a disease to be researched and a plurality of candidate hot words corresponding to the disease to be researched;
the first calculation module is used for respectively calculating the co-occurrence factors of the disease to be researched and each candidate hot word according to the quantity of the medical literature in which the disease to be researched and the candidate hot word simultaneously appear;
the second calculation module is used for calculating specificity factors of the candidate hot words according to the number of diseases which appear simultaneously with the candidate hot words;
the third calculation module is used for calculating the hot spot trend factors of the candidate hot spot words according to the change rate of the candidate hot spot words;
the determining module is used for determining a target hot word of the disease to be researched from the plurality of candidate hot words according to the co-occurrence factor of the disease to be researched and each candidate hot word, the specificity factor of each candidate hot word and the hot trend factor.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a disease research hotspot mining method of any one of the above.
According to an aspect of the present disclosure, there is provided an electronic device including:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform any of the disease research hotspot mining methods described above via execution of the executable instructions.
The disclosure provides a disease research hotspot mining method and device, a storage medium and electronic equipment, in which a target hot word of the disease to be researched is determined among the plurality of candidate hot words according to the co-occurrence factor of the disease to be researched and each candidate hot word, the specificity factor of each candidate hot word and the hot trend factor of each candidate hot word. On the one hand, compared with the prior art, determining the target hot word of the disease to be researched takes into account the co-occurrence factors of the disease to be researched and the candidate hot words, the specificity factors of the candidate hot words and the hot trend factors of the candidate hot words, rather than only the occurrence frequency of the candidate hot words, which greatly increases the accuracy of determining the target hot word of the disease to be researched; on the other hand, because this accuracy is increased, researchers are given a correct research direction, which greatly reduces research cost and shortens research time.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The above and other features and advantages of the present disclosure will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty. In the drawings:
FIG. 1 is a flow chart of a disease research hotspot mining method of the present disclosure;
FIG. 2 is a block diagram of a disease research hotspot mining device of the present disclosure;
FIG. 3 is a block diagram view of an electronic device in an exemplary embodiment of the disclosure;
FIG. 4 is a schematic diagram illustrating a program product in an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the embodiments of the disclosure can be practiced without one or more of the specific details, or with other methods, components, materials, devices, steps, and so forth. In other instances, well-known structures, methods, devices, implementations, materials, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in the form of software, or in one or more software-hardened modules, or in different networks and/or processor devices and/or microcontroller devices.
In the present exemplary embodiment, a method for mining disease research hotspots is first disclosed, and referring to fig. 1, the method for mining disease research hotspots may include the following steps:
step S110, acquiring a disease to be researched and a plurality of candidate hot words corresponding to the disease to be researched;
step S120, respectively calculating the co-occurrence factors of the disease to be researched and each candidate hot word according to the quantity of the medical literature in which the disease to be researched and the candidate hot word simultaneously appear;
step S130, calculating specificity factors of the candidate hot words according to the number of diseases which appear simultaneously with the candidate hot words;
step S140, calculating a hot spot trend factor of each candidate hot spot word according to the change rate of each candidate hot spot word;
step S150, determining a target hot word of the disease to be researched from the plurality of candidate hot words according to the co-occurrence factor of the disease to be researched and each candidate hot word, the specificity factor of each candidate hot word and the hot trend factor.
According to the disease research hotspot mining method in the exemplary embodiment, on one hand, when determining the target hotspot words of the disease to be researched, compared with the prior art, the co-occurrence factors of the disease to be researched and each candidate hotspot word, the specificity factors of each candidate hotspot word and the hotspot trend factors of each candidate hotspot word are considered, rather than simply considering the occurrence frequency of each candidate hotspot word, so that the accuracy rate of determining the target hotspot words of the disease to be researched is greatly increased; on the other hand, the accuracy of determining the target hot words of the diseases to be researched is increased, so that a correct research direction is provided for researchers, the research cost is greatly reduced, and the research time is shortened.
In step S110, a disease to be studied and a plurality of candidate hot words corresponding to the disease to be studied are obtained.
In the present exemplary embodiment, the disease to be researched may be acquired, and a plurality of keywords that appear together with the disease to be researched in each medical document may be determined as the plurality of candidate hot words. Specifically, the disease to be researched may be chosen by a researcher, for example hypertension, leukemia or meningitis; the present exemplary embodiment places no specific limitation on this. After the disease to be researched is determined, its name is input into the medical literature base, and for each medical document in the base it is judged whether the keywords in the document's abstract include the name of the disease to be researched; if they do, the remaining keywords in that abstract, other than the disease to be researched, are determined as candidate hot words. For example, if the abstract of medical document A includes 6 keywords and these include the name of the disease to be researched, the remaining 5 keywords in the abstract of medical document A are determined as 5 candidate hot words.
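As a rough sketch of this extraction step, the Python snippet below collects candidate hot words from per-document keyword lists. The in-memory `literature` dictionary stands in for the medical literature base, and the function name and sample keywords are hypothetical; how keywords are obtained from abstracts is not specified in the text and is assumed to be done already.

```python
from typing import Dict, List, Set


def candidate_hot_words(disease: str, literature: Dict[str, List[str]]) -> Set[str]:
    """Return every keyword that shares an abstract with the disease name.

    `literature` maps a document id to the keywords of its abstract.
    """
    candidates: Set[str] = set()
    for keywords in literature.values():
        if disease in keywords:
            # Keep the remaining keywords, excluding the disease itself.
            candidates.update(k for k in keywords if k != disease)
    return candidates


# Document "A" has 6 keywords including the disease, so it yields 5 candidates.
literature = {
    "A": ["hypertension", "sodium intake", "ACE inhibitor",
          "stroke", "cohort study", "elderly"],
    "B": ["leukemia", "chemotherapy"],
}
candidates = candidate_hot_words("hypertension", literature)
```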
In step S120, co-occurrence factors of the disease to be studied and each candidate hot word are respectively calculated according to the number of medical documents in which the disease to be studied and the candidate hot word occur simultaneously.
In the exemplary embodiment, the name of the disease to be researched and a candidate hot word may be input into a medical literature library including a large number of medical literatures, the number of the medical literatures including both the name of the disease to be researched and the candidate hot word may be obtained, the number of the medical literatures including both the name of the disease to be researched and the candidate hot word may be determined as the number of the medical literatures in which the disease to be researched and the candidate hot word occur at the same time, and the number of the medical literatures in which the disease to be researched and the candidate hot word occur at the same time may be determined as a co-occurrence factor of the disease to be researched and the candidate hot word. It should be noted that the co-occurrence factor of the disease to be studied and any candidate hot word can be calculated in the above manner. For example, the disease to be studied corresponds to 3 candidate hot words, and the 3 candidate hot words are a first candidate hot word, a second candidate hot word, and a third candidate hot word, respectively. Firstly, the name of a disease to be researched and a first candidate hot word can be input into a medical literature base, and the co-occurrence factor of the disease to be researched and the first candidate hot word is calculated according to the mode; then, the name of the disease to be researched and the second candidate hot word can be input into the medical literature base, and the co-occurrence factor of the disease to be researched and the second candidate hot word is calculated according to the mode; finally, the name of the disease to be researched and the third candidate hot word can be input into the medical literature base, and the co-occurrence factor of the disease to be researched and the third candidate hot word is calculated according to the above mode.
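The simple, unweighted form of the co-occurrence factor described here is a document count. A minimal Python sketch follows, assuming each medical document is represented by its keyword list; the function name and sample data are illustrative only.

```python
from typing import Iterable, List


def co_occurrence_count(disease: str, hot_word: str,
                        keyword_lists: Iterable[List[str]]) -> int:
    """Count the documents whose keywords include both the disease and the hot word."""
    return sum(1 for kws in keyword_lists if disease in kws and hot_word in kws)


docs = [
    ["hypertension", "stroke", "elderly"],
    ["hypertension", "stroke"],
    ["hypertension", "sodium intake"],
    ["leukemia", "stroke"],
]
# One factor per candidate hot word, computed the same way each time.
factors = {w: co_occurrence_count("hypertension", w, docs)
           for w in ("stroke", "elderly", "sodium intake")}
```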
In order to obtain a more accurate co-occurrence factor, the calculating the co-occurrence factors of the disease to be studied and each candidate hot word according to the number of medical documents in which the disease to be studied and the candidate hot word occur simultaneously may include: calculating co-occurrence factors of the disease to be researched and each candidate hot word in each publication year according to the number of medical documents in which the disease to be researched and the candidate hot word simultaneously appear in each publication year and the year coefficient of each publication year; and calculating the co-occurrence factors of the disease to be researched and each candidate hot word according to the co-occurrence factors of the disease to be researched and each candidate hot word in each publication year.
In the exemplary embodiment, the number of medical documents published in each publication year in which the disease to be researched and a candidate hot word occur simultaneously may be obtained from the medical literature base, and the co-occurrence factor of the disease to be researched and that candidate hot word in each publication year may then be calculated by combining this number with the year coefficient of each publication year. It should be noted that the co-occurrence factor of the disease to be researched and any candidate hot word in each publication year can be calculated according to the above process.
The year coefficient of each publication year can be calculated according to the current year and the publication year, and the specific calculation formula is as follows:
Figure BDA0001861938430000081
where $\gamma_j$ is the year coefficient of publication year j, X is the current year, and j is the publication year. For example, when the current year X is 2018 and the publication year j is 2010, the year coefficient of publication year 2010 is determined by the above formula as
Figure BDA0001861938430000082
From the above, the year coefficient for each publication year can be calculated according to the above formula. It should be noted that the publication year j is an integer, and the maximum value of the publication year j is equal to the current year X.
Based on the year coefficients, a formula for calculating the co-occurrence factors of the diseases to be researched and the candidate hot words in each release year is as follows:
Figure BDA0001861938430000091
where $a_{ij}$ is the co-occurrence factor of the disease to be researched and the i-th candidate hot word in publication year j, $\gamma_j$ is the year coefficient of publication year j, and $Z_{ij}$ is the number of medical documents published in publication year j in which the disease to be researched and the i-th candidate hot word occur simultaneously.
After the co-occurrence factors of the disease to be researched and each candidate hot word in each publication year are calculated, the co-occurrence factors of the disease to be researched and each candidate hot word can be calculated in a summing mode. The specific calculation formula is as follows:
$A_i = \sum_{j=p}^{X} a_{ij}$
where $A_i$ is the co-occurrence factor of the disease to be researched and the i-th candidate hot word, and $p \le j \le X$, that is, the publication year j ranges over $[p, X]$, where X is the current year and p is the earliest publication year.
The above process is described below using an example in which the disease to be researched corresponds to 3 candidate hot words, namely a first, a second and a third candidate hot word, the publication years are 2014 to 2018, and the current year is 2018.
Firstly, acquiring the number of medical documents with publication years 2014, 2015, 2016, 2017 and 2018 and simultaneously comprising the diseases to be researched and the first candidate hot spot words, respectively calculating the co-occurrence factors of the diseases to be researched and the first candidate hot spot words with publication years 2014, 2015, 2016, 2017 and 2018 according to the calculation formula of the year coefficient of each publication year and the calculation formula of the co-occurrence factor of the diseases to be researched and the first candidate hot spot words with publication years, respectively, and finally calculating the co-occurrence factors of the diseases to be researched and the first candidate hot spot words according to the calculation formula of the co-occurrence factors of the diseases to be researched and the candidate hot spot words.
And then, respectively calculating the co-occurrence factors of the disease to be researched and the second candidate hot word and the co-occurrence factors of the disease to be researched and the third candidate hot word according to the calculation process of the co-occurrence factors of the disease to be researched and the first candidate hot word.
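The year-weighted calculation can be sketched as follows. Two points are assumptions and not taken from the text: the formula images for the year coefficient and for the per-year factor are not reproduced here, so `example_year_coefficient` is a hypothetical placeholder that simply gives older years less weight, and the per-year factor is assumed to be the product of the year coefficient and the document count. Only the final summation over publication years is stated explicitly in the description.

```python
from typing import Callable, Dict


def example_year_coefficient(current_year: int, publication_year: int) -> float:
    # ASSUMPTION: placeholder decay, NOT the patent's year-coefficient formula.
    return 1.0 / (current_year - publication_year + 1)


def co_occurrence_factor(
    counts_by_year: Dict[int, int],
    current_year: int,
    year_coefficient: Callable[[int, int], float] = example_year_coefficient,
) -> float:
    """Sum the per-year co-occurrence factors over all publication years.

    `counts_by_year` maps publication year j to the document count Z_ij.
    ASSUMPTION: the per-year factor is gamma_j * Z_ij.
    """
    return sum(
        year_coefficient(current_year, year) * count
        for year, count in counts_by_year.items()
    )


# Publication years 2014 to 2018, current year 2018 (counts are made up).
counts = {2014: 10, 2015: 12, 2016: 8, 2017: 20, 2018: 30}
weighted = co_occurrence_factor(counts, current_year=2018)
unweighted = co_occurrence_factor(counts, 2018, lambda x, j: 1.0)
```

Passing the coefficient as a function keeps the sketch usable once the actual year-coefficient formula is substituted for the placeholder.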
In step S130, a specificity factor of each candidate hot word is calculated according to the number of diseases occurring simultaneously with each candidate hot word.
In this exemplary embodiment, a given candidate hot word may be associated with many diseases. For example, when the candidate hot word is "surgery", most tumor-related diseases (i.e., diseases to be researched) are related to it; although the co-occurrence factor of this candidate hot word and the disease to be researched is high, the candidate hot word is not a target hot word of the disease to be researched. For this reason, in order to obtain the target hot word of the disease to be researched more accurately, the specificity factor of each candidate hot word needs to be calculated; the specificity factor of a candidate hot word is negatively correlated with the number of diseases that occur simultaneously with it.
The process of calculating the specificity factor of each candidate hot word may include: respectively acquiring the number of diseases which appear simultaneously with each candidate hot word; respectively taking the logarithm of the number of diseases which simultaneously appear with each candidate hot word to obtain the specificity factor of each candidate hot word.
Specifically, firstly, the candidate hot words are respectively input into the medical literature base, the diseases which are simultaneously appeared with the candidate hot words in each medical literature are respectively obtained, then the diseases which are simultaneously appeared with the candidate hot words and are obtained from each medical literature are respectively counted, the number of the diseases which are simultaneously appeared with the candidate hot words can be obtained, and finally, the number of the diseases which are simultaneously appeared with the candidate hot words is logarithmized respectively, so that the specificity factor of each candidate hot word is obtained. Specifically, the calculation formula of the specificity factor of each candidate hot word is as follows:
Figure BDA0001861938430000101
where $B_i$ is the specificity factor of the i-th candidate hot word, and $b_i$ is the number of diseases occurring simultaneously with the i-th candidate hot word.
Next, taking an example that a disease to be researched corresponds to two candidate hot words, wherein the two candidate hot words are a first candidate hot word and a second candidate hot word respectively, a process of calculating specificity factors of the two candidate hot words respectively is further described.
Firstly, inputting a first candidate hot word into a medical literature base, acquiring diseases which appear simultaneously with the first candidate hot word in each medical literature, counting the diseases which appear simultaneously with the first candidate hot word and are acquired from each medical literature, and determining the counted number as the number of the diseases which appear simultaneously with the first candidate hot word. And substituting the number of diseases which appear simultaneously with the first candidate hot spot word into the formula to calculate the specificity factor of the first candidate hot spot word. It should be noted that the specificity factor of the second candidate hotspot word is calculated in the same manner as described above.
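The counting procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the formula of this embodiment: the formula itself appears only as a figure, so the form 1 / log(1 + b_i) is an assumption, chosen only because it takes the logarithm of the disease count and decreases as the count grows. The layout of the literature base (each document as a set of diseases and a set of keywords) and the function names are likewise hypothetical.

```python
import math


def count_cooccurring_diseases(hot_word, documents):
    """Return b_i: the number of distinct diseases sharing a document with hot_word."""
    diseases = set()
    for doc in documents:
        if hot_word in doc["keywords"]:
            diseases.update(doc["diseases"])
    return len(diseases)


def specificity_factor(disease_count):
    """ASSUMED form 1 / log(1 + b_i); the original formula is not reproduced here."""
    if disease_count < 1:
        raise ValueError("a candidate hot word must co-occur with at least one disease")
    return 1.0 / math.log(1 + disease_count)


# Hypothetical toy literature base: each document lists its diseases and keywords.
documents = [
    {"diseases": {"lung cancer"}, "keywords": {"surgery", "EGFR"}},
    {"diseases": {"gastric cancer"}, "keywords": {"surgery"}},
    {"diseases": {"liver cancer"}, "keywords": {"surgery"}},
]

b_surgery = count_cooccurring_diseases("surgery", documents)  # co-occurs with 3 diseases
b_egfr = count_cooccurring_diseases("EGFR", documents)        # co-occurs with 1 disease
```

Under this assumed form, a broad word such as "surgery" receives a lower specificity factor than a word that co-occurs with a single disease, which matches the negative correlation described above.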
In step S140, a hot spot trend factor of each candidate hot spot word is calculated according to the change rate of each candidate hot spot word.
In this exemplary embodiment, first, the change rate of each candidate hot word in each publication year may be calculated according to the number of medical documents including each candidate hot word in each publication year and the number of medical documents including each candidate hot word in a previous year in each publication year, and then the hot trend factor of each candidate hot word may be calculated according to the change rate of each candidate hot word in each publication year.
Specifically, the change rate of each candidate hotspot word of each publication year may be calculated in the following two ways.
In a first mode, the change rate of each candidate hot word in each publication year can be calculated through the following formula.
Figure BDA0001861938430000111
where c_ij is the change rate of the i-th candidate hot word in publication year j; Q_ij is the number of medical documents published in publication year j that include the i-th candidate hot word; and Q_i(j-1) is the number of medical documents published in publication year j-1 that include the i-th candidate hot word.
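A year-over-year change rate of this kind can be sketched as follows. Because the formula of the first mode appears only as a figure, the relative-change form (Q_ij - Q_i(j-1)) / Q_i(j-1) used here is an assumption, as are the function name and the choice to skip years that have no usable previous-year count.

```python
def change_rates_relative(counts_by_year):
    """Map publication year j to an ASSUMED relative change rate c_ij.

    counts_by_year maps a publication year to Q_ij, the number of medical
    documents published that year that include the candidate hot word.
    """
    rates = {}
    for year in sorted(counts_by_year):
        previous = counts_by_year.get(year - 1)
        if not previous:
            # No previous-year count (or a zero count): no baseline to divide by.
            continue
        rates[year] = (counts_by_year[year] - previous) / previous
    return rates


# Hypothetical document counts for one candidate hot word.
example_rates = change_rates_relative({2015: 10, 2016: 15, 2017: 12})
```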
In a second mode, the change rate of each candidate hot word in each publication year can be calculated according to the difference between the number of medical documents including each candidate hot word in each publication year and the number of medical documents including each candidate hot word in the previous year of each publication year, and by combining the year coefficient of each publication year.
In the present exemplary embodiment, the year coefficient γ_j is calculated by the same formula as in step S120. Since that calculation has been described in detail above, it is not repeated here.
Based on the year coefficient, a calculation formula for calculating the change rate of each candidate hot word of each publication year is as follows:
Figure BDA0001861938430000121
where c_ij is the change rate of the i-th candidate hot word in publication year j; Q_ij is the number of medical documents published in publication year j that include the i-th candidate hot word; Q_i(j-1) is the number of medical documents published in publication year j-1 that include the i-th candidate hot word; and γ_j is the year coefficient of publication year j.
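The second mode can be sketched in the same way. The formula appears only as a figure, so the product form γ_j * (Q_ij - Q_i(j-1)) is an assumption based on the description of combining the difference with the year coefficient. The year coefficients are passed in by the caller, and the function name is hypothetical.

```python
def change_rates_weighted(counts_by_year, year_coefficients):
    """Map publication year j to an ASSUMED rate gamma_j * (Q_ij - Q_i(j-1))."""
    rates = {}
    for year in sorted(counts_by_year):
        if year - 1 not in counts_by_year or year not in year_coefficients:
            # Skip years without a previous-year count or a year coefficient.
            continue
        difference = counts_by_year[year] - counts_by_year[year - 1]
        rates[year] = year_coefficients[year] * difference
    return rates


# Hypothetical counts and year coefficients for one candidate hot word.
weighted_rates = change_rates_weighted(
    {2016: 10, 2017: 14, 2018: 12},
    {2017: 0.5, 2018: 1.0},
)
```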
After the change rate of each candidate hot word of each publication year is calculated, the hot trend factor of each candidate hot word can be calculated through the following formula:
Figure BDA0001861938430000122
where C_i is the hot trend factor of the i-th candidate hot word; p ≤ j ≤ X, that is, the publication year j ranges over [p, X], where X is the current year and p is the earliest publication year; n is the number of publication years; and c_ij is the change rate of the i-th candidate hot word in publication year j.
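One way to aggregate the yearly change rates is sketched below. The aggregation formula appears only as a figure, so the arithmetic mean over the n publication years in [p, X] is an assumption suggested by the symbol definitions, and treating a year with no computed rate as 0 is a further assumption. The function and parameter names are hypothetical.

```python
def hot_trend_factor(change_rates, first_year, current_year):
    """ASSUMED form: mean of c_ij over publication years j in [p, X]."""
    if current_year < first_year:
        raise ValueError("current_year must not be earlier than first_year")
    years = range(first_year, current_year + 1)
    n = len(years)  # number of publication years in [p, X]
    # Years without a computed change rate contribute 0 (an assumption).
    return sum(change_rates.get(year, 0.0) for year in years) / n


# Hypothetical change rates for one candidate hot word.
example_trend = hot_trend_factor({2016: 0.5, 2017: -0.2, 2018: 0.3}, 2016, 2018)
```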
In step S150, a target hot word of the disease to be studied is determined among the candidate hot words according to a co-occurrence factor of the disease to be studied and each of the candidate hot words, the specificity factor of each of the candidate hot words, and the hot trend factor.
In this exemplary embodiment, a hot score of each candidate hot word may be first calculated according to a co-occurrence factor of the disease to be researched and each candidate hot word, the specificity factor of each candidate hot word, and the hot trend factor, and then a target hot word of the disease to be researched is determined in a plurality of candidate hot words according to the hot score of each candidate hot word.
Specifically, the hot score of each candidate hot word may be calculated by the following formula:
goal(i) = 0.5 * A_i * B_i + 0.5 * C_i
where goal(i) is the hot score of the i-th candidate hot word, A_i is the co-occurrence factor of the disease to be researched and the i-th candidate hot word, B_i is the specificity factor of the i-th candidate hot word, and C_i is the hot trend factor of the i-th candidate hot word.
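The score formula translates directly into code. In the sketch below only the formula itself comes from the description; the function name and the sample factor values are hypothetical.

```python
def hot_score(co_occurrence, specificity, trend):
    """goal(i) = 0.5 * A_i * B_i + 0.5 * C_i."""
    return 0.5 * co_occurrence * specificity + 0.5 * trend


# Hypothetical factor values (A_i, B_i, C_i) for two candidate hot words.
score_broad = hot_score(co_occurrence=0.9, specificity=0.1, trend=0.0)
score_specific = hot_score(co_occurrence=0.6, specificity=0.9, trend=0.4)
```

With these sample values, the broad word scores lower than the specific, rising word even though its co-occurrence factor is higher, because the co-occurrence factor is multiplied by the specificity factor.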
After the hot score of each candidate hot word is calculated, the candidate hot words can be ranked according to the sequence of the hot scores from high to low, and the candidate hot word ranked at the first position (i.e. the candidate hot word with the highest hot score) is determined as the target hot word; the candidate hot words ranked in the top 3 positions can also be determined as target hot words of the disease to be researched; candidate hot words with a hot score larger than a preset score may also be determined as target hot words of the disease to be studied, where the preset score may be set by a researcher, and this exemplary embodiment is not particularly limited thereto.
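The three selection strategies described above (the first-ranked word, the top 3 words, or all words above a preset score) can be sketched as follows. The function names and the sample scores are hypothetical.

```python
def rank_hot_words(scores):
    """Return candidate hot words ordered from highest to lowest hot score."""
    return sorted(scores, key=scores.get, reverse=True)


def select_top(scores, k=1):
    """Return the k highest-scoring candidate hot words."""
    return rank_hot_words(scores)[:k]


def select_above(scores, preset_score):
    """Return candidate hot words whose hot score exceeds preset_score."""
    return [word for word in rank_hot_words(scores) if scores[word] > preset_score]


# Hypothetical hot scores for four candidate hot words.
scores = {"immunotherapy": 0.9, "surgery": 0.2, "EGFR": 0.6, "chemotherapy": 0.4}

first = select_top(scores)            # first-ranked word only
top_three = select_top(scores, k=3)   # top 3 words
above = select_above(scores, 0.5)     # words scoring above the preset score
```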
In summary, compared with the prior art, the target hot word of the disease to be researched is determined by considering the co-occurrence factor of the disease to be researched and each candidate hot word, the specificity factor of each candidate hot word, and the hot trend factor of each candidate hot word, rather than only the occurrence frequency of each candidate hot word. This improves the accuracy of determining the target hot word of the disease to be researched. The improved accuracy in turn points researchers toward a correct research direction, which reduces research cost and shortens research time.
It should be noted that although the various steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that these steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
In an exemplary embodiment of the present disclosure, there is also provided a disease research hotspot mining apparatus as shown in fig. 2, wherein the disease research hotspot mining apparatus 200 may include: an obtaining module 201, a first calculating module 202, a second calculating module 203, a third calculating module 204, and a determining module 205, wherein:
the acquiring module 201 may be configured to acquire a disease to be researched and a plurality of candidate hot words corresponding to the disease to be researched;
the first calculating module 202 may be configured to calculate, according to the number of medical documents in which the disease to be researched and the candidate hot word occur at the same time, a co-occurrence factor between the disease to be researched and each candidate hot word;
the second calculating module 203 may be configured to calculate specificity factors of the candidate hot words according to the number of diseases occurring simultaneously with the candidate hot words;
the third calculating module 204 may be configured to calculate a hot spot trend factor of each candidate hot spot word according to a change rate of each candidate hot spot word;
the determining module 205 may be configured to determine a target hot word of the disease to be researched from the plurality of candidate hot words according to a co-occurrence factor of the disease to be researched and each candidate hot word, the specificity factor of each candidate hot word, and the hot trend factor.
The specific details of each module of the disease research hotspot mining apparatus have already been described in detail in the corresponding disease research hotspot mining method, and are therefore not repeated here.
It should be noted that although several modules or units of the apparatus for performing actions are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
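The division into modules 201 to 205 can be pictured with the following sketch, in which each module is a replaceable callable. This is only an illustration of the architecture: the class name, field names, and stub data are hypothetical, and the determining step simply returns the highest-scoring candidate hot word.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class HotspotMiningApparatus:
    acquire: Callable[[str], List[str]]          # role of acquiring module 201
    co_occurrence: Callable[[str, str], float]   # role of first calculating module 202
    specificity: Callable[[str], float]          # role of second calculating module 203
    trend: Callable[[str], float]                # role of third calculating module 204

    def determine(self, disease: str) -> str:
        """Role of determining module 205: score candidates and pick the best one."""
        scores: Dict[str, float] = {}
        for word in self.acquire(disease):
            a = self.co_occurrence(disease, word)
            b = self.specificity(word)
            c = self.trend(word)
            scores[word] = 0.5 * a * b + 0.5 * c
        if not scores:
            raise ValueError("no candidate hot words for " + disease)
        return max(scores, key=scores.get)


# Hypothetical stub data standing in for real literature statistics.
A = {"surgery": 0.9, "EGFR": 0.6}
B = {"surgery": 0.1, "EGFR": 0.9}
C = {"surgery": 0.0, "EGFR": 0.4}

apparatus = HotspotMiningApparatus(
    acquire=lambda disease: ["surgery", "EGFR"],
    co_occurrence=lambda disease, word: A[word],
    specificity=lambda word: B[word],
    trend=lambda word: C[word],
)
target = apparatus.determine("lung cancer")
```

Passing the modules in as callables reflects the point that the division into modules is not mandatory: two of them could be backed by the same object, or one could be split further, without changing the determining step.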
In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, all of which may generally be referred to herein as a "circuit," "module," or "system."
An electronic device 300 according to this embodiment of the invention is described below with reference to fig. 3. The electronic device 300 shown in fig. 3 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 3, electronic device 300 is embodied in the form of a general purpose computing device. The components of electronic device 300 may include, but are not limited to: at least one processing unit 310, at least one storage unit 320, a bus 330 connecting different system components (including the storage unit 320 and the processing unit 310), and a display unit 340.
Wherein the storage unit stores program code that is executable by the processing unit 310 to cause the processing unit 310 to perform steps according to various exemplary embodiments of the present invention as described in the above section "exemplary methods" of the present specification. For example, the processing unit 310 may execute step S110 shown in fig. 1, obtain a disease to be researched and a plurality of candidate hot words corresponding to the disease to be researched; step S120, respectively calculating the co-occurrence factors of the disease to be researched and each candidate hot word according to the quantity of the medical literature in which the disease to be researched and the candidate hot word simultaneously appear; step S130, calculating specificity factors of the candidate hot words according to the number of diseases which appear simultaneously with the candidate hot words; step S140, calculating a hot spot trend factor of each candidate hot spot word according to the change rate of each candidate hot spot word; step S150, determining a target hot word of the disease to be researched from the plurality of candidate hot words according to the co-occurrence factor of the disease to be researched and each candidate hot word, the specificity factor of each candidate hot word and the hot trend factor.
The storage unit 320 may include readable media in the form of volatile storage units, such as a random access memory unit (RAM) 3201 and/or a cache memory unit 3202, and may further include a read only memory unit (ROM) 3203.
The storage unit 320 may also include a program/utility 3204 having a set (at least one) of program modules 3205, such program modules 3205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 330 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 300 may also communicate with one or more external devices 370 (e.g., keyboard, pointing device, Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 300, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 300 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 350. Also, the electronic device 300 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet) via the network adapter 360. As shown, network adapter 360 communicates with the other modules of electronic device 300 via bus 330. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with electronic device 300, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above section "exemplary methods" of the present description, when said program product is run on the terminal device.
Referring to fig. 4, a program product 400 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).
Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims (11)

1. A disease research hotspot mining method is characterized by comprising the following steps:
acquiring a disease to be researched and a plurality of candidate hot words corresponding to the disease to be researched;
respectively calculating the co-occurrence factors of the disease to be researched and each candidate hot word according to the quantity of the medical literature in which the disease to be researched and the candidate hot word simultaneously appear;
respectively taking the logarithm of the number of diseases which simultaneously appear with each candidate hot word to obtain the specificity factor of each candidate hot word;
calculating the change rate of each candidate hot word according to the medical literature quantity of each candidate hot word and each publication year, and calculating the hot trend factor of each candidate hot word according to the change rate of each candidate hot word of each publication year;
calculating the hot score of each candidate hot word according to the co-occurrence factor of the disease to be researched and each candidate hot word, the specificity factor of each candidate hot word and the hot trend factor, and determining the target hot word of the disease to be researched in the plurality of candidate hot words according to the hot scores.
2. The disease research hotspot mining method of claim 1, wherein the calculating the co-occurrence factors of the disease to be researched and each candidate hotspot word respectively according to the number of medical documents in which the disease to be researched and each candidate hotspot word occur simultaneously comprises:
calculating co-occurrence factors of the disease to be researched and each candidate hot word in each publication year according to the number of medical documents in which the disease to be researched and the candidate hot word simultaneously appear in each publication year and the year coefficient of each publication year;
and calculating the co-occurrence factors of the disease to be researched and each candidate hot word according to the co-occurrence factors of the disease to be researched and each candidate hot word in each publication year.
3. The disease research hotspot mining method of claim 1, wherein the respectively logarithmizing the number of diseases occurring simultaneously with each candidate hotspot word to obtain the specificity factor of each candidate hotspot word comprises:
respectively acquiring the number of diseases which appear simultaneously with each candidate hot word;
respectively taking the logarithm of the number of diseases which simultaneously appear with each candidate hot word to obtain the specificity factor of each candidate hot word.
4. The disease research hotspot mining method of claim 1, wherein the calculating the change rate of each candidate hotspot word according to the medical literature quantity and each publication year of each candidate hotspot word, and the calculating the hotspot trend factor of each candidate hotspot word according to the change rate of each candidate hotspot word in each publication year comprises:
calculating the change rate of each candidate hot word in each publication year according to the quantity of the medical documents including each candidate hot word in each publication year and the quantity of the medical documents including each candidate hot word in the previous year of each publication year;
and calculating the hot spot trend factor of each candidate hot spot word according to the change rate of each candidate hot spot word of each publication year.
5. The disease research hotspot mining method of claim 4, wherein calculating the change rate of each candidate hotspot word in each publication year according to the quantity of medical documents including each candidate hotspot word in each publication year and the quantity of medical documents including each candidate hotspot word in a year preceding each publication year comprises:
calculating the change rate of each candidate hot word in each publication year according to the difference between the quantity of the medical documents including each candidate hot word in each publication year and the quantity of the medical documents including each candidate hot word in the previous year of each publication year and the year coefficient of each publication year.
6. The disease research hotspot mining method of claim 1, wherein the obtaining of a disease to be researched and a plurality of candidate hotspot words corresponding to the disease to be researched comprises:
and acquiring the disease to be researched, and determining a plurality of keywords which appear simultaneously with the disease to be researched in each medical document as a plurality of candidate hot words.
7. The disease research hotspot mining method of claim 1, wherein the calculating of the hotspot score of each candidate hotspot word according to the co-occurrence factor of the disease to be researched and each candidate hotspot word, the specificity factor of each candidate hotspot word and the hotspot trend factor comprises:
calculating the hot score of each candidate hot word through the following formula:
goal(i) = 0.5 * A_i * B_i + 0.5 * C_i
where goal(i) is the hot score of the i-th candidate hot word, A_i is the co-occurrence factor of the disease to be researched and the i-th candidate hot word, B_i is the specificity factor of the i-th candidate hot word, and C_i is the hot trend factor of the i-th candidate hot word.
8. The disease research hotspot mining method of claim 1, wherein determining the target hotspot word of the disease to be researched among the plurality of candidate hotspot words according to the hotspot score of each candidate hotspot word comprises:
and sequencing the candidate hot words according to the sequence of the hot scores from high to low, and determining the candidate hot word ranked at the first position as a target hot word.
9. A disease research hotspot mining device is characterized by comprising:
the acquisition module is used for acquiring a disease to be researched and a plurality of candidate hot words corresponding to the disease to be researched;
the first calculation module is used for calculating the co-occurrence factor of the disease to be researched and each candidate hot word according to the quantity of medical documents in which the disease to be researched and the candidate hot words simultaneously appear;
the second calculation module is used for respectively taking logarithm of the number of diseases which simultaneously appear with each candidate hot word so as to obtain specificity factors of each candidate hot word;
the third calculation module is used for calculating the change rate of each candidate hot word according to the medical literature quantity of each candidate hot word and each publication year, and calculating the hot trend factor of each candidate hot word according to the change rate of each candidate hot word of each publication year;
the determining module is used for calculating the hot score of each candidate hot word according to the co-occurrence factor of the disease to be researched and each candidate hot word, the specificity factor of each candidate hot word and the hot trend factor, and determining the target hot word of the disease to be researched in the plurality of candidate hot words according to the hot scores.
10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the disease research hotspot mining method of any one of claims 1 to 8.
11. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the disease research hotspot mining method of any one of claims 1-8 via execution of the executable instructions.
CN201811338754.2A 2018-11-12 2018-11-12 Disease research hotspot mining method and device, storage medium and electronic equipment Active CN109493978B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811338754.2A CN109493978B (en) 2018-11-12 2018-11-12 Disease research hotspot mining method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811338754.2A CN109493978B (en) 2018-11-12 2018-11-12 Disease research hotspot mining method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN109493978A CN109493978A (en) 2019-03-19
CN109493978B true CN109493978B (en) 2021-05-25

Family

ID=65695490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811338754.2A Active CN109493978B (en) 2018-11-12 2018-11-12 Disease research hotspot mining method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN109493978B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1744080A (en) * 2005-09-27 2006-03-08 南方医科大学 Specific function-related gene information searching system and method for building database of searching workds thereof
CN104636424A (en) * 2014-12-02 2015-05-20 南昌大学 Method for building literature review framework based on atlas analysis
CN105740229A (en) * 2016-01-26 2016-07-06 中国人民解放军国防科学技术大学 Keyword extraction method and device
JP6033136B2 (en) * 2013-03-18 2016-11-30 三菱電機株式会社 Information processing apparatus and navigation apparatus
CN107609017A (en) * 2017-08-04 2018-01-19 陈剑辉 The method and system of medical industry intelligent search consulting are realized by self-defined hot word
CN108090157A (en) * 2017-12-12 2018-05-29 百度在线网络技术(北京)有限公司 A kind of hot news method for digging, device and server
CN108304371A (en) * 2017-07-14 2018-07-20 腾讯科技(深圳)有限公司 Method, apparatus, computer equipment and the storage medium that Hot Contents excavate

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"Mining a Clinical Data Warehous to Discover Disease-finding Associations Using Co-occurrence Statistics";Cao H等;《Amia.annual Symposium Proceedings》;20050228;第106页 *
"关键词共现研究趋势分析"; 郭树行; 《科技资讯》; 2012-04-21 (No. 32); pp. 204-205 *
"词频变化率模型视域下美国情报学研究发展动向分析"; 周鑫 et al.; 《情报科学》; 2017-04-30 (No. 04); pp. 169-175 *
"近10年非酒精性脂肪性肝病研究热点供词聚类分析"; 张桐硕 et al.; 《实用肝脏病杂志》; 2014-10-31; Vol. 17 (No. 05); pp. 470-474 *

Also Published As

Publication number Publication date
CN109493978A (en) 2019-03-19

Similar Documents

Publication Publication Date Title
CN107799160B (en) Medication aid decision-making method and device, storage medium and electronic equipment
CN111863170A (en) Method, device and system for generating electronic medical record information
CN109599153B (en) Medical data tracking method and device, storage medium and electronic equipment
US11232267B2 (en) Proximity information retrieval boost method for medical knowledge question answering systems
CN109542966B (en) Data fusion method and device, electronic equipment and computer readable medium
JP7285977B2 (en) Neural network training methods, devices, electronics, media and program products
CN111143556A (en) Software function point automatic counting method, device, medium and electronic equipment
CN110874364B (en) Query statement processing method, device, equipment and storage medium
CN114625923A (en) Training method of video retrieval model, video retrieval method, device and equipment
CN111061835B (en) Query method and device, electronic equipment and computer readable storage medium
CN109597989B (en) Diagnostic word normalization method and device, storage medium and electronic equipment
CN113177154A (en) Search term recommendation method and device, electronic equipment and storage medium
CN109585024B (en) Data mining method and device, storage medium and electronic equipment
CN112507075A (en) Case data searching method, system, equipment and storage medium
CN109493978B (en) Disease research hotspot mining method and device, storage medium and electronic equipment
CN111062193A (en) Medical data labeling method and device, storage medium and electronic equipment
CN111063445A (en) Feature extraction method, device, equipment and medium based on medical data
CN113688202B (en) Emotion polarity analysis method and device, electronic equipment and computer storage medium
CN111723134A (en) Information processing method, information processing device, electronic equipment and storage medium
CN111639173B (en) Epidemic situation data processing method, device, equipment and storage medium
CN109783745B (en) Method, device and computer equipment for personalized typesetting of pages
CN109597847B (en) Medical data retrogradation method and device, storage medium and electronic terminal
CN114201729A (en) Method, device and equipment for selecting matrix operation mode and storage medium
CN109885475B (en) Page conversion rate calculation method, device, computer equipment and storage medium
US20070156775A1 (en) Metadata transformation in copy and paste scenarios between heterogeneous applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant