CN113724884A - Disease symptom and weight knowledge acquisition and processing method based on disease case base - Google Patents

Disease symptom and weight knowledge acquisition and processing method based on disease case base

Info

Publication number
CN113724884A
Authority
CN
China
Prior art keywords
disease
symptom
weight
symptoms
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111031558.2A
Other languages
Chinese (zh)
Inventor
金芝
李戈
陆军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN202111031558.2A priority Critical patent/CN113724884A/en
Publication of CN113724884A publication Critical patent/CN113724884A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Data Mining & Analysis (AREA)
  • Primary Health Care (AREA)
  • Epidemiology (AREA)
  • Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention relates to a method for acquiring and processing knowledge of disease symptoms and their weights based on a disease case library. The method takes massive disease case libraries on the Internet as the information source and automatically acquires disease symptoms and their weights by processing the raw data from that source. The method comprises the following steps: matching HTML tags with regular expressions and obtaining raw disease symptom data through a web crawler strategy; performing word similarity calculation and synonym recognition to obtain a medical word similarity table and a medical term synonym table; and performing classification, TF-IDF word frequency statistics and dimensionless processing to obtain a plurality of parameters, such as disease symptoms and their weights, which are used for an overall evaluation of the relation between diseases and symptoms. The technical scheme saves a large amount of manpower, financial resources and time; the resulting disease symptoms and weights are more reasonable; and the method is suitable for medical guidance systems, Internet-based disease self-diagnosis systems and similar scenarios.

Description

Disease symptom and weight knowledge acquisition and processing method based on disease case base
This application is a divisional application of the patent application entitled "Method for acquiring and processing disease symptoms and weight knowledge based on a disease case library", filed on September 21, 2016, with application number 201610836533.2.
Technical Field
The invention relates to an internet data acquisition and processing method, in particular to a disease symptom and weight knowledge acquisition and processing method based on a disease case base.
Background
Symptoms are the subjective abnormal sensations of a patient, or objective pathological changes, caused by a series of abnormal changes in the function, metabolism and morphological structure of the body during the course of a disease. Asking about symptoms is the doctor's first step in investigating a patient's disease, is the main content of the inquiry, and provides important clues and the main basis for diagnosing and differentiating diseases.
In disease self-diagnosis and medical guidance expert systems, patient information generally cannot be obtained through professional auxiliary examination equipment, and a preliminary diagnosis can only rely on the patient's symptoms, so a knowledge base relating diseases to symptoms must be constructed. Traditionally, during system development, knowledge engineers build the disease symptom and weight knowledge base by obtaining the relevant knowledge from domain experts or technical literature. This approach depends heavily on experience, consumes considerable manpower and money, takes a long time, and is a bottleneck in system development.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a disease symptom and weight knowledge acquiring and processing method based on a disease case base, which automatically acquires the disease symptom and the weight knowledge by processing the original data of an information source and provides a medical knowledge base for the auxiliary diagnosis of diseases.
The principle of the invention is as follows: different symptoms play different roles (carry different weights) in the diagnostic criteria of a disease. For example, in the diagnostic criteria for stroke, symptoms such as hemiplegia, facial distortion, slurred speech, headache and dizziness differ in importance: if a patient has hemiplegia, facial distortion and slurred speech, the possibility of stroke is high, but this is not the case if the patient only has headache, dizziness and the like. Determining symptom weights quantitatively and scientifically is therefore very important for establishing disease diagnosis criteria. The method mainly uses massive disease case libraries on the Internet as the information source, matches HTML tags with regular expressions, obtains raw disease symptom data through a web crawler strategy, and then processes the raw data through word similarity calculation, synonym identification and matching, classification, TF-IDF word frequency statistics and dimensionless treatment to obtain medical knowledge of disease symptoms and their weights.
In order to achieve the purpose, the invention provides the following scheme:
a disease symptom and weight knowledge obtaining and processing method based on a disease case library takes a mass of disease case libraries on the Internet as an information source, and automatically obtains the disease symptom and weight knowledge thereof by processing original data of the information source; the method comprises the following steps:
1) acquiring disease symptom original data comprising a disease name and corresponding symptom information;
2) performing word similarity calculation on the original data to obtain a medical word similarity table; carrying out synonym manual identification on the medical term similarity table to obtain a medical term synonym table; specifically, a single Chinese character literal similarity calculation method based on gravity center backward shift is adopted to calculate and obtain a medical word similarity table; the single Chinese character literal similarity algorithm based on gravity center backward shift is as follows:
let the similarity of the words $w_1$ and $w_2$ be $\mathrm{sim}(w_1, w_2)$; $|w_1|$ and $|w_2|$ respectively represent the number of characters contained in $w_1$ and $w_2$; $\mathrm{Same}(w_1, w_2)$ denotes the set of morphemes contained in both $w_1$ and $w_2$, and $|\mathrm{Same}(w_1, w_2)|$ represents the number of the same morphemes; $w_1(i)$ denotes the $i$-th morpheme in $w_1$, and $\mathrm{Weight}(w_1, i)$ represents the weight of the $i$-th morpheme in $w_1$: if $w_1(i) \in \mathrm{Same}(w_1, w_2)$, then $\mathrm{Weight}(w_1, i) = i$, otherwise $\mathrm{Weight}(w_1, i) = 0$;
$$\sum_{i=1}^{|w_1|} i$$
denotes the sum of the position weights of all morphemes in $w_1$; $w_2(j)$ is treated in the same way as $w_1(i)$; the position coefficient $d$ is taken as the smaller of the ratios $|w_1| / |w_2|$ and $|w_2| / |w_1|$, i.e.:
$$d = \min\left(\frac{|w_1|}{|w_2|}, \frac{|w_2|}{|w_1|}\right)$$
there are two factors that affect word similarity: the number of the same morphemes contained in the two words and the position weight of the same morphemes in each word. The word similarity can then be calculated according to the following formula:
$$\mathrm{sim}(w_1, w_2) = \alpha \cdot \frac{1}{2}\left(\frac{|\mathrm{Same}(w_1, w_2)|}{|w_1|} + \frac{|\mathrm{Same}(w_1, w_2)|}{|w_2|}\right) + \beta \cdot d \cdot \frac{1}{2}\left(\frac{\sum_{i=1}^{|w_1|} \mathrm{Weight}(w_1, i)}{\sum_{i=1}^{|w_1|} i} + \frac{\sum_{j=1}^{|w_2|} \mathrm{Weight}(w_2, j)}{\sum_{j=1}^{|w_2|} j}\right)$$
$\alpha$ and $\beta$ respectively represent the weight coefficients of the similarity in the number of the same morphemes and the similarity in the positions of the same morphemes, and satisfy $\alpha + \beta = 1$;
3) classifying and counting the original data to obtain the corresponding relation and distribution condition of the disease name and the symptoms;
4) obtaining the weight of each symptom in the disease;
5) carrying out dimensionless treatment; specifically, the sum of the weights of symptoms in the disease is used as a basic measurement unit, and the weights of the symptoms in the disease are subjected to non-dimensionalization treatment;
thereby obtaining a plurality of parameters of the disease symptoms, including: the frequency of a symptom appearing in a disease, the probability of a symptom appearing in a disease set, the weight of a symptom in a disease before non-dimensionalization, and the weight of a symptom in a disease after non-dimensionalization are used for overall evaluation of the relationship between a disease and a symptom.
Preferably, step 1) is to perform label matching by analyzing html labels of web page source codes of disease cases on the internet and adopting a regular expression, and to obtain original data of disease symptoms by a web crawler strategy.
Preferably, step 2) obtains the medical term synonym table by manual screening recognition.
Preferably, the medical term synonym table is also perfected according to a domain expert recognition method.
Preferably, step 4) adopts a TF-IDF word frequency statistical model based on text mining to calculate and obtain the weight of symptoms in the disease.
Preferably, the mathematical formula of the TF-IDF word frequency statistical model is as follows: W = TF × IDF = (i/m) × log(N/n); wherein TF represents the frequency of a symptom occurring in a disease, obtained by dividing the number i of occurrences of the symptom in the disease by the total number m of occurrences of all symptoms in the disease; IDF represents the probability of a symptom appearing in a disease set, and is obtained by dividing the number N of diseases in the disease set by the number n of diseases containing the symptom and taking the logarithm of the obtained quotient; the weight of the symptom is expressed by the product of TF and IDF.
Preferably, the mathematical formula of the TF-IDF word frequency statistical model is as follows: W = TF × IDF = (i/m) × log(N/(n + 0.1)); wherein TF represents the frequency of a symptom occurring in a disease, obtained by dividing the number i of occurrences of the symptom in the disease by the total number m of occurrences of all symptoms in the disease; IDF represents the probability of a symptom appearing in a disease set, and is obtained by dividing the number N of diseases in the disease set by the number n of diseases containing the symptom and taking the logarithm of the obtained quotient; the weight of the symptom is expressed by the product of TF and IDF.
Preferably, the formula for non-dimensionalizing the weights of the symptoms in the disease, using the sum of the weights of the symptoms in the disease as the basic unit of measure, is as follows:
$$W_i = \frac{w_i}{\sum_{j} w_j}$$
wherein $w_i$ represents the weight of symptom $i$ in the disease before dimensionless treatment, $\sum_{j} w_j$ represents the sum of the weights of the symptoms of the disease before dimensionless treatment, and $W_i$ represents the weight of symptom $i$ in the disease after dimensionless treatment.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
The invention provides a disease symptom and weight knowledge acquisition and processing method based on a disease case library. The technical scheme saves a large amount of manpower, financial resources and time, and the symptom weights obtained quantitatively, by applying statistical and related methods to massive real cases, are more reasonable than empirical symptom weights obtained from domain experts. The application also specifically discloses the single Chinese character literal similarity calculation method based on gravity center backward shift that is used to calculate the medical word similarity table. The resulting data can further be applied in the following two ways:
First, as the knowledge base of a medical guidance system, which guides the patient to the appropriate department for an accurate diagnosis once a preliminary diagnosis has been obtained;
Second, as the knowledge base of an Internet-based disease self-diagnosis system, whose target users are ordinary residents rather than doctors; the system lets patients make a preliminary diagnosis from their own symptom information so that they can learn about the relevant condition in advance, for reference.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a block flow diagram of a method for acquiring and processing disease symptoms and their weight knowledge based on a case base in an embodiment of the present invention;
FIG. 2 is an example of a partial medical term similarity table and an example of a partial medical term synonym table in an embodiment of the present disclosure;
FIG. 3 is a parameter set of symptoms in an example of type 2 diabetes mellitus in accordance with an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a disease symptom and weight knowledge acquiring and processing method based on a disease case library.
The method mainly uses a massive disease case library on the Internet as an information source, adopts a regular expression to carry out HTML label matching, obtains disease symptom original data through a web crawler strategy, and then obtains disease symptoms and weight medical knowledge thereof after processing the original data through word similarity calculation, synonym identification and matching, classification, TF-IDF word frequency statistics, dimensionless treatment and the like. This saves a lot of manpower, financial resources and time, and the weight of disease symptoms obtained quantitatively by using methods such as statistics on a large number of real cases is more reasonable than the weight of empirical disease symptoms obtained from field experts.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a flow chart of a method for acquiring and processing disease symptoms and weight knowledge thereof based on a case base in an embodiment of the present invention. As shown in Fig. 1, the present embodiment includes the following steps:
1) disease symptom raw data acquisition
In this embodiment, the HTML tags in the web page source of community-acquired pneumonia cases on an Internet website are analyzed, regular expressions are used for tag matching, and the raw disease symptom data are obtained through a web crawler strategy. Some of the raw disease symptom data in this example are as follows:
Type 2 diabetes: polydipsia, polyphagia and weight loss
Hypertension: hypertension, dizziness and chest distress
Coronary heart disease: oppression and pain in the precordial region and short breath
Hypertension: hypertension and short breath
Hypertension: hypertension, obesity, asthenia
Tuberculosis: low fever and night sweat
Diabetic peripheral neuritis: numbness of extremities and pain of limbs
Type 2 diabetes: frequent urination, eating and urination
Chronic obstructive pulmonary disease: cough, white phlegm
Viral myocarditis: fever and sore throat
Chronic obstructive pulmonary disease: cough and asthma
Type 2 diabetes: frequent urination, eating and urination
Hyperthyroidism: aversion to heat, profuse sweating and emaciation
Systemic lupus erythematosus: facial butterfly erythema and arthralgia
Rheumatoid arthritis: swelling and stiffness of joints
Type 1 diabetes mellitus: frequent urination and frequent urination
Gout: uric acid increase and joint swelling
Iron deficiency anemia: dizziness and fatigue
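The acquisition step above can be illustrated with a minimal sketch. It assumes a hypothetical page layout: the tag names, the `disease` and `symptoms` class names, and the sample HTML are invented for illustration and do not describe any real website. Fetching pages over the network is omitted; only the regular-expression tag matching that yields "disease name: symptoms" records is shown.

```python
import re

# Hypothetical page fragment; tag and class names are illustrative assumptions.
SAMPLE_HTML = """
<div class="case"><h3 class="disease">Hypertension</h3><p class="symptoms">dizziness, chest distress</p></div>
<div class="case"><h3 class="disease">Tuberculosis</h3><p class="symptoms">low fever, night sweat</p></div>
"""

# One match per case: the disease name followed by its symptom text.
CASE_PATTERN = re.compile(
    r'<h3 class="disease">(?P<disease>.*?)</h3>\s*'
    r'<p class="symptoms">(?P<symptoms>.*?)</p>',
    re.S,
)

def extract_cases(html):
    """Return a list of (disease, [symptom, ...]) pairs matched in the page source."""
    cases = []
    for m in CASE_PATTERN.finditer(html):
        # Split the symptom text on ASCII or Chinese list separators.
        symptoms = [s.strip() for s in re.split(r"[,，、]", m.group("symptoms")) if s.strip()]
        cases.append((m.group("disease").strip(), symptoms))
    return cases

records = extract_cases(SAMPLE_HTML)
```

For pages with irregular markup an HTML parser is usually more robust than regular expressions; the sketch uses them only because the text names regular-expression tag matching as the technique.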
2) Disease symptom raw data processing
21) Performing word similarity calculation and synonym recognition on original data
211) Calculating similarity of medical terms by using single Chinese character literal similarity calculation method based on gravity center backward shift
A considerable part of the medical words in the raw data acquired from the Internet have the same or similar meanings and are synonyms or near-synonyms. Analysis shows that medical terms sharing some of the same Chinese characters tend to be similar both literally and in meaning, for example abdominal discomfort, epigastric discomfort, chest pain, chronic obstructive pulmonary disease, chronic pulmonary embolism and choledocholithiasis. The similarity of medical terms is therefore calculated with the single Chinese character literal similarity calculation method based on gravity center backward shift.
The single Chinese character literal similarity algorithm based on the gravity center backward shift is described as follows:
let the similarity of the words $w_1$ and $w_2$ be $\mathrm{sim}(w_1, w_2)$; $|w_1|$ and $|w_2|$ respectively represent the number of characters contained in $w_1$ and $w_2$; $\mathrm{Same}(w_1, w_2)$ denotes the set of morphemes contained in both $w_1$ and $w_2$, and $|\mathrm{Same}(w_1, w_2)|$ represents the number of the same morphemes; $w_1(i)$ denotes the $i$-th morpheme in $w_1$, and $\mathrm{Weight}(w_1, i)$ represents the weight of the $i$-th morpheme in $w_1$: if $w_1(i) \in \mathrm{Same}(w_1, w_2)$, then $\mathrm{Weight}(w_1, i) = i$, otherwise $\mathrm{Weight}(w_1, i) = 0$;
$$\sum_{i=1}^{|w_1|} i$$
denotes the sum of the position weights of all morphemes in $w_1$; $w_2(j)$ is treated in the same way as $w_1(i)$; the position coefficient $d$ is taken as the smaller of the ratios $|w_1| / |w_2|$ and $|w_2| / |w_1|$, i.e.:
$$d = \min\left(\frac{|w_1|}{|w_2|}, \frac{|w_2|}{|w_1|}\right)$$
there are two factors that affect word similarity: the number of the same morphemes contained in the two words and the position weight of the same morphemes in each word. The word similarity can then be calculated according to the following formula:
$$\mathrm{sim}(w_1, w_2) = \alpha \cdot \frac{1}{2}\left(\frac{|\mathrm{Same}(w_1, w_2)|}{|w_1|} + \frac{|\mathrm{Same}(w_1, w_2)|}{|w_2|}\right) + \beta \cdot d \cdot \frac{1}{2}\left(\frac{\sum_{i=1}^{|w_1|} \mathrm{Weight}(w_1, i)}{\sum_{i=1}^{|w_1|} i} + \frac{\sum_{j=1}^{|w_2|} \mathrm{Weight}(w_2, j)}{\sum_{j=1}^{|w_2|} j}\right)$$
in the above formula, $\alpha$ and $\beta$ represent the weight coefficients of the similarity in the number of the same morphemes and the similarity in the positions of the same morphemes, respectively, and $\alpha + \beta = 1$.
In this example, α = 0.4 and β = 0.6. The similarity between "abdominal discomfort" and "upper abdominal discomfort" is 0.81, the similarity between two variant terms for "chest pain" is 0.525, the similarity between "chronic obstructive pulmonary disease" and "chronic pulmonary embolism" is 0.4652, and the similarity between "common bile duct stone" and "common bile duct lower stone" is 0.703.
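The similarity calculation described above can be sketched in Python as follows. This is an illustrative reading of the definitions, not the applicant's implementation: the function name `literal_similarity` is hypothetical, each character of a string is treated as one morpheme, and a character that repeats within a word is counted at every position where it occurs, which the text does not specify. Latin letters stand in for individual Chinese characters in the toy inputs.

```python
def literal_similarity(w1, w2, alpha=0.4, beta=0.6):
    """Character-level literal similarity; position weights grow toward the end of the word."""
    if not w1 or not w2:
        return 0.0
    same = set(w1) & set(w2)          # Same(w1, w2): characters shared by both words
    if not same:
        return 0.0                    # no shared character gives similarity 0

    def count_ratio(w):
        return len(same) / len(w)     # |Same(w1, w2)| / |w|

    def position_ratio(w):
        # Weight(w, i) = i (1-based position) if the i-th character is shared, else 0
        shared = sum(i for i, ch in enumerate(w, start=1) if ch in same)
        total = len(w) * (len(w) + 1) // 2   # 1 + 2 + ... + |w|
        return shared / total

    d = min(len(w1) / len(w2), len(w2) / len(w1))   # position coefficient
    count_part = (count_ratio(w1) + count_ratio(w2)) / 2
    position_part = (position_ratio(w1) + position_ratio(w2)) / 2
    return alpha * count_part + beta * d * position_part

# Toy inputs: four-character words sharing three characters, and a two-character
# word whose characters open and close a four-character word.
s1 = round(literal_similarity("ABCD", "XACD"), 4)   # 0.81
s2 = round(literal_similarity("AB", "AXYB"), 4)     # 0.525
```

Because the position weight equals the character index, shared characters near the end of a word contribute more than shared characters near the start, which is the "gravity center backward shift" the method is named for.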
212) Medical term synonym recognition
In the field of information retrieval, the concept of synonyms is not equal to that of linguistics and daily life, and the synonyms do not consider emotional colors and moods, and refer to one or more words capable of mutually replacing and expressing the same or similar concepts.
A threshold is set for sim(w1, w2); the single Chinese character literal similarity algorithm based on gravity center backward shift is applied to the acquired raw data to obtain the medical word similarity table; synonyms are then identified by manual screening and stored in the synonym table. The algorithm has a limitation: some terms have the same or similar meanings, such as variant expressions for high fever or for diarrhea, but share no Chinese characters, so the algorithm gives them a similarity of 0. The synonym table therefore also needs to be completed with the help of domain experts.
The partial medical term similarity table and the partial medical term synonym table identified by manual screening are shown in fig. 2.
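The thresholding step can be sketched as below. To keep the example short, the similarity measure is passed in as a parameter, and the demonstration uses a plain character-overlap (Jaccard) score as a stand-in; that stand-in is not the gravity-center-backward-shift formula. The names `similarity_table` and `char_overlap`, the threshold value and the toy terms are all assumptions made for illustration.

```python
from itertools import combinations

def char_overlap(w1, w2):
    """Stand-in similarity (Jaccard overlap of characters); not the patent's formula."""
    a, b = set(w1), set(w2)
    return len(a & b) / len(a | b) if a | b else 0.0

def similarity_table(terms, sim, threshold):
    """Return (term1, term2, score) rows whose score reaches the threshold, best first."""
    unique_terms = sorted(set(terms))                  # drop verbatim duplicates
    rows = [(a, b, sim(a, b)) for a, b in combinations(unique_terms, 2)]
    return sorted((r for r in rows if r[2] >= threshold), key=lambda r: -r[2])

# Latin letters stand in for Chinese characters; "PQ" shares none with the others.
terms = ["ABCD", "XACD", "PQ", "ABCD"]
candidates = similarity_table(terms, char_overlap, threshold=0.5)
```

The rows returned are only candidates for manual screening. A pair such as "PQ" and "ABCD" scores 0 under any purely literal measure, which is why the text calls for domain experts to complete the synonym table.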
22) Synonym matching is performed on the obtained raw disease symptom data, followed by classification and statistical processing, which yields the distribution of symptoms within each disease; coronary heart disease, hypertension, type 2 diabetes, community-acquired pneumonia and primary liver cancer are taken as examples. The classification and statistical processing use existing data processing methods.
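A minimal sketch of this step, assuming the raw data are already available as (disease, symptom list) records and the synonym table as a dictionary from variant term to canonical term. The table entries and the function name `count_symptoms` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical synonym table: variant term -> canonical term.
SYNONYMS = {
    "shortness of breath": "short breath",
    "high blood pressure": "hypertension",
}

def count_symptoms(records, synonyms):
    """Map each symptom to its canonical form, then count occurrences per disease."""
    distribution = defaultdict(Counter)
    for disease, symptoms in records:
        for symptom in symptoms:
            distribution[disease][synonyms.get(symptom, symptom)] += 1
    return distribution

records = [
    ("Hypertension", ["hypertension", "dizziness", "chest distress"]),
    ("Hypertension", ["high blood pressure", "shortness of breath"]),
    ("Coronary heart disease", ["short breath"]),
]
dist = count_symptoms(records, SYNONYMS)
```

The resulting per-disease counters hold the occurrence counts that the weighting step needs: how often each symptom appears in a disease, and which diseases contain a given symptom.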
23) Calculating weights for symptoms in diseases using text mining based TF-IDF word frequency statistical model
After the raw disease symptom data have been classified and statistically processed, the weight of each symptom in a disease is calculated with a text-mining-based TF-IDF (term frequency-inverse document frequency) word frequency statistical model. The mathematical formula of the TF-IDF word frequency statistical model is as follows:
W = TF × IDF = (i/m) × log(N/n) (formula 3)
Wherein TF represents the frequency of a symptom occurring in a disease, and is obtained by dividing the number of occurrences i of the symptom in the disease by the total number of occurrences m of all symptoms in the disease; IDF represents the probability of a symptom appearing in a disease set, and is obtained by dividing the number N of diseases in the disease set by the number n of diseases containing the symptom and taking the logarithm of the obtained quotient; the weight of the symptom is expressed by the product of TF and IDF.
To avoid degenerate values in the actual calculation, a correction coefficient may be added to n, taking n + 0.1, that is, W = TF × IDF = (i/m) × log(N/(n + 0.1)).
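The weighting formula can be sketched as follows. The input is assumed to be a mapping from disease to per-symptom occurrence counts, the logarithm base is assumed to be 10 (the text does not state it), and the function name `tfidf_weights` and the sample counts are illustrative. The optional `correction` argument is the value added to n; 0.0 gives formula 3 and 0.1 gives the corrected variant.

```python
import math

def tfidf_weights(distribution, correction=0.0):
    """distribution: {disease: {symptom: occurrence count}} -> {disease: {symptom: W}}."""
    N = len(distribution)                      # number of diseases in the disease set
    n = {}                                     # n[s]: number of diseases containing symptom s
    for counts in distribution.values():
        for symptom in counts:
            n[symptom] = n.get(symptom, 0) + 1
    weights = {}
    for disease, counts in distribution.items():
        m = sum(counts.values())               # total occurrences of all symptoms in the disease
        weights[disease] = {
            symptom: (i / m) * math.log10(N / (n[symptom] + correction))  # base 10 is an assumption
            for symptom, i in counts.items()
        }
    return weights

# Illustrative counts, not data from the patent.
dist = {
    "Hypertension": {"dizziness": 3, "chest distress": 1},
    "Iron deficiency anemia": {"dizziness": 1, "fatigue": 1},
}
w = tfidf_weights(dist)
```

In this toy input "dizziness" occurs in both diseases, so N/n = 1 and its weight is 0 under formula 3, while "chest distress" occurs in only one disease and receives a positive weight.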
24) Dimensionless treatment
In the multi-index comprehensive evaluation, physical meanings represented by the indexes are different, so that the indexes are different in dimension, and the overall evaluation of the object is influenced by the different dimension. The dimensionless processing of the index is a main means for solving this problem.
Because physical quantities are related to one another, some independent physical quantities are taken as basic measurement units, and the measurement units of the other physical quantities are derived from them. In this system, the symptom weights are non-dimensionalized by taking the sum of the symptom weights in the disease as the basic measurement unit:
$$W_i = \frac{w_i}{\sum_{j} w_j}$$
Wherein $w_i$ represents the weight of symptom $i$ in the disease before dimensionless treatment, $\sum_{j} w_j$ represents the sum of the weights of the symptoms of the disease before dimensionless treatment, and $W_i$ represents the weight of symptom $i$ in the disease after dimensionless treatment. After non-dimensionalization,
$$\sum_{i} W_i = 1$$
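The non-dimensionalization amounts to dividing each symptom weight by the sum of the weights within the same disease. A minimal sketch follows; the function name `normalize_weights` and the sample weights are illustrative and are not values from Table 1, and returning zeros when every weight is zero is an assumption the text does not address.

```python
def normalize_weights(weights):
    """Divide each symptom weight by the sum of the weights within the disease."""
    total = sum(weights.values())
    if total == 0:
        # Assumption: when all weights are zero, leave them at zero.
        return {symptom: 0.0 for symptom in weights}
    return {symptom: w / total for symptom, w in weights.items()}

# Illustrative pre-normalization weights for one disease.
raw = {"polydipsia": 0.30, "polyuria": 0.15, "weight loss": 0.05}
normalized = normalize_weights(raw)
```

After this step the weights of one disease sum to 1, so symptom weights can be compared across diseases that have different numbers of recorded symptoms.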
Processing the raw disease symptom data yields the parameters of the disease symptoms. Taking type 2 diabetes as an example, the parameters of its symptoms are shown in Table 1, which lists the TF, IDF and Wi values of the type 2 diabetes symptoms.
TABLE 1
Figure BDA0003245449070000091
The invention has the following beneficial effects:
(1) The invention saves a great deal of manpower, financial resources and time, and the symptom weights obtained quantitatively, by applying statistical and related methods to massive real cases, are more reasonable than empirical symptom weights obtained from domain experts.
(2) Through the algorithm disclosed by the invention, the similarity of the medical words can be accurately calculated, and a foundation with higher reliability is provided for the finally obtained data result.
(3) The data result obtained according to the invention can be used in the knowledge base of the medical guidance system to guide the patient to the corresponding department for accurate diagnosis after the initial diagnosis of the disease is obtained.
(4) The data results obtained according to the invention can also be used in the knowledge base of the internet-based disease self-diagnosis system, the target population of which is common residents rather than a specific doctor group, and the system can be used for making patients preliminarily diagnose according to the symptom information of the patients, so that the patients can know the relevant conditions of diseases in advance for reference.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (8)

1. A disease symptom and weight knowledge obtaining and processing method based on a disease case library is characterized in that a massive disease case library on the Internet is used as an information source, and the disease symptom and the weight knowledge thereof are automatically obtained by processing original data of the information source; the method comprises the following steps:
1) acquiring disease symptom original data comprising a disease name and corresponding symptom information;
2) performing word similarity calculation on the original data to obtain a medical word similarity table; carrying out synonym manual identification on the medical term similarity table to obtain a medical term synonym table; specifically, a single Chinese character literal similarity calculation method based on gravity center backward shift is adopted to calculate and obtain a medical word similarity table; the single Chinese character literal similarity algorithm based on gravity center backward shift is as follows:
let the similarity of the words $w_1$ and $w_2$ be $\mathrm{sim}(w_1, w_2)$; $|w_1|$ and $|w_2|$ respectively represent the number of characters contained in $w_1$ and $w_2$; $\mathrm{Same}(w_1, w_2)$ denotes the set of morphemes contained in both $w_1$ and $w_2$, and $|\mathrm{Same}(w_1, w_2)|$ represents the number of the same morphemes; $w_1(i)$ denotes the $i$-th morpheme in $w_1$, and $\mathrm{Weight}(w_1, i)$ represents the weight of the $i$-th morpheme in $w_1$: if $w_1(i) \in \mathrm{Same}(w_1, w_2)$, then $\mathrm{Weight}(w_1, i) = i$, otherwise $\mathrm{Weight}(w_1, i) = 0$;
$$\sum_{i=1}^{|w_1|} i$$
denotes the sum of the position weights of all morphemes in $w_1$; $w_2(j)$ is treated in the same way as $w_1(i)$; the position coefficient $d$ is taken as the smaller of the ratios $|w_1| / |w_2|$ and $|w_2| / |w_1|$, i.e.:
$$d = \min\left(\frac{|w_1|}{|w_2|}, \frac{|w_2|}{|w_1|}\right)$$
there are two factors that affect word similarity: the number of the same morphemes contained in the two words and the position weight of the same morphemes in each word; the word similarity can then be calculated according to the following formula:
$$\mathrm{sim}(w_1, w_2) = \alpha \cdot \frac{1}{2}\left(\frac{|\mathrm{Same}(w_1, w_2)|}{|w_1|} + \frac{|\mathrm{Same}(w_1, w_2)|}{|w_2|}\right) + \beta \cdot d \cdot \frac{1}{2}\left(\frac{\sum_{i=1}^{|w_1|} \mathrm{Weight}(w_1, i)}{\sum_{i=1}^{|w_1|} i} + \frac{\sum_{j=1}^{|w_2|} \mathrm{Weight}(w_2, j)}{\sum_{j=1}^{|w_2|} j}\right)$$
$\alpha$ and $\beta$ respectively represent the weight coefficients of the similarity in the number of the same morphemes and the similarity in the positions of the same morphemes, and satisfy $\alpha + \beta = 1$;
3) classifying and counting the original data to obtain the corresponding relation and distribution condition of the disease name and the symptoms;
4) obtaining the weight of each symptom in the disease;
5) carrying out dimensionless treatment; specifically, the sum of the weights of symptoms in the disease is used as a basic measurement unit, and the weights of the symptoms in the disease are subjected to non-dimensionalization treatment;
thereby obtaining a plurality of parameters of the disease symptoms, including: the frequency of a symptom appearing in a disease, the probability of a symptom appearing in a disease set, the weight of a symptom in a disease before non-dimensionalization, and the weight of a symptom in a disease after non-dimensionalization are used for overall evaluation of the relationship between a disease and a symptom.
2. The method for acquiring and processing disease symptoms and weight knowledge thereof based on the disease case base as claimed in claim 1, wherein step 1) acquires the original data of the disease symptoms by a web crawler strategy, parsing the HTML tags in the web page source code of disease cases on the internet and performing tag matching with regular expressions.
3. The method for acquiring and processing disease symptoms and weight knowledge thereof based on the disease case base as claimed in claim 1, wherein step 2) acquires the medical term synonym table by manual screening and identification.
4. The method for acquiring and processing disease symptoms and weight knowledge thereof based on the disease case base as claimed in claim 3, wherein the medical term synonym table is further refined by domain expert identification.
5. The method for acquiring and processing disease symptoms and weight knowledge thereof based on the disease case base as claimed in claim 1, wherein step 4) adopts the text mining TF-IDF word frequency statistical model to calculate the weight of each symptom in the disease.
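As an illustration of the tag matching described in claim 2, the following minimal sketch extracts a disease name and its symptoms from page source with regular expressions. The HTML layout, the class names `disease` and `symptom`, and the function name `extract_case` are all hypothetical; a real case site would need patterns written against its own markup, and fetching the page is left out.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical page fragment; real disease case pages will use different tags.
SAMPLE_HTML = """
<div class="case">
  <h2 class="disease">Influenza</h2>
  <ul>
    <li class="symptom">fever</li>
    <li class="symptom">cough</li>
  </ul>
</div>
"""

# Non-greedy groups capture the text between the matched opening and closing tags.
DISEASE_RE = re.compile(r'<h2 class="disease">(.*?)</h2>', re.S)
SYMPTOM_RE = re.compile(r'<li class="symptom">(.*?)</li>', re.S)


def extract_case(html: str) -> Tuple[Optional[str], List[str]]:
    """Return (disease name, list of symptoms) found in one case page."""
    match = DISEASE_RE.search(html)
    disease = match.group(1).strip() if match else None
    symptoms = [s.strip() for s in SYMPTOM_RE.findall(html)]
    return disease, symptoms


record = extract_case(SAMPLE_HTML)
```

Each extracted record pairs one disease name with the symptoms listed on that page, which is the shape of raw data that the later counting and weighting steps operate on.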
6. The method for acquiring and processing disease symptoms and weight knowledge thereof based on the disease case base as claimed in claim 1, wherein the mathematical formula of the TF-IDF word frequency statistical model is: w = TF × IDF = (i/m) × log(N/n); wherein TF represents the frequency of a symptom occurring in a disease, obtained by dividing the number i of occurrences of the symptom in the disease by the total number m of occurrences of all symptoms in the disease; IDF represents the probability of a symptom appearing in the disease set, obtained by dividing the number N of diseases in the disease set by the number n of diseases containing the symptom and taking the logarithm of the quotient; the weight of the symptom is expressed by the product of TF and IDF.
7. The method for acquiring and processing disease symptoms and weight knowledge thereof based on the disease case base as claimed in claim 1, wherein the mathematical formula of the TF-IDF word frequency statistical model is: w = TF × IDF = (i/m) × log(N/(n + 0.1)); wherein TF represents the frequency of a symptom occurring in a disease, obtained by dividing the number i of occurrences of the symptom in the disease by the total number m of occurrences of all symptoms in the disease; IDF represents the probability of a symptom appearing in the disease set, obtained by dividing the number N of diseases in the disease set by the number n of diseases containing the symptom and taking the logarithm of the quotient; the weight of the symptom is expressed by the product of TF and IDF.
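The two TF-IDF variants in claims 6 and 7 can be sketched in one function, under some assumptions: the input is a list of (disease, symptoms) case records, the logarithm is taken as the natural logarithm (the claims do not state a base), and the 0.1 term of claim 7 is read as an addition to the count of diseases containing the symptom. The name `symptom_weights` and the `smoothing` parameter are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def symptom_weights(
    cases: List[Tuple[str, List[str]]], smoothing: float = 0.0
) -> Dict[str, Dict[str, float]]:
    """Weight w = (i/m) * log(N / (n + smoothing)) for every symptom of every disease."""
    # Step 3): count symptom occurrences per disease.
    counts: Dict[str, Counter] = {}
    for disease, symptoms in cases:
        counts.setdefault(disease, Counter()).update(symptoms)

    total_diseases = len(counts)  # N: number of diseases in the set
    diseases_with = Counter()     # n: number of diseases containing each symptom
    for symptom_counts in counts.values():
        diseases_with.update(symptom_counts.keys())

    weights: Dict[str, Dict[str, float]] = {}
    for disease, symptom_counts in counts.items():
        m = sum(symptom_counts.values())  # m: all symptom occurrences in this disease
        weights[disease] = {
            symptom: (i / m) * math.log(total_diseases / (diseases_with[symptom] + smoothing))
            for symptom, i in symptom_counts.items()
        }
    return weights


cases = [("flu", ["fever", "cough", "fever"]), ("cold", ["cough", "sneeze"])]
plain = symptom_weights(cases)                   # claim 6 form
smoothed = symptom_weights(cases, smoothing=0.1)  # claim 7 form, as read here
```

In this toy data "cough" occurs in both diseases, so its claim 6 weight is zero, while "fever" is specific to "flu" and receives a positive weight. Under the reading used here, the added 0.1 makes the weight of a symptom present in every disease slightly negative rather than zero.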
8. The method for acquiring and processing disease symptoms and weight knowledge thereof based on the disease case base according to claim 1, wherein the formula for non-dimensionalizing the weights of the symptoms in the disease, with the sum of the weights of the symptoms in the disease as the basic unit of measure, is:
$$W_i = \frac{w_i}{\sum_{j} w_j}$$
wherein $w_i$ represents the weight of symptom $i$ in the disease before non-dimensionalization, $\sum_{j} w_j$ represents the sum of the weights of the symptoms of the disease before non-dimensionalization, and $W_i$ represents the weight of symptom $i$ in the disease after non-dimensionalization.
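The non-dimensionalization of claim 8 amounts to dividing each symptom weight by the sum of the weights for the same disease. A minimal sketch follows; the function name `normalize_weights` is hypothetical, and returning all zeros when the weights sum to zero is an assumption, since the claim does not address that case.

```python
from typing import Dict


def normalize_weights(weights: Dict[str, float]) -> Dict[str, float]:
    """Return W_i = w_i / sum_j(w_j) for the symptoms of one disease."""
    total = sum(weights.values())
    if total == 0:
        # Assumption: with no weight to distribute, every symptom gets 0.
        return {symptom: 0.0 for symptom in weights}
    return {symptom: w / total for symptom, w in weights.items()}


normalized = normalize_weights({"fever": 0.3, "cough": 0.1})
```

After this step the weights of one disease sum to 1, which makes symptom weights comparable across diseases that have different numbers of recorded symptoms.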
CN202111031558.2A 2016-09-21 2016-09-21 Disease symptom and weight knowledge acquisition and processing method based on disease case base Pending CN113724884A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111031558.2A CN113724884A (en) 2016-09-21 2016-09-21 Disease symptom and weight knowledge acquisition and processing method based on disease case base

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111031558.2A CN113724884A (en) 2016-09-21 2016-09-21 Disease symptom and weight knowledge acquisition and processing method based on disease case base
CN201610836533.2A CN106372439A (en) 2016-09-21 2016-09-21 Method for acquiring and processing disease symptoms and weight knowledge thereof based on case library

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201610836533.2A Division CN106372439A (en) 2016-09-21 2016-09-21 Method for acquiring and processing disease symptoms and weight knowledge thereof based on case library

Publications (1)

Publication Number Publication Date
CN113724884A true CN113724884A (en) 2021-11-30

Family

ID=57897784

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202111031558.2A Pending CN113724884A (en) 2016-09-21 2016-09-21 Disease symptom and weight knowledge acquisition and processing method based on disease case base
CN201610836533.2A Pending CN106372439A (en) 2016-09-21 2016-09-21 Method for acquiring and processing disease symptoms and weight knowledge thereof based on case library

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201610836533.2A Pending CN106372439A (en) 2016-09-21 2016-09-21 Method for acquiring and processing disease symptoms and weight knowledge thereof based on case library

Country Status (1)

Country Link
CN (2) CN113724884A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114334065A (en) * 2022-03-07 2022-04-12 阿里巴巴达摩院(杭州)科技有限公司 Medical record processing method, computer readable storage medium and computer device

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066818A (en) * 2017-03-30 2017-08-18 深圳市金立通信设备有限公司 A kind of methods for the diagnosis of diseases and terminal
CN107680689A (en) * 2017-05-05 2018-02-09 平安科技(深圳)有限公司 Potential disease estimating method, system and the readable storage medium storing program for executing of medical text
CN108153734A (en) * 2017-12-26 2018-06-12 北京嘉和美康信息技术有限公司 A kind of text handling method and device
CN108257670B (en) * 2018-01-22 2021-06-22 北京颐圣智能科技有限公司 Method and device for establishing medical interpretation model
CN108346474B (en) * 2018-03-14 2021-09-28 湖南省蓝蜻蜓网络科技有限公司 Electronic medical record feature selection method based on word intra-class distribution and inter-class distribution
CN109448860A (en) * 2018-09-10 2019-03-08 平安科技(深圳)有限公司 Disease data mapping method, device, computer equipment and storage medium
CN109326352B (en) * 2018-10-26 2022-04-15 腾讯科技(深圳)有限公司 Disease prediction method, device, terminal and storage medium
CN110085307B (en) * 2019-04-04 2023-02-03 华东理工大学 Intelligent diagnosis guiding method and system based on multi-source knowledge graph fusion
CN112837813A (en) * 2019-11-25 2021-05-25 北京搜狗科技发展有限公司 Automatic inquiry method and device
CN113689923A (en) * 2020-05-19 2021-11-23 北京平安联想智慧医疗信息技术有限公司 Medical data processing apparatus, system and method
CN111816301A (en) * 2020-07-07 2020-10-23 平安科技(深圳)有限公司 Medical inquiry assisting method, device, electronic equipment and medium
CN111863240A (en) * 2020-07-08 2020-10-30 中润普达(十堰)大数据中心有限公司 Disease cognitive system based on abnormal change of human body fluid
CN111816321B (en) * 2020-07-09 2022-06-14 武汉东湖大数据交易中心股份有限公司 System, apparatus and storage medium for intelligent infectious disease identification based on legal diagnostic criteria
CN111951955A (en) * 2020-08-13 2020-11-17 神州数码医疗科技股份有限公司 Method and device for constructing clinical decision support system based on rule reasoning
CN112002415B (en) * 2020-08-23 2024-03-01 吾征智能技术(北京)有限公司 Intelligent cognitive disease system based on human excrement
CN112002416A (en) * 2020-08-23 2020-11-27 吾征智能技术(北京)有限公司 Disease symptom prediction system based on urine character self-learning
CN112002413B (en) * 2020-08-23 2023-09-29 吾征智能技术(北京)有限公司 Intelligent cognitive system, equipment and storage medium for cardiovascular system infection
CN111985246B (en) * 2020-08-27 2023-08-15 武汉东湖大数据交易中心股份有限公司 Disease cognitive system based on main symptoms and accompanying symptom words
CN112017774B (en) * 2020-08-31 2023-10-03 吾征智能技术(北京)有限公司 Method and system for constructing disease prediction model based on halitosis accompanying symptoms
CN112435761A (en) * 2020-12-04 2021-03-02 中国信息通信研究院 Information recommendation method and device
CN112768082A (en) * 2021-02-04 2021-05-07 常熟和医信息技术有限公司 Method for automatically giving disease diagnosis and treatment scheme according to electronic medical record text
CN113641784A (en) * 2021-06-25 2021-11-12 合肥工业大学 Medical knowledge recommendation method and system integrating medical teaching and research

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105912856A (en) * 2016-04-11 2016-08-31 北京科技大学 Traditional Chinese medicine symptom structured method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050181350A1 (en) * 2004-02-18 2005-08-18 Anuthep Benja-Athon Pattern of medical words and terms
CN102156812A (en) * 2011-04-02 2011-08-17 中国医学科学院医学信息研究所 Hospital decision-making aiding method based on symptom similarity analysis
US8965818B2 (en) * 2012-05-16 2015-02-24 Siemens Aktiengesellschaft Method and system for supporting a clinical diagnosis
CN104102816B (en) * 2014-06-20 2017-07-25 周晋 Auto-check system and method with machine learning is matched based on symptom
CN104463754B (en) * 2014-12-30 2018-01-23 天津迈沃医药技术股份有限公司 The method for building up of medical information ontology database based on genius morbi

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
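Song Yan: "Research on mining disease symptom weights based on the text-mining term frequency-inverse document frequency (TF-IDF) method", Journal of Chengdu University of Information Technology, vol. 29, no. 1 *
Liang Lu: "Research on an intelligent medical guidance system based on an improved VSM weighting algorithm", China Master's Theses Full-text Database *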

Also Published As

Publication number Publication date
CN106372439A (en) 2017-02-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination