CN113724884A

CN113724884A - Disease symptom and weight knowledge acquisition and processing method based on disease case base

Info

Publication number: CN113724884A
Application number: CN202111031558.2A
Authority: CN
Inventors: 金芝; 李戈; 陆军
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2016-09-21
Filing date: 2016-09-21
Publication date: 2021-11-30
Also published as: CN106372439A

Abstract

The invention relates to a method for acquiring and processing knowledge of disease symptoms and their weights based on a disease case library. The method takes massive disease case libraries on the Internet as an information source and automatically acquires disease symptoms and their weight knowledge by processing the raw data of the information source. The method comprises the following steps: matching HTML tags with regular expressions and obtaining raw disease symptom data through a web crawler strategy; performing word similarity calculation and synonym recognition to obtain a medical word similarity table and a medical word synonym table; and performing classification, TF-IDF word frequency statistics, and non-dimensionalization to obtain a plurality of parameters, such as disease symptoms and their weights, which are used to evaluate the relationship between diseases and symptoms as a whole. With the technical scheme provided by the invention, a large amount of manpower, financial resources, and time can be saved; the resulting disease symptoms and weights are more reasonable; and the method is suitable for medical guidance systems, Internet-based disease self-diagnosis systems, and similar scenarios.

Description

Disease symptom and weight knowledge acquisition and processing method based on disease case base

The present application is a divisional application of the patent application entitled "method for acquiring and processing disease symptoms and weight knowledge based on a disease case library", filed on September 21, 2016, with application number 201610836533.2.

Technical Field

The invention relates to an internet data acquisition and processing method, in particular to a disease symptom and weight knowledge acquisition and processing method based on a disease case base.

Background

Symptoms are the subjective abnormal sensations or objective pathological changes of a patient caused by a series of abnormal changes in the function, metabolism, and morphological structure of the body during the course of a disease. Asking about symptoms is the first step a doctor takes in investigating a patient's disease; symptoms are the main content of the inquiry and an important clue and main basis for diagnosing and differentiating diseases.

In expert systems for disease self-diagnosis and medical guidance, patient information generally cannot be obtained through professional auxiliary medical examination equipment, and only a preliminary diagnosis based on the patient's symptoms can be made, so a knowledge base of disease symptoms needs to be constructed. In system development, the traditional way to build a knowledge base of disease symptoms and their weights is for knowledge engineers to obtain the relevant knowledge from domain experts or relevant technical literature. This approach relies heavily on experience, consumes considerable manpower and financial resources, takes a long time, and is a bottleneck in system development.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention aims to provide a disease symptom and weight knowledge acquiring and processing method based on a disease case base, which automatically acquires the disease symptom and the weight knowledge by processing the original data of an information source and provides a medical knowledge base for the auxiliary diagnosis of diseases.

The principle of the invention is as follows: different symptoms play different roles (weights) in the diagnostic criteria of a disease. For example, in the diagnostic criteria for stroke, symptoms such as hemiplegia, facial distortion, slurred speech, headache, and dizziness differ in importance: if a patient has hemiplegia, facial distortion, slurred speech, and the like, the possibility of stroke is high, whereas this is not the case if the patient has only headache, dizziness, and the like. Therefore, determining symptom weights quantitatively and scientifically is very important for establishing disease diagnostic criteria. The method mainly uses massive disease case libraries on the Internet as an information source, matches HTML tags with regular expressions, obtains raw disease symptom data through a web crawler strategy, and then processes the raw data through word similarity calculation, synonym identification and matching, classification, TF-IDF word frequency statistics, non-dimensionalization, and the like to obtain medical knowledge of disease symptoms and their weights.

In order to achieve the purpose, the invention provides the following scheme:

a disease symptom and weight knowledge obtaining and processing method based on a disease case library takes a mass of disease case libraries on the Internet as an information source, and automatically obtains the disease symptom and weight knowledge thereof by processing original data of the information source; the method comprises the following steps:

1) acquiring disease symptom original data comprising a disease name and corresponding symptom information;

2) performing word similarity calculation on the original data to obtain a medical word similarity table; manually identifying synonyms in the medical word similarity table to obtain a medical word synonym table; specifically, the medical word similarity table is calculated using a single Chinese character literal similarity calculation method based on gravity center backward shift; the algorithm is as follows:

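let the similarity of words $w_1$ and $w_2$ be $\mathrm{sim}(w_1, w_2)$; $|w_1|$ and $|w_2|$ denote the number of characters contained in $w_1$ and $w_2$, respectively; $\mathrm{Same}(w_1, w_2)$ denotes the set of morphemes contained in both $w_1$ and $w_2$, and $|\mathrm{Same}(w_1, w_2)|$ denotes the number of such shared morphemes; $w_1(i)$ denotes the $i$-th morpheme of $w_1$, and $\mathrm{Weight}(w_1, i)$ denotes the weight of the $i$-th morpheme of $w_1$: if $w_1(i) \in \mathrm{Same}(w_1, w_2)$, then $\mathrm{Weight}(w_1, i) = i$; otherwise $\mathrm{Weight}(w_1, i) = 0$;

$\sum_{i=1}^{|w_1|} i$ denotes the sum of the position weights of all morphemes in $w_1$; $w_2(j)$ is defined in the same way as $w_1(i)$; the position coefficient $d$ is taken as the smaller of the two ratios between $|w_1|$ and $|w_2|$, i.e.:

$$d = \min\left(\frac{|w_1|}{|w_2|}, \frac{|w_2|}{|w_1|}\right)$$

there are two factors that affect word similarity: the number of the same morphemes contained in the two words and the position weight of the same morphemes in each word. The word similarity can then be calculated according to the following formula:

$$\mathrm{sim}(w_1, w_2) = \alpha \cdot \frac{1}{2}\left(\frac{|\mathrm{Same}(w_1, w_2)|}{|w_1|} + \frac{|\mathrm{Same}(w_1, w_2)|}{|w_2|}\right) + \beta \cdot d \cdot \frac{1}{2}\left(\frac{\sum_{i=1}^{|w_1|} \mathrm{Weight}(w_1, i)}{\sum_{i=1}^{|w_1|} i} + \frac{\sum_{j=1}^{|w_2|} \mathrm{Weight}(w_2, j)}{\sum_{j=1}^{|w_2|} j}\right)$$

$\alpha$ and $\beta$ respectively represent the weight coefficients of the similarity in the number of the same morphemes and the similarity in the positional relationship of the same morphemes, and satisfy $\alpha + \beta = 1$;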

3) classifying and counting the original data to obtain the corresponding relation and distribution condition of the disease name and the symptoms;

4) obtaining the weight of each symptom in the disease;

5) carrying out dimensionless treatment; specifically, the sum of the weights of symptoms in the disease is used as a basic measurement unit, and the weights of the symptoms in the disease are subjected to non-dimensionalization treatment;

thereby obtaining a plurality of parameters of the disease symptoms, including the frequency of a symptom in a disease, the probability of a symptom appearing in the disease set, the weight of a symptom in a disease before non-dimensionalization, and the weight of a symptom in a disease after non-dimensionalization; these parameters are used to evaluate the relationship between diseases and symptoms as a whole.

Preferably, step 1) analyzes the HTML tags in the web page source code of disease cases on the Internet, performs tag matching with regular expressions, and obtains the raw disease symptom data through a web crawler strategy.

Preferably, step 2) obtains the medical term synonym table by manual screening recognition.

Preferably, the medical term synonym table is further refined through identification by domain experts.

Preferably, step 4) adopts a TF-IDF word frequency statistical model based on text mining to calculate and obtain the weight of symptoms in the disease.

Preferably, the mathematical formula of the TF-IDF word frequency statistical model is as follows: w = TF × IDF = (i/m) × log(N/n); wherein TF represents the frequency of a symptom in a disease, obtained by dividing the number of occurrences i of the symptom in the disease by the total number of occurrences m of all symptoms in the disease; IDF represents the probability of a symptom appearing in the disease set, obtained by dividing the number N of diseases in the disease set by the number n of diseases containing the symptom and taking the logarithm of the quotient; the weight of the symptom is the product of TF and IDF.

Preferably, the mathematical formula of the TF-IDF word frequency statistical model is as follows: w = TF × IDF = (i/m) × log(N/(n + 0.1)); wherein TF represents the frequency of a symptom in a disease, obtained by dividing the number of occurrences i of the symptom in the disease by the total number of occurrences m of all symptoms in the disease; IDF represents the probability of a symptom appearing in the disease set, obtained by dividing the number N of diseases in the disease set by the number n of diseases containing the symptom (plus the correction coefficient 0.1) and taking the logarithm of the quotient; the weight of the symptom is the product of TF and IDF.

Preferably, the formula for non-dimensionalizing the weights of the symptoms in the disease, using the sum of the weights of the symptoms in the disease as the basic unit of measure, is as follows:

$$W_i = \frac{w_i}{\sum_{j} w_j}$$

wherein $w_i$ represents the weight of symptom $i$ in the disease before dimensionless treatment, $\sum_{j} w_j$ represents the sum of the weights of all symptoms of the disease before dimensionless treatment, and $W_i$ represents the weight of symptom $i$ in the disease after dimensionless treatment.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

the invention provides a disease symptom and weight knowledge acquisition and processing method based on a disease case library. With the technical scheme provided by the invention, a large amount of manpower, financial resources, and time are saved, and the disease symptom weights obtained quantitatively by applying statistical and other methods to massive real cases are more reasonable than the empirical disease symptom weights obtained from domain experts. The application also specifically discloses the single Chinese character literal similarity calculation method based on gravity center backward shift that is used to calculate the medical word similarity table. The data result can be further applied in the following two aspects:

firstly, the knowledge base is used for a medical guidance system to guide a patient to a corresponding department for accurate diagnosis after the initial diagnosis of the disease is obtained;

the other is a knowledge base for an internet-based disease self-diagnosis system, the target population of the system is common residents rather than specific doctor populations, and the system can be used for enabling patients to carry out preliminary diagnosis according to the symptom information of the patients, so that the patients can know the relevant conditions of diseases in advance for reference.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a block flow diagram of a method for acquiring and processing disease symptoms and their weight knowledge based on a case base in an embodiment of the present invention;

FIG. 2 is an example of a partial medical term similarity table and an example of a partial medical term synonym table in an embodiment of the present disclosure;

FIG. 3 is a parameter set of symptoms in an example of type 2 diabetes mellitus in accordance with an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention aims to provide a disease symptom and weight knowledge acquiring and processing method based on a disease case library.

The method mainly uses a massive disease case library on the Internet as an information source, adopts a regular expression to carry out HTML label matching, obtains disease symptom original data through a web crawler strategy, and then obtains disease symptoms and weight medical knowledge thereof after processing the original data through word similarity calculation, synonym identification and matching, classification, TF-IDF word frequency statistics, dimensionless treatment and the like. This saves a lot of manpower, financial resources and time, and the weight of disease symptoms obtained quantitatively by using methods such as statistics on a large number of real cases is more reasonable than the weight of empirical disease symptoms obtained from field experts.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

Fig. 1 is a flow chart of the method for acquiring and processing disease symptoms and their weight knowledge based on a case base in an embodiment of the present invention. As shown in Fig. 1, the present embodiment includes the following steps:

1) disease symptom raw data acquisition

In this embodiment of the invention, the HTML tags in the web page source code of community-acquired pneumonia cases on a certain Internet website are analyzed, regular expressions are used for tag matching, and the raw disease symptom data are obtained through a web crawler strategy. Some of the raw disease symptom data in this example are as follows:

Type 2 diabetes: polydipsia, polyphagia and weight loss

Hypertension: hypertension, dizziness and chest distress

Coronary heart disease: oppression and pain in the precordial region and short breath

Hypertension: hypertension and short breath

Hypertension: hypertension, obesity, asthenia

Tuberculosis: low fever and night sweat

Diabetic peripheral neuritis: numbness of extremities and pain of limbs

Type 2 diabetes: frequent urination, eating and urination

Chronic obstructive pulmonary disease: cough, white phlegm

Viral myocarditis: fever and sore throat

Chronic obstructive pulmonary disease: cough and asthma

Type 2 diabetes: frequent urination, eating and urination

Hyperthyroidism: aversion to heat, profuse sweating and emaciation

Systemic lupus erythematosus: facial butterfly erythema and arthralgia

Rheumatoid arthritis: swelling and stiffness of joints

Type 1 diabetes mellitus: frequent urination and frequent urination

Gout: uric acid increase and joint swelling

Iron deficiency anemia: dizziness and fatigue
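The tag-matching part of step 1) can be illustrated with a minimal sketch. The page markup, the class names (`disease`, `symptoms`), and the `extract_cases` helper below are hypothetical, since the source does not name the website or describe its HTML; only Python's standard `re` module is used, and the page-fetching part of the crawler is left out.

```python
import re

# Hypothetical page fragment; the real site's markup is not given in the source.
SAMPLE_HTML = """
<div class="case"><h3 class="disease">Type 2 diabetes</h3>
<p class="symptoms">polydipsia, polyphagia, weight loss</p></div>
<div class="case"><h3 class="disease">Tuberculosis</h3>
<p class="symptoms">low fever, night sweat</p></div>
"""

# One regular expression matches the disease tag and the symptom tag of a case.
CASE_PATTERN = re.compile(
    r'<h3 class="disease">(?P<disease>.*?)</h3>\s*'
    r'<p class="symptoms">(?P<symptoms>.*?)</p>',
    re.DOTALL,
)


def extract_cases(html: str) -> list[tuple[str, list[str]]]:
    """Return (disease name, symptom list) pairs found in one page."""
    cases = []
    for match in CASE_PATTERN.finditer(html):
        disease = match.group("disease").strip()
        symptoms = [s.strip() for s in match.group("symptoms").split(",") if s.strip()]
        cases.append((disease, symptoms))
    return cases


for disease, symptoms in extract_cases(SAMPLE_HTML):
    print(f"{disease}: {', '.join(symptoms)}")
```

Each extracted pair has the same shape as the raw records listed above: a disease name followed by its reported symptoms.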

2) Disease symptom raw data processing

21) Performing word similarity calculation and synonym recognition on original data

211) Calculating similarity of medical terms by using single Chinese character literal similarity calculation method based on gravity center backward shift

A considerable portion of the medical terms in the raw data acquired from the Internet have the same or similar meanings and are synonyms or near-synonyms. Analysis leads to the following conclusion: medical terms that share some of the same Chinese characters are strongly similar in their literal form, and the meanings they express are also the same or similar, for example abdominal discomfort and epigastric discomfort, chest pain, chronic obstructive pulmonary disease and chronic pulmonary embolism, choledocholithiasis, and the like. The similarity of medical terms is therefore calculated using a single Chinese character literal similarity calculation method based on gravity center backward shift.

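The single Chinese character literal similarity algorithm based on the gravity center backward shift is described as follows:

$$\mathrm{sim}(w_1, w_2) = \alpha \cdot \frac{1}{2}\left(\frac{|\mathrm{Same}(w_1, w_2)|}{|w_1|} + \frac{|\mathrm{Same}(w_1, w_2)|}{|w_2|}\right) + \beta \cdot d \cdot \frac{1}{2}\left(\frac{\sum_{i=1}^{|w_1|} \mathrm{Weight}(w_1, i)}{\sum_{i=1}^{|w_1|} i} + \frac{\sum_{j=1}^{|w_2|} \mathrm{Weight}(w_2, j)}{\sum_{j=1}^{|w_2|} j}\right), \qquad d = \min\left(\frac{|w_1|}{|w_2|}, \frac{|w_2|}{|w_1|}\right)$$

Here $|w|$ is the number of characters in word $w$, $\mathrm{Same}(w_1, w_2)$ is the set of morphemes shared by the two words, and $\mathrm{Weight}(w, i) = i$ if the $i$-th morpheme of $w$ belongs to $\mathrm{Same}(w_1, w_2)$ and $0$ otherwise. In the above formula, $\alpha$ and $\beta$ represent the weight coefficients of the similarity in the number of the same morphemes and the similarity in the positional relationship of the same morphemes, respectively, and $\alpha + \beta = 1$.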

In this example, α = 0.4 and β = 0.6. The similarity between "abdominal discomfort" and "upper abdominal discomfort" is 0.81, the similarity between two variant terms for "chest pain" is 0.525, the similarity between "chronic obstructive pulmonary disease" and "chronic pulmonary embolism" is 0.4652, and the similarity between "common bile duct stone" and "common bile duct lower stone" is 0.703.
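A minimal Python sketch of this similarity calculation follows. It combines the averaged share of common characters (weighted by α) with the averaged position-weight ratio scaled by the position coefficient d (weighted by β). Several points are assumptions rather than statements from the source: the function names are illustrative, the Chinese strings are guesses at the untranslated originals behind the English example pairs, and a repeated character is matched at most as many times as it occurs in the other word. Under these assumptions the sketch reproduces the four similarity values reported above.

```python
from collections import Counter


def matched_positions(word: str, other: str) -> list[int]:
    """Return the 1-based positions in `word` whose character also occurs in `other`.

    Assumption: a repeated character is matched at most as many times as it
    occurs in the other word, scanning left to right.
    """
    remaining = Counter(other)
    positions = []
    for i, ch in enumerate(word, start=1):
        if remaining[ch] > 0:
            remaining[ch] -= 1
            positions.append(i)
    return positions


def literal_similarity(w1: str, w2: str, alpha: float = 0.4, beta: float = 0.6) -> float:
    """Literal similarity in which later characters carry larger position weights."""
    if not w1 or not w2:
        return 0.0
    pos1 = matched_positions(w1, w2)
    pos2 = matched_positions(w2, w1)
    same = len(pos1)  # number of shared characters (equals len(pos2))

    # Factor 1: share of common characters in each word, averaged.
    count_term = (same / len(w1) + same / len(w2)) / 2

    # Factor 2: a match at position i contributes weight i, so matches near
    # the end of a word count more (the "gravity center backward shift").
    total1 = len(w1) * (len(w1) + 1) / 2  # 1 + 2 + ... + |w1|
    total2 = len(w2) * (len(w2) + 1) / 2
    d = min(len(w1) / len(w2), len(w2) / len(w1))  # position coefficient
    position_term = d * (sum(pos1) / total1 + sum(pos2) / total2) / 2

    return alpha * count_term + beta * position_term


# Guessed Chinese originals of the translated example pairs (assumption).
print(round(literal_similarity("腹部不适", "上腹不适"), 4))          # 0.81
print(round(literal_similarity("胸痛", "胸部疼痛"), 4))              # 0.525
print(round(literal_similarity("慢性阻塞性肺疾病", "慢性肺栓塞"), 4))  # 0.4652
print(round(literal_similarity("胆总管结石", "胆总管下段结石"), 3))    # 0.703
```

Two terms that share no characters get a similarity of 0, which is why such pairs cannot be found by this calculation alone.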

212) Medical term synonym recognition

In the field of information retrieval, the concept of a synonym differs from that in linguistics and daily life: emotional coloring and tone are not considered, and synonyms are one or more words that can replace each other and express the same or similar concepts.

A threshold is set for sim(w1, w2). The single Chinese character literal similarity algorithm based on gravity center backward shift is applied to the acquired raw data to obtain the medical word similarity table, from which synonyms are manually screened and identified and then stored in the synonym table. The algorithm has a limitation: some terms have the same or similar meanings but share no Chinese characters (for example, variant terms for high fever or for diarrhea), so the algorithm gives them a similarity of 0. The synonym table therefore also needs to be refined with the help of domain experts.

The partial medical term similarity table and the partial medical term synonym table identified by manual screening are shown in fig. 2.

22) Synonym matching is performed on the acquired raw disease symptom data, followed by classification and statistical processing. Taking coronary heart disease, hypertension, type 2 diabetes, community-acquired pneumonia, and primary liver cancer as examples, the distribution of symptoms in each disease is obtained after processing. The classification and statistical processing use existing data processing methods.
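One possible way to carry out this step is sketched below: each symptom is mapped to a canonical term through the synonym table, and occurrences are then counted per disease. The synonym entries, the simplified records (adapted from the sample raw data above), and the `symptom_distribution` helper are illustrative assumptions; the source only states that existing data processing methods are used.

```python
from collections import Counter, defaultdict

# Hypothetical synonym table: variant term -> canonical term.
SYNONYMS = {"short breath": "shortness of breath", "asthenia": "fatigue"}

# Simplified records adapted from the sample raw data (disease, symptoms).
RAW_CASES = [
    ("Hypertension", ["hypertension", "dizziness", "chest distress"]),
    ("Hypertension", ["hypertension", "short breath"]),
    ("Hypertension", ["hypertension", "obesity", "asthenia"]),
    ("Coronary heart disease", ["precordial pain", "short breath"]),
]


def symptom_distribution(cases, synonyms):
    """Group cases by disease and count each canonical symptom."""
    distribution = defaultdict(Counter)
    for disease, symptoms in cases:
        for symptom in symptoms:
            canonical = synonyms.get(symptom, symptom)  # synonym matching
            distribution[disease][canonical] += 1
    return distribution


distribution = symptom_distribution(RAW_CASES, SYNONYMS)
for disease, counts in distribution.items():
    print(disease, dict(counts))
```

The per-disease counts produced here are the kind of input a word frequency weighting step needs.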

23) Calculating weights for symptoms in diseases using text mining based TF-IDF word frequency statistical model

The acquired raw disease symptom data are classified and statistically processed, and the weight of each symptom in a disease is calculated using a text-mining-based TF-IDF (term frequency-inverse document frequency) word frequency statistical model. The mathematical formula of the TF-IDF word frequency statistical model is as follows:

w = TF × IDF = (i/m) × log(N/n) (formula 3)

Wherein TF represents the frequency of a symptom in a disease, obtained by dividing the number of occurrences i of the symptom in the disease by the total number of occurrences m of all symptoms in the disease; IDF represents the probability of a symptom appearing in the disease set, obtained by dividing the number N of diseases in the disease set by the number n of diseases containing the symptom and taking the logarithm of the quotient; the weight of the symptom is the product of TF and IDF.

In order to prevent N from becoming 1 in actual calculation, a correction coefficient may be added to the number of diseases containing the symptom, taking n + 0.1, that is, w = TF × IDF = (i/m) × log(N/(n + 0.1)).
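The weight calculation can be sketched as follows, reading N as the number of diseases in the disease set and n as the number of diseases containing the symptom. The function name, the input layout (disease mapped to symptom counts), the sample counts, and the choice of the natural logarithm are assumptions; the source does not state the logarithm base. Passing `correction=0.1` gives the corrected variant.

```python
import math
from collections import Counter


def tfidf_weights(distribution, correction=0.0):
    """Compute w = (i/m) * log(N / (n + correction)) for each symptom of each disease.

    `distribution` maps disease -> Counter(symptom -> occurrences i).
    """
    big_n = len(distribution)  # N: number of diseases in the disease set
    containing = Counter()     # n: number of diseases containing each symptom
    for counts in distribution.values():
        containing.update(counts.keys())

    weights = {}
    for disease, counts in distribution.items():
        m = sum(counts.values())  # total occurrences of all symptoms in the disease
        weights[disease] = {
            symptom: (i / m) * math.log(big_n / (containing[symptom] + correction))
            for symptom, i in counts.items()
        }
    return weights


# Hypothetical counts, not taken from the source's data.
distribution = {
    "Hypertension": Counter({"elevated blood pressure": 3, "dizziness": 1}),
    "Iron deficiency anemia": Counter({"dizziness": 2, "fatigue": 2}),
}
weights = tfidf_weights(distribution)
corrected = tfidf_weights(distribution, correction=0.1)
print(weights["Hypertension"])
print(corrected["Hypertension"])
```

Without the correction, a symptom that occurs in every disease of the set receives a weight of 0, because log(N/n) = log(1) = 0.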

24) Dimensionless treatment

In the multi-index comprehensive evaluation, physical meanings represented by the indexes are different, so that the indexes are different in dimension, and the overall evaluation of the object is influenced by the different dimension. The dimensionless processing of the index is a main means for solving this problem.

Because physical quantities are related to one another, some independent physical quantities are taken as basic units of measure, and the units of the other physical quantities are derived from them. In this system, the symptom weights are non-dimensionalized, taking the sum of the symptom weights in the disease as the basic unit of measure:

$$W_i = \frac{w_i}{\sum_{j} w_j}$$

where $w_i$ represents the weight of symptom $i$ in the disease before dimensionless treatment, $\sum_{j} w_j$ represents the sum of the weights of all symptoms of the disease before dimensionless treatment, and $W_i$ represents the weight of symptom $i$ in the disease after dimensionless treatment. After non-dimensionalization, the weights of the symptoms in a disease sum to 1:

$$\sum_{i} W_i = 1$$

The raw disease symptom data are processed to obtain the parameters of the disease symptoms. Taking type 2 diabetes as an example, the parameters of its symptoms are shown in Table 1, which lists the TF, IDF, and $W_i$ values of the type 2 diabetes symptoms.

TABLE 1
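The non-dimensionalization step amounts to dividing each symptom weight by the sum of all symptom weights of the same disease. A minimal sketch follows; the `nondimensionalize` name and the sample weights are hypothetical and are not the values of Table 1.

```python
def nondimensionalize(weights: dict[str, float]) -> dict[str, float]:
    """Divide each symptom weight by the sum of all symptom weights of the disease."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("sum of symptom weights is zero; cannot normalize")
    return {symptom: w / total for symptom, w in weights.items()}


# Hypothetical pre-normalization weights for one disease.
raw = {"polydipsia": 0.1, "polyphagia": 0.3, "weight loss": 0.4}
normalized = nondimensionalize(raw)
print(normalized)
print(sum(normalized.values()))
```

After this step the weights of one disease sum to 1, so weights can be compared across diseases that have different numbers of symptoms.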

The invention has the following beneficial effects:

(1) The invention saves a great deal of manpower, financial resources, and time, and the disease symptom weights obtained quantitatively by applying statistical and other methods to massive real cases are more reasonable than the empirical disease symptom weights obtained from domain experts.

(2) Through the algorithm disclosed by the invention, the similarity of the medical words can be accurately calculated, and a foundation with higher reliability is provided for the finally obtained data result.

(3) The data result obtained according to the invention can be used in the knowledge base of the medical guidance system to guide the patient to the corresponding department for accurate diagnosis after the initial diagnosis of the disease is obtained.

(4) The data results obtained according to the invention can also be used in the knowledge base of the internet-based disease self-diagnosis system, the target population of which is common residents rather than a specific doctor group, and the system can be used for making patients preliminarily diagnose according to the symptom information of the patients, so that the patients can know the relevant conditions of diseases in advance for reference.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims

1. A disease symptom and weight knowledge obtaining and processing method based on a disease case library is characterized in that a massive disease case library on the Internet is used as an information source, and the disease symptom and the weight knowledge thereof are automatically obtained by processing original data of the information source; the method comprises the following steps:

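there are two factors that affect word similarity: the number of the same morphemes contained in the two words and the position weight of the same morphemes in each word; the word similarity can then be calculated according to the following formula:

$$\mathrm{sim}(w_1, w_2) = \alpha \cdot \frac{1}{2}\left(\frac{|\mathrm{Same}(w_1, w_2)|}{|w_1|} + \frac{|\mathrm{Same}(w_1, w_2)|}{|w_2|}\right) + \beta \cdot d \cdot \frac{1}{2}\left(\frac{\sum_{i=1}^{|w_1|} \mathrm{Weight}(w_1, i)}{\sum_{i=1}^{|w_1|} i} + \frac{\sum_{j=1}^{|w_2|} \mathrm{Weight}(w_2, j)}{\sum_{j=1}^{|w_2|} j}\right)$$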

4) obtaining the weight of each symptom in the disease;

2. The method for acquiring and processing disease symptoms and weight knowledge thereof based on the disease case library as claimed in claim 1, wherein step 1) acquires original data of the disease symptoms through web crawler strategies by analyzing html tags of web page source codes of the disease cases on the internet, performing tag matching by adopting regular expressions.

3. The method for acquiring and processing knowledge of disease symptoms and their weights based on the case base as claimed in claim 1, wherein step 2) acquires the medical term synonym table by manual screening recognition.

4. The method for acquiring and processing knowledge of disease symptoms and their weights based on the case base of claim 3, wherein the medical term synonym table is further refined according to domain expert recognition methods.

5. The method for acquiring and processing disease symptoms and weight knowledge thereof based on the case base as claimed in claim 1, wherein step 4) uses a text-mining-based TF-IDF word frequency statistical model to calculate and obtain the weight of symptoms in the disease.

6. The method for acquiring and processing disease symptoms and weight knowledge thereof based on the disease case base as claimed in claim 1, wherein the mathematical formula of the TF-IDF word frequency statistical model is as follows: w = TF × IDF = (i/m) × log(N/n); wherein TF represents the frequency of a symptom in a disease, obtained by dividing the number of occurrences i of the symptom in the disease by the total number of occurrences m of all symptoms in the disease; IDF represents the probability of a symptom appearing in the disease set, obtained by dividing the number N of diseases in the disease set by the number n of diseases containing the symptom and taking the logarithm of the quotient; the weight of the symptom is the product of TF and IDF.

7. The method for acquiring and processing disease symptoms and weight knowledge thereof based on the disease case base as claimed in claim 1, wherein the mathematical formula of the TF-IDF word frequency statistical model is as follows: w = TF × IDF = (i/m) × log(N/(n + 0.1)); wherein TF represents the frequency of a symptom in a disease, obtained by dividing the number of occurrences i of the symptom in the disease by the total number of occurrences m of all symptoms in the disease; IDF represents the probability of a symptom appearing in the disease set, obtained by dividing the number N of diseases in the disease set by the number n of diseases containing the symptom (plus the correction coefficient 0.1) and taking the logarithm of the quotient; the weight of the symptom is the product of TF and IDF.

8. The method for acquiring and processing disease symptoms and weight knowledge thereof based on the disease case base according to claim 1, wherein the formula for non-dimensionalizing the weights of the symptoms in the disease with the sum of the weights of the symptoms in the disease as a basic unit of measure is:

$$W_i = \frac{w_i}{\sum_{j} w_j}$$