CN108647203B

CN108647203B - Method for calculating text similarity of traditional Chinese medicine disease conditions

Info

Publication number: CN108647203B
Application number: CN201810359667.9A
Authority: CN
Inventors: 姜晓红; 付钊; 陈广; 杜定益; 吴朝晖
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2018-04-20
Filing date: 2018-04-20
Publication date: 2020-07-07
Anticipated expiration: 2038-04-20
Also published as: CN108647203A

Abstract

The invention discloses a method for calculating the text similarity of traditional Chinese medicine conditions, comprising the following steps: obtaining text blocks by rule-based and statistical phrase identification; dividing the text blocks to obtain text semantic blocks; calculating the weights of the text semantic blocks; calculating the text semantic block vectors; combining the text semantic block features to obtain condition document features; and calculating the text similarity from the condition document features. The method uses text semantic blocks as the minimum granularity for representing the features of a condition text, divides the condition text into text semantic blocks according to the disease positions described, and assigns a different weight to each text semantic block to distinguish primary symptoms from secondary symptoms. It finds similar symptoms in two condition texts by calculating the cosine of the angle between text semantic block vectors, and finally weights these values to obtain the similarity of the two condition texts, thereby overcoming the defects that traditional text similarity calculation methods lose semantic information or cannot distinguish primary from secondary symptoms.

Description

Method for calculating text similarity of traditional Chinese medicine disease conditions

Technical Field

The invention belongs to the technical field of natural language processing, and particularly relates to a method for calculating text similarity of traditional Chinese medicine conditions.

Background

Syndrome differentiation in traditional Chinese medicine usually adopts methods such as assisting physical classification and probing and countering syndromes, and the description of a patient's condition is mostly obtained through the four examinations: inspection, listening and smelling, inquiry, and palpation. Inspection covers the spirit, the complexion, the form, the local parts, the excreta and the tongue; listening and smelling covers sounds and odors; inquiry covers chills and fever, sweating, pain, diet and taste, sleep, stool, menstruation and leukorrhea, and children; palpation covers pulse-taking and touching. The description of the patient's condition obtained in this way is recorded as the condition text.

The traditional Chinese medicine disease description text generally has the following characteristics:

1) The description text is long. A traditional Chinese medicine description of a condition includes many kinds of information, such as physical symptoms and daily life, and the description text often exceeds several hundred characters, so it is a relatively long text;

2) It covers the symptoms of multiple disease positions. Traditional Chinese medicine learns the patient's condition through inspection, listening and smelling, inquiry, and palpation, and the description includes the symptoms of each part of the body;

3) It relies heavily on semantic information. A traditional Chinese medicine condition text contains many sentences stating whether a symptom is present at a body part, and these sentences depend on semantic information. For example, "eyelid edema" and "no eyelid edema" differ by a single word but have completely opposite meanings;

4) The text is interspersed with examination data. With the development of science and technology, traditional Chinese medicine has also begun to examine patients with instruments, measuring for example body temperature and heart rate, and the examination results are mixed into the condition text in numeric form.

In the traditional text similarity calculation method, a bag-of-words model and TF-IDF characteristics are adopted, or domain semantics and subject word characteristics are adopted, so that text semantic information is lost or the semantic information is too simple.

Patent document No. CN103617157A discloses a semantic-based text similarity calculation method, and relates to the technical field of text-oriented intelligent information processing. The method aims to solve the problem that the conventional text vector space model and cosine similarity can not be subjected to semantic correlation judgment. The semantic-based text similarity calculation comprises the following steps: preprocessing a text set, extracting initial characteristic words, and expressing the initial characteristic words into a vector model consisting of keywords and concepts; and then respectively calculating the semantic similarity of the keyword part and the semantic similarity of the concept part, and summing the two parts to finally obtain the semantic similarity of the text.

Disclosure of Invention

The invention aims to provide a method for calculating the similarity of Chinese medical condition texts, which uses text semantic blocks as minimum granularity to represent the characteristics of the condition texts, and calculates the similarity of the two condition texts by calculating cosine values of the included angles of text semantic block vectors of the same disease position in the two Chinese medical condition texts and weighting the cosine values.

A method for calculating the text similarity of the traditional Chinese medical condition comprises the following steps:

(1) based on the phrase identification of rules and statistics, text blocks are obtained from the original Chinese medical condition text: loading a traditional Chinese medicine glossary to a word segmentation toolkit, and segmenting words of the original traditional Chinese medicine illness state text by using a word segmentation tool; removing stop words in the word segmentation result by adopting a stop word library; performing word co-occurrence probability calculation, and combining two words into a phrase to obtain a text block when the parts of speech of the two words accord with a Chinese phrase rule and the co-occurrence probability is greater than a given threshold value;

(2) dividing text blocks to obtain text semantic blocks: carrying out phrase identification and phrase marking on the text block in the step (1) to obtain a disease position phrase and a description phrase, and combining the disease position phrase and the description phrase to obtain a text semantic block;

(3) calculating the weight of the text semantic block;

(4) calculating text semantic blocking vectors;

(5) combining the weights of the text semantic blocks and the text semantic block vectors respectively obtained in the steps (3) and (4) to obtain text semantic block characteristics, and combining a plurality of text semantic block characteristics to obtain illness state document characteristics;

(6) calculating the text similarity from the condition document features.

The text semantic blocking refers to a block formed by a plurality of adjacent phrases or sentences describing the same thing, disease position or symptom, and the granularity of the block is larger than that of the phrase and smaller than that of the segment; the granularity refers to the number of contained Chinese characters.

The text semantic chunk comprises one or more phrases or sentences; the phrases or sentences in the text semantic blocks describe the same disease position, symptom or thing; and the positions of the phrases or sentences in the text semantic blocks are adjacent.

The word co-occurrence probability calculation method in the step (1) comprises the following steps:

Suppose $\{T_1, T_2, T_3, \ldots, T_n\}$ is the result of segmenting all texts into words, where $T_i$ and $T_{i+1}$ are words, $n$ is the total number of words in the segmentation result, and each $T_i$ consists of one or more characters, denoted $w_1 w_2 \ldots w_m$. The algorithm comprises the following steps:

dividing the segmented text into bigrams by grouping adjacent words in pairs, wherein each bigram is $T_i T_{i+1}$;

counting the frequency $P(T_i)$ of each word in the segmentation result and the frequency $P(T_i T_{i+1})$ of each bigram $T_i T_{i+1}$;

calculating, for each word $T_i$ ($i \in \{1, 2, \ldots, n\}$), the probability that the next word $T_{i+1}$ occurs given that $T_i$ has occurred, i.e. the conditional probability $P(T_{i+1} \mid T_i) = P(T_i T_{i+1}) / P(T_i)$.

The method for combining phrases in step (1) comprises: traversing the word segmentation result and merging into phrases the word strings that conform to the part-of-speech collocation rules of Chinese phrases and whose $P(T_{i+1} \mid T_i)$ is greater than a given threshold $\alpha$.
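
The co-occurrence calculation and phrase merging described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the conditional probability is estimated as a ratio of raw counts, only pairs of adjacent words are merged, and the part-of-speech tags, the `allowed_pos_pairs` rule set and the `joiner` parameter are hypothetical names introduced for the example.

```python
from collections import Counter


def cooccurrence_probs(tokens):
    """Estimate P(T_{i+1} | T_i) as count(T_i T_{i+1}) / count(T_i)."""
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    return {pair: count / unigram[pair[0]] for pair, count in bigram.items()}


def merge_phrases(tagged, probs, allowed_pos_pairs, alpha, joiner=" "):
    """Merge adjacent (word, pos) pairs into phrases.

    A pair is merged when its part-of-speech combination is in
    allowed_pos_pairs (a hypothetical stand-in for the Chinese phrase
    rules) and its co-occurrence probability exceeds the threshold alpha.
    """
    merged = []
    i = 0
    while i < len(tagged):
        word, pos = tagged[i]
        if i + 1 < len(tagged):
            nxt, nxt_pos = tagged[i + 1]
            rule_ok = (pos, nxt_pos) in allowed_pos_pairs
            if rule_ok and probs.get((word, nxt), 0.0) > alpha:
                merged.append(word + joiner + nxt)
                i += 2  # both words were consumed by the merged phrase
                continue
        merged.append(word)
        i += 1
    return merged
```

For Chinese text the `joiner` would normally be the empty string; a space is used here only so that English demonstration tokens stay readable.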

The method for phrase identification and phrase marking in step (2) comprises: matching the words in a phrase against the words in the disease position lexicon; if the match succeeds, the phrase is marked as a disease position phrase (Position Phrase, PP); otherwise it is marked as a description phrase (Description Phrase, DP), i.e. a phrase describing the symptoms of a disease position.

The disease position lexicon comprises disease position words from nine major systems of the human body: the locomotor, digestive, respiratory, urinary, reproductive, endocrine, immune, nervous and circulatory systems.

In order to correctly label sentences that do not describe the symptoms of any disease position, the first phrase after a period (i.e., the first phrase of the next sentence) is labeled PP. Finally, the text is labeled as follows:

$$D_k = \{PP_1, DP_{11}, DP_{12}, \ldots, DP_{1m}, \ldots, PP_i, DP_{i1}, DP_{i2}, \ldots, DP_{in}\}$$

wherein $D_k$ is the $k$-th document, $PP_i$ is the $i$-th disease position phrase, and $DP_{ij}$ is the $j$-th description phrase following the $i$-th disease position phrase. $PP_i$ and the description phrases $DP_{ij}$ ($j = 1, 2, \ldots, n$) that follow it are then combined into a block, which gives the text semantic block $B_i$.
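
The labeling and grouping rule above can be sketched as follows. This is a simplified illustration under stated assumptions: phrases are plain strings, a phrase counts as a disease position phrase when any lexicon word occurs in it as a substring, and the function name `build_semantic_blocks` is hypothetical.

```python
def build_semantic_blocks(sentences, position_words):
    """Group phrases into text semantic blocks.

    sentences: list of sentences, each a list of phrase strings.
    position_words: iterable of disease position words (the lexicon).
    A phrase opens a new block when it is labeled PP, i.e. when it
    contains a lexicon word or is the first phrase of a sentence;
    every other phrase is a DP and joins the current block.
    """
    blocks = []
    for phrases in sentences:
        for idx, phrase in enumerate(phrases):
            is_pp = idx == 0 or any(word in phrase for word in position_words)
            if is_pp:
                blocks.append([phrase])
            else:
                blocks[-1].append(phrase)
    return blocks
```

Substring matching is the simplest choice for a sketch; with a real word segmenter, matching on whole segmented words would avoid accidental hits inside longer words.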

The weight of a text semantic block in step (3) refers to the weight that the text semantic block carries when the similarity of condition documents is calculated. A text semantic block contains a disease position word, and the weight of the text semantic block is represented by the weight of that disease position word; the weight of the disease position word is obtained by calculating the document frequency (DF) value of the disease position word in the corpus.

The symptoms of a disease include primary symptoms and secondary symptoms: a primary symptom is one that the disease must exhibit, and a secondary symptom is a complication that the disease may cause. Therefore, the similarity calculation for condition texts needs to take the primary or secondary status of symptoms into account rather than treating all symptoms alike. For example, the primary symptom of a cold is fever, while cough is a secondary symptom. Suppose the conditions of two cold patients are described as "fever" and "fever without cough". If primary and secondary symptoms are not distinguished, the calculated similarity of the two descriptions is very low, although both cases are in fact colds and are very similar.

For example, if the corpus contains $N$ original traditional Chinese medicine condition texts, the document frequency (DF) value of each disease position word across the $N$ texts can be calculated. The higher the DF value, the more doctors tend to ask about symptoms at that disease position, and the more likely those symptoms are primary symptoms for distinguishing the disease type. The weight $w_i$ is calculated as:

$$w_i = df_i + \alpha$$

wherein $df_i$ is the document frequency of disease position word $i$, obtained from $n_i$, the number of texts in the corpus in which word $i$ appears, and $\alpha$ is the base weight, i.e., the weight of a text semantic block that does not contain any disease position word.

The text semantic blocking vector calculation method in the step (4) comprises the following steps:

(4-1) after the text has been divided into blocks, each text semantic block is taken as one complete input to Doc2vec, and vector training is carried out to obtain a Doc2vec model;

(4-2) each text semantic block of the document is converted into a corresponding vector by the Doc2vec model, so that the whole condition document is converted into a sequence of block vectors. Let $w_m$ denote the weight of text semantic block $B_m$ and $\mathrm{vec}(B_m)$ denote the feature vector of $B_m$; the feature $F(D_k)$ of condition document $D_k$ is then:

$$F(D_k) = ((w_1, \mathrm{vec}(B_1)), (w_2, \mathrm{vec}(B_2)), \ldots, (w_m, \mathrm{vec}(B_m)))$$

For example, let document $D_k$ contain $m$ text semantic blocks, i.e. $D_k = \{B_1, B_2, B_3, \ldots, B_m\}$, where $B_i$ is text semantic block $i$ and is composed of several sentences or phrases. If the feature of block $B_i$ is $F(B_i)$, then:

$$F(B_i) = (w_{B_i}, \mathrm{vec}(B_i))$$

wherein $w_{B_i}$ is the weight of text semantic block $B_i$ and $\mathrm{vec}(B_i)$ is the feature vector of text semantic block $B_i$. The feature $F(D_k)$ of condition document $D_k$ can then be written as:

$$F(D_k) = (F(B_1), F(B_2), F(B_3), \ldots, F(B_m))$$
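
The way block weights and block vectors are paired into a document feature can be sketched as follows. To keep the example self-contained, `toy_embed` is a hypothetical stand-in that hashes tokens into a small count vector; it is not Doc2vec, and in practice the `embed` callable would wrap inference with a trained Doc2vec model.

```python
import hashlib


def toy_embed(tokens, dim=8):
    """Stand-in for a trained Doc2vec model: hashed bag-of-words counts."""
    vec = [0.0] * dim
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    return vec


def document_feature(blocks, weights, embed=toy_embed):
    """Build F(D_k) = ((w_1, vec(B_1)), ..., (w_m, vec(B_m))).

    blocks: list of text semantic blocks, each a list of tokens.
    weights: one weight per block, in the same order.
    """
    if len(blocks) != len(weights):
        raise ValueError("one weight per block is required")
    return [(w, embed(block)) for block, w in zip(blocks, weights)]
```

Keeping the weight next to its vector, rather than scaling the vector by the weight, preserves both values for the later similarity step, which uses them separately.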

the method for calculating the text similarity in the step (6) measures the symptom similarity of the same disease location by calculating the cosine similarity of the text semantic block vectors, and weights are adopted to obtain the similarity of the Chinese medical condition texts.

The cosine similarity calculation method comprises the following steps:

wherein $B_{1p}$ is the text semantic block numbered $p$ in the first traditional Chinese medicine condition text and $B_{2q}$ is the text semantic block numbered $q$ in the second traditional Chinese medicine condition text; $\mathrm{vec}(B_{1p})$ and $\mathrm{vec}(B_{2q})$ are the block vectors of $B_{1p}$ and $B_{2q}$ respectively; $w_{1p}$ is the weight of $B_{1p}$; $f(w_{1p}, w_{2q})$ is the weight of $\mathrm{Sim}(\mathrm{vec}(B_{1p}), \mathrm{vec}(B_{2q}))$; and $|\mathrm{vec}(B_{1p})|$ and $|\mathrm{vec}(B_{2q})|$ are the norms of $\mathrm{vec}(B_{1p})$ and $\mathrm{vec}(B_{2q})$.

The value of $f(w_{1p}, w_{2q})$ has the following meaning: when block $B_{1p}$ and block $B_{2q}$ describe the same disease position, the cosine of the angle between the corresponding block vectors is calculated; otherwise, when the weight $w_{1p}$ of block $B_{1p}$ is not equal to the weight $w_{2q}$ of block $B_{2q}$, the two blocks do not describe the same disease position, and calculating their similarity has no value.
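
One reading of the description above can be sketched as follows. The formula itself is not reproduced in this text, so the sketch assumes that the block similarity is the cosine of the angle between the two block vectors multiplied by $f(w_{1p}, w_{2q})$, with $f$ equal to 1 when the two weights are equal and 0 otherwise. The function names and the tolerance `tol` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors, chosen for this sketch
    return dot / (norm_u * norm_v)


def gate(w1, w2, tol=1e-9):
    """Assumed f(w_1p, w_2q): 1 if the weights match, otherwise 0."""
    return 1.0 if abs(w1 - w2) <= tol else 0.0


def block_similarity(w1, vec1, w2, vec2):
    """Assumed Sim(vec(B_1p), vec(B_2q)) = f(w_1p, w_2q) * cosine."""
    return gate(w1, w2) * cosine(vec1, vec2)
```

Because equal weights serve as the test for "same disease position", two different disease positions that happen to share the same DF value would also pass the gate; comparing the disease position words directly would be a stricter alternative.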

The calculation method for obtaining the similarity of the Chinese medical condition texts by weighting the weights comprises the following steps:

wherein $\mathrm{Sim}(\mathrm{vec}(B_{1p}), \mathrm{vec}(B_{2q}))$ is the cosine similarity of block vectors $\mathrm{vec}(B_{1p})$ and $\mathrm{vec}(B_{2q})$, and $m$ and $n$ are the numbers of text semantic blocks in the first and second traditional Chinese medicine condition texts respectively.

Suppose the block vector features $F(D_1)$ and $F(D_2)$ calculated for two traditional Chinese medicine condition texts $D_1$ and $D_2$ are represented as follows:

$$F(D_1) = ((w_{11}, \mathrm{vec}(B_{11})), (w_{12}, \mathrm{vec}(B_{12})), \ldots, (w_{1m}, \mathrm{vec}(B_{1m})))$$

$$F(D_2) = ((w_{21}, \mathrm{vec}(B_{21})), (w_{22}, \mathrm{vec}(B_{22})), \ldots, (w_{2n}, \mathrm{vec}(B_{2n})))$$

wherein $B_{ij}$ is the text semantic block numbered $j$ in the $i$-th condition text, $w_{ij}$ is the weight of text semantic block $B_{ij}$, $\mathrm{vec}(B_{ij})$ is the block vector of text semantic block $B_{ij}$, and $m$ and $n$ are the numbers of text semantic blocks in condition texts 1 and 2 respectively.

For traditional Chinese medicine condition texts $D_1$ and $D_2$, the degree of similarity of symptoms at the same disease position is measured by calculating the cosine similarity of the text semantic block vectors, and the similarity of the two condition texts is obtained by weighting these values with the block weights.
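
A possible document-level aggregation is sketched below. The weighting formula is not reproduced in this text, so everything about the aggregation here is an assumption: matched block pairs contribute their cosine similarity scaled by the sum of the two block weights, and the total is normalized by the sum of all block weights in both documents. The function names are hypothetical.

```python
import math


def _cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def document_similarity(feat1, feat2, tol=1e-9):
    """Weighted similarity of two documents given F(D_1) and F(D_2).

    feat1, feat2: lists of (weight, vector) pairs.
    Assumed aggregation (not taken from the source):
        sum over matched pairs of (w_1p + w_2q) * cosine
        divided by (sum of all w_1i + sum of all w_2j).
    """
    total = sum(w for w, _ in feat1) + sum(w for w, _ in feat2)
    if total == 0.0:
        return 0.0
    score = 0.0
    for w1, v1 in feat1:
        for w2, v2 in feat2:
            # Equal weights are treated as "same disease position".
            if abs(w1 - w2) <= tol:
                score += (w1 + w2) * _cosine(v1, v2)
    return score / total
```

Under this normalization, a document whose blocks all have distinct weights scores 1.0 against itself, and blocks with larger weights (more likely primary symptoms) move the result more than blocks with smaller weights.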

In the present invention, a traditional Chinese medicine condition text comprises symptom descriptions for a plurality of disease positions. The symptom description of one disease position usually corresponds to one or more phrases or sentences, so a condition text can be regarded as a set of text semantic blocks, each describing the symptoms of one disease position.

The method for calculating the similarity of traditional Chinese medicine condition texts overcomes the defects that traditional text similarity calculation methods lose semantic information or cannot distinguish primary from secondary symptoms. It represents the features of a condition text with the text semantic blocks of the disease positions as the minimum granularity, and calculates the similarity of two condition texts by calculating, and then weighting, the cosine of the angle between the text semantic block vectors of the same disease position in the two texts.

Drawings

FIG. 1 is a schematic flow chart of a calculation method provided by the present invention;

FIG. 2 is a schematic diagram of the specific process for obtaining text semantic blocks in the calculation method provided by the present invention.

Detailed Description

In order to more specifically describe the present invention, the following detailed description is provided for the technical solution of the present invention with reference to the accompanying drawings and the specific embodiments. It is to be understood that such description is merely illustrative of the features and advantages of the present invention, and is not intended to limit the scope of the claims.

As shown in FIG. 1, a method for calculating the text similarity of traditional Chinese medicine conditions comprises the following steps.

(1) Obtaining text blocks by rule-based and statistical phrase identification.

The specific flow is shown in FIG. 2: a traditional Chinese medicine glossary is loaded into the word segmentation toolkit, and the word segmentation tool is used to segment the original traditional Chinese medicine condition text; stop words in the segmentation result are removed using a stop word lexicon; word co-occurrence probabilities are calculated, and two words are combined into a phrase, yielding a text block, when their parts of speech conform to the Chinese phrase rules and their co-occurrence probability is greater than a given threshold.

(2) Dividing the text blocks to obtain text semantic blocks.

Phrase identification and phrase marking are performed to obtain disease position phrases and description phrases, which are combined to obtain text semantic blocks.

(3) Calculating the weights of the text semantic blocks.

(4) Calculating the text semantic block vectors.

(5) Combining the text semantic block features to obtain the condition document features.

Suppose there are two condition texts, text A and text B, with the following contents:

Text A:

No jugular vein distension, red pharynx, no swelling of the tonsils, slightly coarse breath sounds in both lungs, no obvious dry or wet rales; the abdomen is soft, with no tenderness or rebound pain, no percussion pain over either kidney, and no swelling of the lower limbs.

Text B:

Febrile appearance, red and congested pharynx, no obvious swelling of the tonsils, coarse breath sounds in both lungs, no obvious dry or wet rales, and increased lung markings.

Similarity calculation is carried out on text A and text B.

1. The following texts are obtained through the word segmentation and stop word removal of step (1), wherein the Ansj word segmentation tool is used for word segmentation:

Text A:

No jugular vein distension, red pharynx, no swelling of the tonsils, coarse breath sounds in both lungs, no dry or wet rales; soft abdomen, no tenderness or rebound pain, no percussion pain over either kidney, no swelling of the lower limbs.

Text B:

Febrile appearance, red and congested pharynx, no swelling of the tonsils, coarse breath sounds in both lungs, no dry or wet rales, increased lung markings.

2. Words are combined through the Chinese phrase rules and the word co-occurrence probability calculation of step (1), with the following results:

Text A:

{jugular vein} {no distension,} {pharynx red,} {tonsils} {no swelling,} {both lungs} {breath sounds coarse,} {no} {dry and wet rales;} {abdomen soft,} {no tenderness} {rebound pain,} {both kidneys} {no percussion pain,} {both lower limbs} {no swelling.}

Text B:

{febrile appearance,} {pharynx red} {congested,} {tonsils} {no swelling,} {both lungs} {breath sounds coarse,} {no} {dry and wet rales,} {markings increased.}

3. Disease position phrases and description phrases are obtained through the phrase identification and phrase marking of step (2) and are combined to obtain text semantic blocks. Jugular vein, pharynx, tonsil, lung, abdomen, kidney and lower limb are all disease position words, and the results are as follows:

Text A:

{jugular vein no distension,} {pharynx red,} {tonsils no swelling,} {both lungs breath sounds coarse, no dry or wet rales;} {abdomen soft, no tenderness or rebound pain,} {both kidneys no percussion pain,} {both lower limbs no swelling.}

Text B:

{febrile appearance,} {pharynx red and congested,} {tonsils no swelling,} {both lungs breath sounds coarse, no dry or wet rales, markings increased.}

4. After the blocks are divided, text A comprises text semantic blocks for 7 disease positions and text B comprises text semantic blocks for 4 disease positions. Assuming that the corpus comprises only texts A and B and that α = 1, the number, weight and vector of each block are shown in Table 1, wherein the text semantic block vectors depend on the corpus used to train the Doc2vec model and need to be calculated from the actual corpus.

TABLE 1 weights and vectors for text semantic segmentation in text A and text B

5. The text semantic block features of text A and text B in Table 1 are combined to obtain the condition document features.

6. The text similarity is calculated from the condition document features.

The similarity of text A and text B is calculated according to the following formula.

In the text semantic blocks of the text A and the text B, A2 and B2 are text semantic blocks with the same disease position, A3 and B3 are text semantic blocks with the same disease position, and A4 and B4 are text semantic blocks with the same disease position.

Then

Wherein:

$\mathrm{Sim}(\mathrm{vec}(A2), \mathrm{vec}(B2))$ is the cosine of the angle between the vectors of text semantic blocks A2 and B2;

$\mathrm{Sim}(\mathrm{vec}(A3), \mathrm{vec}(B3))$ is the cosine of the angle between the vectors of text semantic blocks A3 and B3;

$\mathrm{Sim}(\mathrm{vec}(A4), \mathrm{vec}(B4))$ is the cosine of the angle between the vectors of text semantic blocks A4 and B4.

The embodiments described above are intended to facilitate one of ordinary skill in the art in understanding and using the invention. It will be readily apparent to those skilled in the art that various modifications to the above-described embodiments may be made, and the generic principles defined herein may be applied to other embodiments without the use of inventive faculty. Therefore, the present invention is not limited to the above embodiments, and those skilled in the art should make improvements and modifications to the present invention based on the disclosure of the present invention within the protection scope of the present invention.

Claims

1. A method for calculating the text similarity of the traditional Chinese medical condition comprises the following steps:

(3) calculating the weight of the text semantic block;

(4) calculating text semantic blocking vectors;

(6) calculating text similarity according to the characteristics of the disease documents;

the method for calculating the text similarity in the step (6) comprises the following steps: measuring the symptom similarity degree of the same disease location by calculating the cosine similarity of the semantic block vectors, and weighting by adopting weight to obtain the similarity of the Chinese medical condition texts;

the cosine similarity calculation method comprises the following steps:

wherein $B_{1p}$ is the text semantic block numbered $p$ in the first traditional Chinese medicine condition text and $B_{2q}$ is the text semantic block numbered $q$ in the second traditional Chinese medicine condition text; $w_{1p}$ is the weight of $B_{1p}$; $f(w_{1p}, w_{2q})$ is the weight of $\mathrm{Sim}(\mathrm{vec}(B_{1p}), \mathrm{vec}(B_{2q}))$; and $|\mathrm{vec}(B_{1p})|$ and $|\mathrm{vec}(B_{2q})|$ are the norms of $\mathrm{vec}(B_{1p})$ and $\mathrm{vec}(B_{2q})$;

$\mathrm{Sim}(\mathrm{vec}(B_{1p}), \mathrm{vec}(B_{2q}))$ is the cosine similarity of block vectors $\mathrm{vec}(B_{1p})$ and $\mathrm{vec}(B_{2q})$, and the numbers of text semantic blocks in the first and second traditional Chinese medicine condition texts are $m$ and $n$ respectively;

$D_1$ is the first condition text, $D_2$ is the second condition text, $w_{2q}$ is the weight of $B_{2q}$, $i \in (1, 2, 3, \ldots, m)$, $j \in (1, 2, 3, \ldots, n)$, $w_{1i}$ is the weight of the $i$-th text semantic block in the first condition text, and $w_{2j}$ is the weight of the $j$-th text semantic block in the second condition text.

2. The method for calculating the text similarity of traditional Chinese medicine conditions according to claim 1, wherein the text semantic block comprises one or more phrases or sentences; the phrases or sentences in the text semantic block describe the same disease position, symptom or thing; and the phrases or sentences in the text semantic block are adjacent in position.

3. The method for calculating the text similarity of traditional Chinese medicine conditions according to claim 1, wherein the method for phrase identification and phrase marking in step (2) comprises: matching the words in a phrase against the words in the disease position lexicon, marking the phrase as a disease position phrase if the match succeeds, and otherwise marking the phrase as a description phrase.

4. The method for calculating the text similarity of traditional Chinese medicine conditions according to claim 1, wherein the weight of the text semantic block in step (3) refers to the weight of the text semantic block in calculating the similarity of condition documents, and the text semantic block comprises a disease position word; the weight of the text semantic block is represented by the weight of the disease position word; and the weight of the disease position word is obtained by calculating the document frequency (DF) value of the disease position word in the corpus.

5. The method for calculating the text similarity of the traditional Chinese medical conditions according to claim 1, wherein the method for calculating the text semantic blocking vector in the step (4) comprises the following steps:

(1) after the text has been divided into blocks, taking each text semantic block as one complete input to Doc2vec, and carrying out vector training to obtain a Doc2vec model;

(2) converting each text semantic block of the document into a corresponding vector through the Doc2vec model, thereby converting the whole condition document into a sequence of block vectors; letting $w_m$ denote the weight of text semantic block $B_m$ and $\mathrm{vec}(B_m)$ denote the feature vector of text semantic block $B_m$, the feature $F(D_k)$ of condition document $D_k$ is:

$$F(D_k) = ((w_1, \mathrm{vec}(B_1)), (w_2, \mathrm{vec}(B_2)), \ldots, (w_m, \mathrm{vec}(B_m)))$$