CN108647203B - Method for calculating text similarity of traditional Chinese medicine disease conditions - Google Patents

Info

Publication number
CN108647203B
CN108647203B CN201810359667.9A CN201810359667A CN108647203B CN 108647203 B CN108647203 B CN 108647203B CN 201810359667 A CN201810359667 A CN 201810359667A CN 108647203 B CN108647203 B CN 108647203B
Authority
CN
China
Prior art keywords
text
disease
calculating
semantic
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810359667.9A
Other languages
Chinese (zh)
Other versions
CN108647203A (en)
Inventor
姜晓红
付钊
陈广
杜定益
吴朝晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN201810359667.9A
Publication of CN108647203A
Application granted
Publication of CN108647203B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Epidemiology (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for calculating the text similarity of traditional Chinese medicine disease conditions, comprising the following steps: obtaining text blocks through rule-based and statistical phrase recognition; dividing the text blocks to obtain text semantic blocks; calculating the weight of each text semantic block; calculating the text semantic block vectors; combining the text semantic block features to obtain the disease-condition document features; and calculating the text similarity from the disease-condition document features. The method uses the text semantic block as the minimum granularity for representing a disease-condition text: it divides the text into text semantic blocks according to the disease location described, assigns each block a weight to distinguish primary symptoms from secondary symptoms, finds the similar symptoms of two disease-condition texts by calculating the cosine of the angle between their text semantic block vectors, and finally weights these cosine values by the block weights to obtain the similarity of the two texts. This overcomes the shortcomings of traditional text similarity methods, which lose semantic information or cannot highlight which causes of disease are primary and which are secondary.

Description

Method for calculating text similarity of traditional Chinese medicine disease conditions
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a method for calculating text similarity of traditional Chinese medicine conditions.
Background
Syndrome differentiation in traditional Chinese medicine usually adopts methods such as assisting physical classification, probing, and countering syndromes. The description of a patient's condition is mostly obtained through the four examinations: inspection (observing the spirit, complexion, body form, local areas, excreta, and tongue); listening and smelling (listening to sounds and smelling odors); inquiry (asking about chills and fever, sweating, pain, diet and taste, sleep, stool, menstruation and leukorrhea, and children); and pulse-taking and palpation. The findings are recorded as the description of the disease condition.
The traditional Chinese medicine disease description text generally has the following characteristics:
1) The description text is long. A traditional Chinese medicine description of a disease condition covers many kinds of information, such as physical symptoms and daily life, so the text often runs to hundreds of characters and counts as a long text;
2) It covers symptoms at multiple disease locations. Traditional Chinese medicine syndrome differentiation learns about the patient's condition through inspection, listening and smelling, inquiry, and palpation, so the description includes the symptoms of each part of the body;
3) It relies heavily on semantic information. A traditional Chinese medicine disease-condition text contains many sentences stating whether a symptom is present at a body part, and these sentences depend on semantic information. For example, "eyelid edema" and "no eyelid edema" differ by only one word, yet their meanings are completely opposite;
4) The text is interspersed with examination data. With the development of science and technology, traditional Chinese medicine has also begun to examine patients with instruments, measuring, for example, body temperature and heart rate, and the results appear in the disease-condition text in numerical form.
Traditional text similarity calculation methods use a bag-of-words model with TF-IDF features, or domain semantics and topic-word features, so that the semantic information of the text is either lost or represented too simply.
Patent document No. CN103617157A discloses a semantic-based text similarity calculation method, and relates to the technical field of text-oriented intelligent information processing. The method aims to solve the problem that the conventional text vector space model and cosine similarity can not be subjected to semantic correlation judgment. The semantic-based text similarity calculation comprises the following steps: preprocessing a text set, extracting initial characteristic words, and expressing the initial characteristic words into a vector model consisting of keywords and concepts; and then respectively calculating the semantic similarity of the keyword part and the semantic similarity of the concept part, and summing the two parts to finally obtain the semantic similarity of the text.
Disclosure of Invention
The invention aims to provide a method for calculating the similarity of Chinese medical condition texts, which uses text semantic blocks as minimum granularity to represent the characteristics of the condition texts, and calculates the similarity of the two condition texts by calculating cosine values of the included angles of text semantic block vectors of the same disease position in the two Chinese medical condition texts and weighting the cosine values.
A method for calculating the text similarity of the traditional Chinese medical condition comprises the following steps:
(1) based on the phrase identification of rules and statistics, text blocks are obtained from the original Chinese medical condition text: loading a traditional Chinese medicine glossary to a word segmentation toolkit, and segmenting words of the original traditional Chinese medicine illness state text by using a word segmentation tool; removing stop words in the word segmentation result by adopting a stop word library; performing word co-occurrence probability calculation, and combining two words into a phrase to obtain a text block when the parts of speech of the two words accord with a Chinese phrase rule and the co-occurrence probability is greater than a given threshold value;
(2) dividing text blocks to obtain text semantic blocks: carrying out phrase identification and phrase marking on the text block in the step (1) to obtain a disease position phrase and a description phrase, and combining the disease position phrase and the description phrase to obtain a text semantic block;
(3) calculating the weight of the text semantic block;
(4) calculating text semantic blocking vectors;
(5) combining the weights of the text semantic blocks and the text semantic block vectors respectively obtained in the steps (3) and (4) to obtain text semantic block characteristics, and combining a plurality of text semantic block characteristics to obtain illness state document characteristics;
(6) and calculating the text similarity according to the characteristics of the disease documents.
The text semantic blocking refers to a block formed by a plurality of adjacent phrases or sentences describing the same thing, disease position or symptom, and the granularity of the block is larger than that of the phrase and smaller than that of the segment; the granularity refers to the number of contained Chinese characters.
The text semantic chunk comprises one or more phrases or sentences; the phrases or sentences in the text semantic blocks describe the same disease position, symptom or thing; and the positions of the phrases or sentences in the text semantic blocks are adjacent.
The word co-occurrence probability calculation method in the step (1) comprises the following steps:
Suppose $\{T_1, T_2, T_3, \ldots, T_n\}$ is the result of segmenting all the texts, where $T_i$ and $T_{i+1}$ are words, $n$ is the total number of words in the segmentation result, and each $T_i$ consists of one or more characters, denoted $w_1 w_2 \ldots w_m$. The algorithm comprises the following steps:
Divide the segmented text into 2-tuples by grouping adjacent words in pairs, where each 2-tuple is $T_i T_{i+1}$;
Count the frequency $P(T_i)$ of each word in the segmentation result, and count the frequency $P(T_i T_{i+1})$ of each 2-tuple $T_i T_{i+1}$;
Calculate, for each word $T_i$ ($i \in 1, 2, \ldots, n$), the conditional probability $P(T_{i+1} \mid T_i)$ that $T_{i+1}$ occurs given that $T_i$ has occurred.
The method for combining phrases in step (1) comprises: traversing the segmentation result and merging into phrases the word strings that conform to the part-of-speech collocation rules of Chinese phrases and for which $P(T_{i+1} \mid T_i)$ is greater than a given threshold $\alpha$.
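The counting and merging steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the conditional probability is estimated as the count of the pair divided by the count of its first word, the part-of-speech rule set `ALLOWED_POS_PAIRS` and its tags are hypothetical, and merging is done greedily from left to right on word pairs only.

```python
from collections import Counter

# Hypothetical part-of-speech collocation rules: tag pairs allowed to merge.
# The real rule set for Chinese phrases is not given in the text.
ALLOWED_POS_PAIRS = {("n", "a"), ("n", "v"), ("d", "a"), ("d", "v")}


def conditional_probs(tokens):
    """Estimate P(T_{i+1} | T_i) as count(T_i T_{i+1}) / count(T_i)."""
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    return {pair: count / unigram[pair[0]] for pair, count in bigram.items()}


def merge_phrases(tagged, probs, alpha):
    """Greedily merge adjacent (word, pos) pairs into two-word phrases.

    A pair is merged when its tag pair is allowed and its conditional
    probability is strictly greater than the threshold alpha.
    Phrases are returned as tuples of words.
    """
    phrases = []
    i = 0
    while i < len(tagged):
        word, pos = tagged[i]
        if i + 1 < len(tagged):
            nxt, nxt_pos = tagged[i + 1]
            allowed = (pos, nxt_pos) in ALLOWED_POS_PAIRS
            if allowed and probs.get((word, nxt), 0.0) > alpha:
                phrases.append((word, nxt))
                i += 2
                continue
        phrases.append((word,))
        i += 1
    return phrases
```

For instance, `merge_phrases(tagged, conditional_probs([w for w, _ in tagged]), alpha=0.8)` merges only those pairs whose estimated probability exceeds 0.8; raising `alpha` yields fewer, more reliable phrases.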
The method for phrase recognition and phrase labeling in step (2) comprises: matching the words in a phrase against the words in the disease-location lexicon; if the match succeeds, the phrase is labeled a disease-location phrase (Position Phrase, PP), otherwise it is labeled a description phrase (Description Phrase, DP), that is, a phrase describing the symptoms at a disease location.
The disease-location lexicon contains disease-location words from the nine major systems of the human body: the locomotor system, digestive system, respiratory system, urinary system, reproductive system, endocrine system, immune system, nervous system, and circulatory system.
So that sentences that do not describe the symptoms of any disease location are still labeled correctly, the first phrase after a period (i.e., the first phrase of the next sentence) is labeled PP. The text is finally labeled as follows:
$D_k = \{PP_1, DP_{11}, DP_{12}, \ldots, DP_{1m}, \ldots, PP_i, DP_{i1}, DP_{i2}, \ldots, DP_{in}\}$
where $D_k$ is the $k$-th document, $PP_i$ is the $i$-th disease-location phrase, and $DP_{ij}$ is the $j$-th description phrase following the $i$-th disease-location phrase. $PP_i$ and the description phrases $DP_{ij}$ ($j = 1, 2, \ldots, n$) that follow it are then combined into a block, which gives the text semantic block $B_i$.
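The labeling and grouping just described might look like the following Python sketch. The lexicon `LOCATION_WORDS` is a tiny hypothetical excerpt, phrases are represented as tuples of words, and forcing the first phrase of every sentence (including the first sentence of the document) to PP is an assumption made here for simplicity.

```python
# Hypothetical excerpt of a disease-location lexicon; a real lexicon would
# cover the body systems listed in the description.
LOCATION_WORDS = {"pharynx", "tonsil", "lungs", "abdomen", "kidneys"}


def label_phrases(sentences):
    """Label every phrase as 'PP' (disease location) or 'DP' (description).

    `sentences` is a list of sentences, each a list of phrases, and each
    phrase is a tuple of words. The first phrase of each sentence is
    labeled PP even if it contains no lexicon word (assumed rule).
    """
    labeled = []
    for sentence in sentences:
        for idx, phrase in enumerate(sentence):
            has_location = any(word in LOCATION_WORDS for word in phrase)
            label = "PP" if has_location or idx == 0 else "DP"
            labeled.append((label, phrase))
    return labeled


def build_blocks(labeled):
    """Start a new block at each PP and attach the following DPs to it."""
    blocks = []
    for label, phrase in labeled:
        if label == "PP" or not blocks:
            blocks.append([phrase])
        else:
            blocks[-1].append(phrase)
    return blocks
```

Each returned block is a list of phrases whose first element is the PP, so the disease-location word of a block can be looked up from its first phrase.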
The weight of a text semantic block in step (3) is the weight the block carries when the similarity of disease-condition documents is calculated. A text semantic block contains a disease-location word, and the weight of the block is given by the weight of that word; the weight of a disease-location word is obtained by calculating its document frequency (DF) value in the corpus.
The symptoms of a disease include primary symptoms and secondary symptoms: a primary symptom is one that the disease must exhibit, and a secondary symptom is a complication that the disease may cause. Similarity calculation for disease-condition texts therefore needs to take the primary or secondary status of symptoms into account rather than treating all symptoms alike. For example, the primary symptom of a cold is fever, while cough is a secondary symptom. Suppose the conditions of two cold patients are described as "fever" and "fever without cough". If primary and secondary symptoms are not distinguished when calculating similarity, the two descriptions come out with very low similarity, although both patients in fact have a cold and their conditions are very similar.
For example, if the corpus contains N original traditional Chinese medicine disease-condition texts, the document frequency (DF) value of each disease-location word across the N texts can be calculated. The higher the DF value, the more doctors tend to ask about symptoms at that disease location, and the more likely those symptoms are primary symptoms for distinguishing the type of disease. The weight $w_i$ is calculated as follows:
Figure BDA0001635629390000051
$w_i = df_i$
where $n_i$ is the number of texts in the corpus in which word $i$ appears, and $\alpha$ is the base weight, i.e., the weight of a text semantic block that does not contain any disease-location word.
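A small Python sketch of DF-based block weights follows. Because the exact DF formula is not reproduced in the text, the sketch assumes the common normalized form, the number of texts containing the word divided by the total number of texts; the function names and the rule of taking the first disease-location word found in a block are also assumptions.

```python
from collections import Counter


def location_weights(documents, lexicon):
    """Return an assumed DF weight n_i / N for every lexicon word seen.

    `documents` is a list of token lists; `lexicon` is the set of
    disease-location words. Each word is counted once per document.
    """
    n_docs = len(documents)
    counts = Counter()
    for tokens in documents:
        counts.update(set(tokens) & lexicon)
    return {word: count / n_docs for word, count in counts.items()}


def block_weight(block_words, weights, alpha):
    """Weight of a block: DF weight of its first disease-location word,
    or the base weight alpha when the block contains none."""
    for word in block_words:
        if word in weights:
            return weights[word]
    return alpha
```

With this form, a disease-location word that appears in every text of the corpus gets weight 1.0, and blocks without any disease-location word fall back to `alpha`.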
The text semantic blocking vector calculation method in the step (4) comprises the following steps:
(4-1) after the text is divided into blocks, each text semantic block is taken as one complete input to Doc2vec, and word vector training is carried out to obtain a Doc2vec model;
(4-2) each text semantic block of a document is converted into a corresponding vector by the Doc2vec model, so that the whole disease-condition document is converted into a sequence of block vectors. Let $w_m$ denote the weight of text semantic block $B_m$ and $\mathrm{vec}(B_m)$ denote the feature vector of $B_m$; the feature $F(D_k)$ of disease-condition document $D_k$ is then:
$F(D_k) = ((w_1, \mathrm{vec}(B_1)), (w_2, \mathrm{vec}(B_2)), \ldots, (w_m, \mathrm{vec}(B_m)))$.
For example, let document $D_k$ contain $m$ text semantic blocks, i.e., $D_k = \{B_1, B_2, B_3, \ldots, B_m\}$, where $B_i$ is text block $i$ and consists of several sentences or phrases. If the feature of text block $B_i$ is $F(B_i)$, then:
$F(B_i) = (w_{B_i}, \mathrm{vec}(B_i))$
where $w_{B_i}$ is the weight of text semantic block $B_i$ and $\mathrm{vec}(B_i)$ is the feature vector of $B_i$. The feature $F(D_k)$ of disease-condition document $D_k$ can then be written as:
$F(D_k) = (F(B_1), F(B_2), F(B_3), \ldots, F(B_m))$.
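The document feature $F(D_k)$ is simply a sequence of (weight, vector) pairs, which the Python sketch below makes concrete. To stay self-contained it does not train Doc2vec; instead `bag_of_words_embedder` is a hypothetical stand-in that maps a block to a word-count vector. In practice the `embed` argument would be whatever function returns the Doc2vec vector of a block.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
BlockFeature = Tuple[float, Vector]  # F(B_i) = (w_{B_i}, vec(B_i))


def bag_of_words_embedder(vocabulary: Sequence[str]) -> Callable[[Sequence[str]], Vector]:
    """Stand-in for a trained Doc2vec model: returns word-count vectors."""
    index = {word: i for i, word in enumerate(vocabulary)}

    def embed(words: Sequence[str]) -> Vector:
        vec = [0.0] * len(index)
        for word in words:
            if word in index:
                vec[index[word]] += 1.0
        return vec

    return embed


def document_features(
    blocks: Sequence[Sequence[str]],
    weights: Sequence[float],
    embed: Callable[[Sequence[str]], Vector],
) -> List[BlockFeature]:
    """Build F(D_k) as a list of (weight, vector) pairs, one per block."""
    if len(blocks) != len(weights):
        raise ValueError("one weight is required per block")
    return [(w, embed(block)) for w, block in zip(weights, blocks)]
```

Keeping the embedding function as a parameter means the feature layout stays the same whichever vector model is used.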
the method for calculating the text similarity in the step (6) measures the symptom similarity of the same disease location by calculating the cosine similarity of the text semantic block vectors, and weights are adopted to obtain the similarity of the Chinese medical condition texts.
The cosine similarity calculation method comprises the following steps:
Figure BDA0001635629390000061
Figure BDA0001635629390000062
where $B_{1p}$ is the text semantic block numbered $p$ in the first traditional Chinese medicine disease-condition text, and $B_{2q}$ is the text semantic block numbered $q$ in the second; $\mathrm{vec}(B_{1p})$ and $\mathrm{vec}(B_{2q})$ are the block vectors of $B_{1p}$ and $B_{2q}$ respectively; $w_{1p}$ is the weight of $B_{1p}$; $f(w_{1p}, w_{2q})$ is the weight of $\mathrm{Sim}(\mathrm{vec}(B_{1p}), \mathrm{vec}(B_{2q}))$; and $|\mathrm{vec}(B_{1p})|$ and $|\mathrm{vec}(B_{2q})|$ are the moduli of $\mathrm{vec}(B_{1p})$ and $\mathrm{vec}(B_{2q})$.
The value of $f(w_{1p}, w_{2q})$ has the following meaning: when block $B_{1p}$ and block $B_{2q}$ describe the same disease location, the cosine of the angle between the corresponding block vectors is calculated; otherwise, when the weight $w_{1p}$ of block $B_{1p}$ is not equal to the weight $w_{2q}$ of block $B_{2q}$, the two blocks do not describe the same disease location, so calculating their similarity has no value.
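A short Python sketch of the two quantities discussed here follows. `cosine_similarity` is the standard cosine of the angle between two vectors. The exact definition of $f(w_{1p}, w_{2q})$ is not reproduced in the text, so `pair_weight` is an assumed reading: it returns the shared weight when the two weights are equal and 0 otherwise.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pair_weight(w1p, w2q):
    """Assumed form of f(w1p, w2q): the common weight when the weights
    match (same disease location), otherwise 0 so the pair is ignored."""
    return w1p if math.isclose(w1p, w2q) else 0.0
```

One consequence of gating on weight equality is that two different disease locations with identical DF values would be treated as the same location; an implementation might compare the disease-location words directly to avoid this.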
The calculation method for obtaining the similarity of the Chinese medical condition texts by weighting the weights comprises the following steps:
Figure BDA0001635629390000071
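$\mathrm{Sim}(\mathrm{vec}(B_{1p}), \mathrm{vec}(B_{2q}))$ is the cosine similarity of the block vectors $\mathrm{vec}(B_{1p})$ and $\mathrm{vec}(B_{2q})$, and $m$ and $n$ are the numbers of text semantic blocks in the first and second traditional Chinese medicine disease-condition texts respectively.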
Suppose the block vector features $F(D_1)$ and $F(D_2)$ calculated for two traditional Chinese medicine disease-condition texts $D_1$ and $D_2$ are represented as follows:
$F(D_1) = ((w_{11}, \mathrm{vec}(B_{11})), (w_{12}, \mathrm{vec}(B_{12})), \ldots, (w_{1m}, \mathrm{vec}(B_{1m})))$
$F(D_2) = ((w_{21}, \mathrm{vec}(B_{21})), (w_{22}, \mathrm{vec}(B_{22})), \ldots, (w_{2n}, \mathrm{vec}(B_{2n})))$
where $B_{ij}$ is the text semantic block numbered $j$ in the $i$-th disease-condition text, $w_{ij}$ is the weight of text semantic block $B_{ij}$, $\mathrm{vec}(B_{ij})$ is the block vector of $B_{ij}$, and $m$ and $n$ are the numbers of text semantic blocks in disease-condition texts 1 and 2 respectively.
For the disease-condition texts $D_1$ and $D_2$, the degree of similarity between symptoms at the same disease location can be measured by calculating the cosine similarity of the semantic block vectors, and the similarity of the two texts is then obtained by weighting with the block weights.
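The weighted combination can be sketched in Python as below. The aggregation formula itself is not reproduced in the text, so this is one plausible form under loudly stated assumptions: pairs of blocks are matched when their weights are equal, each matched pair contributes its weight times the cosine similarity, and the sum is normalized by the mean of the two documents' total weights. The normalizer in particular is an assumption, not the patented formula.

```python
import math


def _cosine(u, v):
    """Cosine of the angle between vectors u and v (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def document_similarity(features_1, features_2):
    """Weighted similarity of two documents given as (weight, vector) lists.

    Assumed aggregation: sum of w * cos over block pairs with equal
    weights, divided by the mean of the two total weights.
    """
    numerator = 0.0
    for w1, v1 in features_1:
        for w2, v2 in features_2:
            if math.isclose(w1, w2):  # treated as "same disease location"
                numerator += w1 * _cosine(v1, v2)
    total_1 = sum(w for w, _ in features_1)
    total_2 = sum(w for w, _ in features_2)
    denominator = (total_1 + total_2) / 2.0
    return numerator / denominator if denominator else 0.0
```

Under this normalizer, two identical documents whose blocks all have distinct weights score 1.0, and a document that covers only some of the other's disease locations scores proportionally less.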
In the present invention, a traditional Chinese medicine disease-condition text contains symptom descriptions for several disease locations, and the symptom description of one disease location mostly corresponds to one or more phrases or sentences. A disease-condition text can therefore be regarded as a set of text semantic blocks, each describing the symptoms at one disease location.
The method for calculating the similarity of traditional Chinese medicine disease-condition texts overcomes the shortcomings of traditional text similarity methods, which lose semantic information or cannot highlight which causes of disease are primary and which are secondary. It represents the features of a disease-condition text with the text semantic block of a disease location as the minimum granularity, and calculates the similarity of two disease-condition texts by computing the cosine of the angle between the semantic block vectors for the same disease location in the two texts and weighting the results.
Drawings
FIG. 1 is a schematic flow chart of a calculation method provided by the present invention;
fig. 2 is a schematic diagram of a specific process for obtaining text semantic partitions in the calculation method provided by the present invention.
Detailed Description
In order to more specifically describe the present invention, the following detailed description is provided for the technical solution of the present invention with reference to the accompanying drawings and the specific embodiments. It is to be understood that such description is merely illustrative of the features and advantages of the present invention, and is not intended to limit the scope of the claims.
As shown in FIG. 1, a method for calculating the text similarity of traditional Chinese medicine disease conditions comprises the following steps.
(1) Obtain text blocks through rule-based and statistical phrase recognition.
The specific flow is shown in FIG. 2: a traditional Chinese medicine glossary is loaded into the word segmentation toolkit, the original disease-condition text is segmented with the word segmentation tool, and stop words are removed from the segmentation result using a stop-word lexicon. Word co-occurrence probabilities are then calculated, and two words are combined into a phrase, giving a text block, when their parts of speech conform to the Chinese phrase rules and their co-occurrence probability is greater than a given threshold.
(2) Divide the text blocks to obtain text semantic blocks.
Perform phrase recognition and phrase labeling to obtain disease-location phrases and description phrases, and combine the disease-location phrases with the description phrases to obtain text semantic blocks.
(3) Calculate the weight of each text semantic block.
(4) Calculate the text semantic block vectors.
(5) Combine the text semantic block features to obtain the disease-condition document features.
(6) Calculate the text similarity from the disease-condition document features.
Suppose there are two disease-condition texts, text A and text B, with the following contents:
Text A:
No jugular venous distension, red pharynx, no tonsil enlargement, slightly coarse breath sounds in both lungs, no obvious dry or moist rales heard; soft abdomen, no tenderness or rebound tenderness, no percussion pain over either kidney, no swelling of either lower limb.
Text B:
Febrile facies, red and congested pharynx, no obvious tonsil enlargement, coarse breath sounds in both lungs, no obvious dry or moist rales heard, increased lung markings.
The similarity of text A and text B is calculated as follows.
1. Word segmentation and stop-word removal in step (1), using the Ansj word segmentation tool, give the following texts:
Text A:
No jugular venous distension, red pharynx, no tonsil enlargement, coarse breath sounds in both lungs, no dry or moist rales; soft abdomen, no tenderness, rebound tenderness, no percussion pain over both kidneys, no swelling of both lower limbs.
Text B:
Febrile facies, red pharynx, congested, no tonsil enlargement, coarse breath sounds in both lungs, no dry or moist rales, increased markings.
2. Combining words through the Chinese phrase rules and the word co-occurrence probability calculation in step (1) gives the following results:
Text A:
{jugular vein} {no distension,} {pharynx red,} {tonsil} {no enlargement,} {both lungs} {breath sounds coarse,} {no} {dry and moist rales;} {abdomen soft,} {no tenderness} {rebound tenderness,} {both kidneys} {no percussion pain,} {both lower limbs} {no swelling.}
Text B:
{febrile facies,} {pharynx red} {congested,} {tonsil} {no enlargement,} {both lungs} {breath sounds coarse,} {no} {dry and moist rales,} {markings increased.}
3. Phrase recognition and phrase labeling in step (2) give the disease-location phrases and description phrases, which are combined into text semantic blocks. The jugular vein, pharynx, tonsil, lung, abdomen, kidney, and lower limb are all disease-location words, and the results are as follows:
Text A:
{jugular vein no distension,} {pharynx red,} {tonsil no enlargement,} {both lungs breath sounds coarse, no dry or moist rales;} {abdomen soft, no tenderness, rebound tenderness,} {both kidneys no percussion pain,} {both lower limbs no swelling.}
Text B:
{febrile facies,} {pharynx red, congested,} {tonsil no enlargement,} {both lungs breath sounds coarse, no dry or moist rales, markings increased.}
4. After the texts are divided into blocks, text A contains text semantic blocks for 7 disease locations and text B contains text semantic blocks for 4 disease locations. Assuming the corpus contains only the two texts A and B and α is 1, the number, weight, and vector of each block are shown in Table 1. The text semantic block vectors depend on the corpus used to train the Doc2vec model and must be calculated from the actual corpus.
Table 1. Weights and vectors of the text semantic blocks in text A and text B
Figure BDA0001635629390000111
5. Combine the text semantic block features of text A and text B in Table 1 to obtain the disease-condition document features.
6. Calculate the text similarity from the disease-condition document features.
The similarity of text A and text B is calculated according to the following formula.
Figure BDA0001635629390000112
In the text semantic blocks of the text A and the text B, A2 and B2 are text semantic blocks with the same disease position, A3 and B3 are text semantic blocks with the same disease position, and A4 and B4 are text semantic blocks with the same disease position.
Then
Figure BDA0001635629390000121
Wherein:
Figure BDA0001635629390000122
is the cosine of the angle between the vectors of text semantic blocks A2 and B2;
Figure BDA0001635629390000123
is the cosine of the angle between the vectors of text semantic blocks A3 and B3;
Figure BDA0001635629390000124
is the cosine of the angle between the vectors of text semantic blocks A4 and B4.
The embodiments described above are intended to help those of ordinary skill in the art understand and use the invention. Those skilled in the art can readily make various modifications to these embodiments and apply the general principles described here to other embodiments without inventive effort. The present invention is therefore not limited to the above embodiments, and improvements and modifications made by those skilled in the art on the basis of this disclosure shall fall within the protection scope of the present invention.

Claims (5)

1. A method for calculating the text similarity of the traditional Chinese medical condition comprises the following steps:
(1) based on the phrase identification of rules and statistics, text blocks are obtained from the original Chinese medical condition text: loading a traditional Chinese medicine glossary to a word segmentation toolkit, and segmenting words of the original traditional Chinese medicine illness state text by using a word segmentation tool; removing stop words in the word segmentation result by adopting a stop word library; performing word co-occurrence probability calculation, and combining two words into a phrase to obtain a text block when the parts of speech of the two words accord with a Chinese phrase rule and the co-occurrence probability is greater than a given threshold value;
(2) dividing text blocks to obtain text semantic blocks: carrying out phrase identification and phrase marking on the text block in the step (1) to obtain a disease position phrase and a description phrase, and combining the disease position phrase and the description phrase to obtain a text semantic block;
(3) calculating the weight of the text semantic block;
(4) calculating text semantic blocking vectors;
(5) combining the weights of the text semantic blocks and the text semantic block vectors respectively obtained in the steps (3) and (4) to obtain text semantic block characteristics, and combining a plurality of text semantic block characteristics to obtain illness state document characteristics;
(6) calculating text similarity according to the characteristics of the disease documents;
the method for calculating the text similarity in the step (6) comprises the following steps: measuring the symptom similarity degree of the same disease location by calculating the cosine similarity of the semantic block vectors, and weighting by adopting weight to obtain the similarity of the Chinese medical condition texts;
the cosine similarity calculation method comprises the following steps:
Figure FDA0002413880920000011
Figure FDA0002413880920000021
wherein $B_{1p}$ is the text semantic block numbered $p$ in the first traditional Chinese medicine disease-condition text, and $B_{2q}$ is the text semantic block numbered $q$ in the second traditional Chinese medicine disease-condition text; $w_{1p}$ is the weight of $B_{1p}$; $f(w_{1p}, w_{2q})$ is the weight of $\mathrm{Sim}(\mathrm{vec}(B_{1p}), \mathrm{vec}(B_{2q}))$; $|\mathrm{vec}(B_{1p})|$ and $|\mathrm{vec}(B_{2q})|$ are the moduli of $\mathrm{vec}(B_{1p})$ and $\mathrm{vec}(B_{2q})$;
the calculation method for obtaining the similarity of the Chinese medical condition texts by weighting the weights comprises the following steps:
Figure FDA0002413880920000022
$\mathrm{Sim}(\mathrm{vec}(B_{1p}), \mathrm{vec}(B_{2q}))$ is the cosine similarity of the block vectors $\mathrm{vec}(B_{1p})$ and $\mathrm{vec}(B_{2q})$, and $m$ and $n$ are the numbers of text semantic blocks in the first and second traditional Chinese medicine disease-condition texts respectively;
$D_1$ is the first disease-condition text, $D_2$ is the second disease-condition text, $w_{2q}$ is the weight of $B_{2q}$, $i \in (1, 2, 3, \ldots, m)$, $j \in (1, 2, 3, \ldots, n)$, $w_{1i}$ is the weight of the $i$-th text semantic block in the first disease-condition text, and $w_{2j}$ is the weight of the $j$-th text semantic block in the second disease-condition text.
2. The method for calculating the text similarity of traditional Chinese medicine disease conditions according to claim 1, wherein a text semantic block comprises one or more phrases or sentences; the phrases or sentences in a text semantic block describe the same disease location, symptom, or thing; and the phrases or sentences in a text semantic block are adjacent in position.
3. The method for calculating the text similarity of traditional Chinese medicine disease conditions according to claim 1, wherein the method for phrase recognition and phrase labeling in step (2) comprises: matching the words in a phrase against the words in the disease-location lexicon, labeling the phrase as a disease-location phrase if the match succeeds, and otherwise labeling the phrase as a description phrase.
4. The method for calculating the text similarity of traditional Chinese medicine disease conditions according to claim 1, wherein the weight of a text semantic block in step (3) is the weight the block carries when the similarity of disease-condition documents is calculated, and the text semantic block contains a disease-location word; the weight of the text semantic block is given by the weight of the disease-location word; and the weight of the disease-location word is obtained by calculating the document frequency (DF) value of the disease-location word in the corpus.
5. The method for calculating text similarity of traditional Chinese medicine disease conditions according to claim 1, wherein the method for calculating the text semantic block vectors in step (4) comprises the following steps:
(1) after the text is divided into blocks, taking each text semantic block as one complete input to Doc2vec and carrying out word vector training to obtain a Doc2vec model;
(2) converting each text semantic block of the document into a corresponding vector through the Doc2vec model, thereby converting the whole disease condition document into a sequence of block vectors; letting $w_m$ denote the weight of text semantic block $B_m$ and $\mathrm{vec}(B_m)$ denote the feature vector of text semantic block $B_m$, the feature $F(D_k)$ of disease condition document $D_k$ is:
$F(D_k) = ((w_1, \mathrm{vec}(B_1)), (w_2, \mathrm{vec}(B_2)), \ldots, (w_m, \mathrm{vec}(B_m)))$.
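The document feature in claim 5 is a sequence of (weight, block vector) pairs, which can be sketched as follows. This is a minimal sketch, not the patented implementation: `document_feature` and `toy_embed` are hypothetical names, and `toy_embed` is only a stand-in for inference with a trained Doc2vec model (for example, gensim's `Doc2Vec.infer_vector`), left out here so the example stays self-contained.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def document_feature(
    blocks: Sequence[Sequence[str]],
    weights: Sequence[float],
    embed: Callable[[Sequence[str]], Vector],
) -> List[Tuple[float, Vector]]:
    """Build F(D_k) = ((w_1, vec(B_1)), ..., (w_m, vec(B_m)))."""
    if len(blocks) != len(weights):
        raise ValueError("each block needs exactly one weight")
    return [(w, embed(block)) for w, block in zip(weights, blocks)]


def toy_embed(block_words: Sequence[str]) -> Vector:
    """Hypothetical stand-in for a Doc2vec model; NOT a real embedding.

    Returns [number of words, total characters] so the example is deterministic.
    """
    return [float(len(block_words)), float(sum(len(w) for w in block_words))]
```

Passing the embedding function as a parameter keeps the feature construction independent of how the block vectors are produced, so a trained model can be substituted without changing the rest of the code.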
CN201810359667.9A 2018-04-20 2018-04-20 Method for calculating text similarity of traditional Chinese medicine disease conditions Active CN108647203B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810359667.9A CN108647203B (en) 2018-04-20 2018-04-20 Method for calculating text similarity of traditional Chinese medicine disease conditions

Publications (2)

Publication Number Publication Date
CN108647203A CN108647203A (en) 2018-10-12
CN108647203B true CN108647203B (en) 2020-07-07

Family

ID=63746741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810359667.9A Active CN108647203B (en) 2018-04-20 2018-04-20 Method for calculating text similarity of traditional Chinese medicine disease conditions

Country Status (1)

Country Link
CN (1) CN108647203B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108831560B (en) * 2018-06-21 2020-09-22 北京嘉和海森健康科技有限公司 Method and device for determining medical data attribute data
CN109977406A (en) * 2019-03-26 2019-07-05 浙江大学 A kind of Chinese medicine state of an illness text key word extracting method based on sick position
CN111341437B (en) * 2020-02-21 2022-02-11 山东大学齐鲁医院 Digestive tract disease judgment auxiliary system based on tongue image
CN112349423B (en) * 2020-11-04 2024-05-24 吾征智能技术(北京)有限公司 BiMPM method-based mouth drying information matching system
CN117648409B (en) * 2024-01-30 2024-04-05 北京点聚信息技术有限公司 OCR-based format file anti-counterfeiting recognition method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003122845A (en) * 2001-10-09 2003-04-25 Shinkichi Himeno Retrieval system for medical information, and program for carrying out the system
CN102214232A (en) * 2011-06-28 2011-10-12 东软集团股份有限公司 Method and device for calculating similarity of text data
CN103617157B (en) * 2013-12-10 2016-08-17 东北师范大学 Based on semantic Text similarity computing method
CN104636430B (en) * 2014-12-30 2018-03-13 东软集团股份有限公司 Case base represents and case similarity acquisition methods and system
CN106021871A (en) * 2016-05-10 2016-10-12 深圳前海信息技术有限公司 Disease similarity calculation method and device based on big data group behaviors

Also Published As

Publication number Publication date
CN108647203A (en) 2018-10-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant