AU2021104693A4 - An approach and device and system for extracting diseases and causes in medical texts - Google Patents
An approach and device and system for extracting diseases and causes in medical texts
- Publication number
- AU2021104693A4
- Authority
- AU
- Australia
- Prior art keywords
- causes
- diseases
- extraction
- relation words
- unit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
- 208000037265 diseases, disorders, signs and symptoms Diseases 0.000 title claims abstract description 61
- 201000010099 disease Diseases 0.000 title claims abstract description 59
- 238000013459 approach Methods 0.000 title claims abstract description 5
- 238000000605 extraction Methods 0.000 claims abstract description 36
- 239000012535 impurity Substances 0.000 claims abstract description 5
- 238000000034 method Methods 0.000 claims description 12
- 238000004590 computer program Methods 0.000 claims description 6
- 230000011218 segmentation Effects 0.000 claims description 3
- 238000001914 filtration Methods 0.000 claims 2
- 230000014509 gene expression Effects 0.000 claims 1
- 238000010586 diagram Methods 0.000 abstract description 4
- 239000000284 extract Substances 0.000 abstract description 3
- 238000003058 natural language processing Methods 0.000 abstract description 3
- 238000004364 calculation method Methods 0.000 abstract description 2
- 238000005065 mining Methods 0.000 description 2
- 208000024891 symptom Diseases 0.000 description 2
- 238000004458 analytical method Methods 0.000 description 1
- 238000004422 calculation algorithm Methods 0.000 description 1
- 230000001364 causal effect Effects 0.000 description 1
- 238000002512 chemotherapy Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 238000003745 diagnosis Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 230000006870 function Effects 0.000 description 1
- 239000003607 modifier Substances 0.000 description 1
- 238000007781 pre-processing Methods 0.000 description 1
- 238000001959 radiotherapy Methods 0.000 description 1
- 238000011160 research Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/416—Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Public Health (AREA)
- Multimedia (AREA)
- Pathology (AREA)
- Databases & Information Systems (AREA)
- Epidemiology (AREA)
- Biomedical Technology (AREA)
- Primary Health Care (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- Medical Treatment And Welfare Office Work (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The present invention discloses an approach, device and system for extracting diseases and
causes from medical texts, in the field of information extraction within natural language
processing. The system mainly comprises a reading module, a calculation module and a
display module. The reading module reads the input medical texts. The calculation module
comprises a relation words extraction unit, a diseases extraction unit and a causes extraction
unit. The method mainly includes the following steps: a. the system reads a set of accurate
diseases and causes; b. relation words are learned from the correct diseases and causes; c.
causes are learned from the diseases and relation words; d. diseases are learned from the
causes and relation words. The extraction results are then evaluated: if the number of entities
has increased compared with the existing set, the iteration continues; otherwise it ends. The
display module mainly comprises a storage unit and an output unit. The invention takes published
unstructured medical text as its data source and aims to achieve accurate extraction of diseases
and causes from medical text.
DRAWINGS
Figure 1: System Block Diagram. The diseases and causes extraction system comprises a reading module (text input unit), a compute module (diseases extraction unit, relation words extraction unit, causes extraction unit) and a display module (memory unit, display unit).
Figure 2: Flow chart of diseases and causes extraction method. The boxes of the flow chart read, in order:
Reading medical text
When the system matches the corresponding sentence pattern, it first extracts the relation words of the sentence pattern according to the diseases and causes, then removes impurities and stores them
When the relation words of the sentence structure are extracted, the diseases matching the sentence structure are extracted according to the relation words and causes, and the impurities are removed and the results stored
When the disease of the sentence structure is extracted, the cause of the sentence structure is extracted according to the relation words and disease, and the impurities are removed and the results stored
Display of Causes
Description
[0001] The invention relates to the technical field of natural language processing information extraction.
[0002] In recent years, a large number of medical texts have accumulated. They mainly include
professional textbooks, professional medical websites, medical dictionaries, electronic medical records, and papers in
medical research journals. These texts contain rich medical data, mainly information on
disease etiology, symptoms, treatment, diagnosis, and so on. However, most of this data exists in
semi-structured or unstructured form, and current natural language processing and information extraction
techniques are not yet mature enough to extract complete and accurate information from unstructured text.
Existing companies and products are not yet able to extract disease causative factors and etiologies
accurately at a scale of tens of thousands. The present invention focuses on analyzing common sentence patterns in medical
texts, formalizing the sentences mathematically, and designing an iterative algorithm and procedure that can obtain
tens of thousands of accurate disease causative factors and etiologies from medical texts.
[0003] With the continuous development of computer technology, text mining systems have been implemented. For example,
one text mining approach and system based on unstructured electronic medical records (EMR) includes a text preprocessing module, a feature
engineering module, and an analysis and prediction module. The main features extracted by that invention include
symptoms, examination findings, radiotherapy and chemotherapy schemes, curative effect evaluation, etc. That
patent uses time nodes to segment hospitalization records, extracts features by looking up disease information in a
rule base, and finally performs text clustering through unsupervised clustering. Because it is based on time-node
segmentation, the complete semantics of a sentence are not taken into account. In addition, the input text only includes the
medical records in the hospital database, so the range of data sources is small.
[0004] Entity recognition in the medical domain faces many difficulties, mainly in the following aspects.
Regarding the extraction process: the medical domain usually contains a rich variety of entity classes; the many different modifiers and qualifiers in an entity's context make entity boundaries harder to determine and delineate; the entities to be extracted are usually expressed in many different descriptive ways; and the length of causative entities is usually hard to determine. Regarding the extraction results: the number of extracted causal factors and etiologies is small, typically a few thousand to tens of thousands, and does not reach the scale of tens to hundreds of thousands. The diseases involved number only a few thousand and do not reach the scale of tens of thousands.
[0005] The main objective is to provide a method, apparatus and system for extracting diseases and causes
from medical texts that solve the problems presented in the background above.
[0006] Another objective is to extract multiple disease and cause entities from unstructured medical text.
[0007] To achieve the above purpose, the present invention provides a method for extracting diseases and
causes. The method mainly includes: Step 1: acquiring medical text and the sentence structures of diseases and causes;
Step 2: acquiring the relation words of each sentence structure from the diseases and causes; Step 3: removing impurities from the relation
words and saving them into the existing relation word set; Step 4: learning the causes of each sentence structure from
the diseases and relation words; Step 5: removing impurities from the causes and saving them into the cause set; Step 6:
learning the diseases from the causes and relation words; Step 7: removing impurities from the diseases and saving them into
the disease set.
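Steps 2 to 7 can be illustrated with a minimal Python sketch of a single pass. It is not the patented implementation: it assumes one hypothetical sentence pattern of the form "<cause> <relation words> <disease>", uses English toy sentences instead of Chinese medical text, and stands in a simple length check for the impurity removal, whose actual rules the description does not give. All function names are illustrative.

```python
def learn_relation_words(sentences, diseases, causes):
    """Steps 2-3: collect the text between a known cause and a known disease."""
    found = set()
    for s in sentences:
        for c in causes:
            for d in diseases:
                if s.startswith(c + " ") and s.endswith(" " + d):
                    middle = s[len(c):len(s) - len(d)].strip()
                    if middle:
                        found.add(middle)
    return found


def learn_causes(sentences, diseases, relations):
    """Steps 4-5: what precedes known relation words and a known disease is a cause."""
    found = set()
    for s in sentences:
        for r in relations:
            for d in diseases:
                tail = f" {r} {d}"
                if s.endswith(tail):
                    found.add(s[:-len(tail)])
    return found


def learn_diseases(sentences, causes, relations):
    """Steps 6-7: what follows a known cause and known relation words is a disease."""
    found = set()
    for s in sentences:
        for r in relations:
            for c in causes:
                head = f"{c} {r} "
                if s.startswith(head):
                    found.add(s[len(head):])
    return found


def remove_impurities(candidates, max_words=4):
    """Placeholder filter (assumption): drop empty or overly long candidates."""
    return {c.strip() for c in candidates if c.strip() and len(c.split()) <= max_words}


# Toy input (Step 1): sentences plus one seed disease and one seed cause.
sentences = [
    "smoking can lead to lung cancer",
    "asbestos exposure can lead to lung cancer",
    "asbestos exposure can lead to mesothelioma",
]
diseases = {"lung cancer"}
causes = {"smoking"}

relations = remove_impurities(learn_relation_words(sentences, diseases, causes))
causes |= remove_impurities(learn_causes(sentences, diseases, relations))
diseases |= remove_impurities(learn_diseases(sentences, causes, relations))
```

In this toy run the seed pair yields the relation words "can lead to", which in turn surface one new cause and one new disease. The order of the last three lines follows Steps 2 to 7 as listed above.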
[0008] Corresponding to the method, the invention also provides a disease and cause extraction system. The
system includes a text input unit, a diseases extraction unit, a relation words extraction unit, a causes extraction unit and
a storage unit. The text input unit reads unstructured medical text; the diseases extraction unit
extracts disease entities through the relation words and the causes; the relation words extraction unit
extracts relation words through the diseases and causes; the causes extraction unit extracts cause
entities through the relation words and diseases; and the storage unit stores the results in structured form.
[0009] Corresponding to the system, an embodiment of the invention provides a device for extracting diseases and causes. The device comprises a memory, a processor and a computer program that is stored in the memory and can be run on the processor. When the processor executes the program, it implements a system for extracting disease inducements and etiologies from medical text. An embodiment of the invention also provides a computer-readable storage medium that stores a computer program; when the program is executed by a processor, a system for extracting diseases and causes from medical texts is implemented.
[0010] Figure 1: System block diagram.
[0011] Figure 2: Flow chart of the diseases and causes extraction method.
[0012] Figure 3: The overall flow chart of the invention.
[0013] Figure 1 is the block diagram of the system for extracting diseases and causes from medical text.
First, the unstructured medical text is input to the system through the text input unit in the reading module. Then,
in the calculation module, the corresponding entity words are extracted by the diseases extraction unit, the relation words
extraction unit and the causes extraction unit. Finally, the extracted entities are structured and stored through the
storage unit in the display module.
[0014] The diseases and causes extraction device includes an acquisition device, a processor, a memory and a
computer program stored in the memory and running on the processor. When the processor executes the computer
program, it implements the steps of the method for extracting diseases and causes, such as those shown in Fig. 2,
and thereby realizes the functions of the modules or units in the above device.
[0015] Figure 3 shows a flow chart of a method for extracting diseases and causes from medical texts. In this
method, the model is trained several times by iteration, and the threshold parameters are updated to obtain the
optimal model. The method obtains the entity to be extracted by combining the n-i semantic elements in the
sentence pattern, which not only improves the accuracy of entity extraction but also effectively
addresses the problem that the length of an entity is hard to determine.
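The iteration described here, together with the stopping rule given in the abstract (continue while the number of entities grows, otherwise end), can be sketched as follows. This is only an illustration under stated assumptions: each sentence is assumed to have already been matched to a pattern and split into a hypothetical (cause, relation words, disease) triple, and `min_support` is an invented stand-in for the threshold parameters, whose actual definition the description does not specify.

```python
from collections import Counter


def bootstrap(triples, seed_diseases, seed_causes, min_support=1, max_rounds=10):
    """Alternately grow relation words, causes and diseases until nothing new is found."""
    diseases, causes, relations = set(seed_diseases), set(seed_causes), set()
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        before = len(diseases) + len(causes)

        # Relation words are accepted only if enough known pairs support them
        # (hypothetical threshold, assumed for this sketch).
        support = Counter(r for c, r, d in triples if c in causes and d in diseases)
        relations |= {r for r, n in support.items() if n >= min_support}

        # Causes from known diseases and relation words, then diseases from
        # known causes and relation words.
        causes |= {c for c, r, d in triples if r in relations and d in diseases}
        diseases |= {d for c, r, d in triples if r in relations and c in causes}

        # Stopping rule: end when the entity count did not increase.
        if len(diseases) + len(causes) == before:
            break
    return diseases, causes, relations, rounds


triples = [
    ("smoking", "can lead to", "lung cancer"),
    ("asbestos exposure", "can lead to", "lung cancer"),
    ("asbestos exposure", "can lead to", "mesothelioma"),
    ("radon", "is a cause of", "lung cancer"),
]
result = bootstrap(triples, {"lung cancer"}, {"smoking"})
strict = bootstrap(triples, {"lung cancer"}, {"smoking"}, min_support=2)
```

With the default threshold the loop stops in its second round, when no new entity appears. The triple using "is a cause of" is never reached because no seed pair exhibits that relation, and raising `min_support` to 2 leaves only the seeds, which shows how sensitive such a loop is to its seeds and thresholds.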
Claims (6)
1. An approach for extracting disease inducements and causes from medical text, consisting of: obtaining and storing medical texts and sentence patterns; extracting relation words and removing impurities from them, the cleaned relation words being incorporated into the existing relation word set; extracting and verifying causes, the verified causes being incorporated into the cause set; and extracting and verifying diseases, the verified diseases being incorporated into the disease set.
2. The approach according to claim 1, wherein the impurity removal of relation words uses the HanLP word segmentation tool for word segmentation and stop-word filtering, and then applies specific threshold filter conditions for further filtering.
3. The approach according to claim 1, wherein medical text acquisition obtains Chinese sentences from unstructured text through regular expressions.
4. The approach according to claim 1, wherein the disease and cause extraction model is trained many times by iteration, a threshold parameter setting is introduced, and the optimal model is finally obtained by adjusting the parameters.
5. A device for extracting diseases and causes from medical text, comprising an extractor, a processor, a memory and a computer program that is stored in the memory and can be run on the processor, wherein, when the processor executes the computer program, it implements the steps of the method of claim 1.
6. A system for extracting disease inducements and causes from medical text, consisting of: a text input unit; a diseases extraction unit; a relation words extraction unit; a causes extraction unit; a memory unit; and a display unit.
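The text acquisition and impurity removal named in claims 3 and 2 can be sketched as below. This is a hedged illustration, not the claimed implementation: the regular expression, the stop-word list, the frequency threshold `min_count` and the tiny dictionary-based `toy_segment` function are all assumptions. In particular, `toy_segment` merely stands in for a real segmenter such as HanLP, whose API is deliberately not called here.

```python
import re
from collections import Counter

# Assumed pattern: a run of CJK characters and Chinese commas or similar marks,
# closed by a Chinese sentence-ending mark.
SENTENCE_RE = re.compile(r"[\u4e00-\u9fff，、；：]+[。！？]")


def extract_chinese_sentences(raw_text):
    """Return the Chinese sentences found in unstructured text, in order."""
    return SENTENCE_RE.findall(raw_text)


def toy_segment(text, vocab=("导致", "引起", "可", "会", "的")):
    """Stand-in segmenter: greedy longest match against a tiny vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for word in sorted(vocab, key=len, reverse=True):
            if text.startswith(word, i):
                tokens.append(word)
                i += len(word)
                break
        else:
            tokens.append(text[i])  # unknown character becomes its own token
            i += 1
    return tokens


def filter_relation_words(candidates, segment, stop_words, min_count=2):
    """Segment each candidate, drop stop words, keep forms seen at least min_count times."""
    counts = Counter()
    for candidate in candidates:
        tokens = [t for t in segment(candidate) if t not in stop_words]
        if tokens:
            counts["".join(tokens)] += 1
    return {word for word, n in counts.items() if n >= min_count}


raw_text = "<p>吸烟可导致肺癌。</p> page 12 肥胖是糖尿病的危险因素！abc"
sentences = extract_chinese_sentences(raw_text)

candidates = ["可导致", "会导致", "导致", "的引起"]
kept = filter_relation_words(candidates, toy_segment, stop_words={"可", "会", "的"})
```

Here the markup and the Latin-script residue are skipped by the pattern, and only the relation word that survives stop-word removal often enough is kept. A frequency cut-off is just one plausible reading of "threshold filter conditions"; the claims do not say which quantity is thresholded.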
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2021104693A AU2021104693A4 (en) | 2021-07-29 | 2021-07-29 | An approach and device and system for extracting diseases and causes in medical texts |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2021104693A AU2021104693A4 (en) | 2021-07-29 | 2021-07-29 | An approach and device and system for extracting diseases and causes in medical texts |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2021104693A4 (en) | 2021-09-30 |
Family
ID=77857715
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2021104693A Ceased AU2021104693A4 (en) | 2021-07-29 | 2021-07-29 | An approach and device and system for extracting diseases and causes in medical texts |
Country Status (1)
Country | Link |
---|---|
AU (1) | AU2021104693A4 (en) |
-
2021
- 2021-07-29 AU AU2021104693A patent/AU2021104693A4/en not_active Ceased
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111274806B (en) | | Method and device for recognizing word segmentation and part of speech and method and device for analyzing electronic medical record |
CN109190117B (en) | | Short text semantic similarity calculation method based on word vector |
CN107122340B (en) | | A kind of similarity detection method of the science and technology item return based on synonym analysis |
CN108108426B (en) | | Understanding method and device for natural language question and electronic equipment |
CN110931128B (en) | | Method, system and device for automatically identifying unsupervised symptoms of unstructured medical texts |
CN109920540A (en) | | Construction method, device and the computer equipment of assisting in diagnosis and treatment decision system |
JP2022042497A (en) | | Automatically generating pipeline of new machine learning project from pipeline of existing machine learning project stored in corpus |
CN109522396B (en) | | Knowledge processing method and system for national defense science and technology field |
CN106528527A (en) | | Identification method and identification system for out of vocabularies |
US20220067054A1 (en) | | Automatically curating existing machine learning projects into a corpus adaptable for use in new machine learning projects |
CN103440315A (en) | | Web page cleaning method based on theme |
CN115578137A (en) | | Agricultural product future price prediction method and system based on text mining and deep learning model |
JP2020106880A (en) | | Information processing apparatus, model generation method and program |
CN113127607A (en) | | Text data labeling method and device, electronic equipment and readable storage medium |
CN115146062A (en) | | Intelligent event analysis method and system fusing expert recommendation and text clustering |
CN112151186A (en) | | Method, device and system for extracting disease causes and disease causes from medical texts |
CN115130038A (en) | | Webpage classification method and device |
JP2022042496A (en) | | Automatic labeling of function blocks in pipeline of existing machine learning project in corpuses applicable for use in new machine learning project |
CN116663536B (en) | | Matching method and device for clinical diagnosis standard words |
CN116579429A (en) | | Building environment knowledge graph construction method and device |
AU2021104693A4 (en) | | An approach and device and system for extracting diseases and causes in medical texts |
CN113157946B (en) | | Entity linking method, device, electronic equipment and storage medium |
CN112988999B (en) | | Method, device, equipment and storage medium for constructing Buddha study answer pairs |
CN111341404B (en) | | Electronic medical record data set analysis method and system based on ernie model |
JP2021099805A (en) | | Device and method for processing digital data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGI | Letters patent sealed or granted (innovation patent) | ||
MK22 | Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry |