AU2021104693A4

AU2021104693A4 - An approach and device and system for extracting diseases and causes in medical texts

Info

Publication number: AU2021104693A4
Application number: AU2021104693A
Authority: AU
Inventors: Honghai Feng
Original assignee: Individual
Current assignee: Individual
Priority date: 2021-07-29
Filing date: 2021-07-29
Publication date: 2021-09-30
Anticipated expiration: 2029-07-29

Abstract

The present invention discloses an approach and device and system for extracting diseases and causes in medical texts, specifically in the field of natural language processing information extraction. The present invention mainly includes a reading module, a compute module and a display module. The reading module reads the input medical texts. The compute module mainly includes a relation words extraction unit, a diseases extraction unit, and a causes extraction unit. The method mainly includes the following steps: a. the system reads the accurate diseases and causes; b. the relation words are learned from the correct diseases and causes; c. the causes are learned from the diseases and relation words; d. the diseases are learned from the causes and relation words. The extraction results are evaluated; if the number of entities has increased compared with the existing ones, the iteration continues, otherwise it ends. The display module mainly contains a storage unit and an output unit. The invention takes published unstructured medical text as its data source and finally realizes the accurate extraction of diseases and causes in medical text.

Description

DRAWINGS

Diseases and causes extraction system

Reading module: text input unit

Compute module: diseases extraction unit, relation words extraction unit, causes extraction unit

Display module: memory unit, display unit

Figure 1: System Block Diagram

Reading medical text

When the system matches the corresponding sentence pattern, it will first extract the relation words of the sentence pattern according to the disease and causes, and then remove and store them

When the relation words of the sentence structure are extracted, the diseases in accordance with the sentence structure will be extracted, removed and stored according to the relation words and causes

When the disease of the sentence structure is extracted, the cause of the sentence structure will be extracted according to the relation words and disease, and the impurity will be removed and stored

Display of Causes

Figure 2: Flow chart of diseases and causes extraction method

AN APPROACH AND DEVICE AND SYSTEM FOR EXTRACTING DISEASES AND CAUSES IN MEDICAL TEXTS

FIELD OF INVENTION

[0001] The invention relates to the technical field of natural language processing information extraction.

BACKGROUND OF THE INVENTION

[0002] In recent years, a large number of medical texts have been accumulated. Medical texts mainly include professional textbooks, professional medical websites, medical dictionaries, electronic cases, and papers in medical research journals. These medical texts contain rich medical data, mainly information on disease etiology, symptoms, treatment, diagnosis, and so on. However, most of this massive body of data exists in semi-structured or unstructured form, and current natural language processing and information extraction techniques are not yet mature enough to extract complete and accurate information from unstructured texts. Existing companies and products are not yet able to accurately extract disease causative factors and etiologies at a scale of tens of thousands. The present invention focuses on analyzing common sentences in medical texts, formalizing the sentences mathematically, and designing an iterative algorithm and procedure that can iteratively obtain tens of thousands of accurate disease causative factors and etiologies from medical texts.

[0003] With the continuous development of computing, text mining systems have been implemented. For example, one text mining approach and system based on unstructured electronic medical records (EMR) includes a text preprocessing module, a feature engineering module, and an analysis and prediction module. The main features extracted by that invention include symptoms, examination findings, radiotherapy and chemotherapy schemes, curative effect evaluation, etc. That patent uses time nodes to segment hospitalization records, extracts features by retrieving disease information from a rule base, and finally performs text clustering through unsupervised clustering. Because that patent is based on time-node segmentation, the complete semantics of each sentence are not taken into account. Its input text includes only the medical records in the hospital database, so the range of data sources is small.

[0004] The medical domain faces many difficulties in the recognition task, mainly in the following aspects. From the extraction process: the medical domain usually contains a rich class of entities; there are many different modifiers and qualifiers in the entity context, which makes the boundaries of the entities harder to determine and delineate; the entities to be extracted are usually expressed in many different descriptive ways; and the length of the causative entities is usually hard to determine. From the extraction results: the number of extracted causal factors and etiologies is small, from a few thousand to at most tens of thousands, not reaching the scale of tens to hundreds of thousands. The diseases involved number only a few thousand, not reaching the scale of tens of thousands.

OBJECTIVE OF THE INVENTION

[0005] The main objective is to provide a method, apparatus and system for extracting diseases and causes in medical texts and to solve the problems presented in the above-mentioned background technology.

[0006] Another objective is to extract disease and cause entities from unstructured medical text.

SUMMARY OF THE INVENTION

[0007] To achieve the above purpose, the present invention provides a method for extracting diseases and causes. This method mainly includes:

Step 1: acquiring medical text and the sentence structures of diseases and causes;

Step 2: acquiring the relation words of each sentence structure from the diseases and causes;

Step 3: removing the impurities of the relation words and saving them into the existing relation words set;

Step 4: learning the causes of each sentence structure from the diseases and relation words;

Step 5: removing the impurities of the causes and saving them into the cause set;

Step 6: learning the diseases from the causes and relation words;

Step 7: removing the impurities of the diseases and saving them into the disease set.
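The seven steps above can be sketched as a small bootstrapping loop. This is only an illustrative sketch under stated assumptions, not the patented implementation: the function names (`remove_impurities`, `extract_relation_words`, `split_on_relation`, `bootstrap`) are hypothetical, the single sentence pattern "cause + relation word + disease" is assumed, toy English sentences stand in for Chinese medical text, and impurity removal is reduced to trimming whitespace.

```python
def remove_impurities(candidates):
    """Placeholder for Steps 3, 5 and 7: trim whitespace and drop empty strings."""
    return {c.strip() for c in candidates if c.strip()}


def extract_relation_words(sentences, diseases, causes):
    """Step 2: text between a known cause and a known disease is a candidate relation word."""
    found = set()
    for s in sentences:
        for c in causes:
            for d in diseases:
                if s.startswith(c) and s.endswith(d) and len(s) > len(c) + len(d):
                    found.add(s[len(c):len(s) - len(d)])
    return remove_impurities(found)


def split_on_relation(sentence, relations):
    """Split an assumed 'cause relation disease' sentence on the longest matching relation word."""
    for r in sorted(relations, key=len, reverse=True):
        left, sep, right = sentence.partition(f" {r} ")
        if sep and left.strip() and right.strip():
            return left, right
    return None


def bootstrap(sentences, seed_diseases, seed_causes, max_rounds=10):
    """Iterate Steps 2-7 until no set grows (or max_rounds is reached)."""
    diseases, causes, relations = set(seed_diseases), set(seed_causes), set()
    for _ in range(max_rounds):
        before = len(diseases) + len(causes) + len(relations)
        # Steps 2-3: learn relation words from known diseases and causes.
        relations |= extract_relation_words(sentences, diseases, causes)
        pairs = [p for p in (split_on_relation(s, relations) for s in sentences) if p]
        # Steps 4-5: learn causes from known diseases and relation words.
        causes |= remove_impurities(c for c, d in pairs if d.strip() in diseases)
        # Steps 6-7: learn diseases from known causes and relation words.
        diseases |= remove_impurities(d for c, d in pairs if c.strip() in causes)
        if len(diseases) + len(causes) + len(relations) == before:
            break  # no new entities were found, so the iteration ends
    return diseases, causes, relations


sentences = [
    "smoking causes lung cancer",
    "smoking causes emphysema",
    "smoking leads to emphysema",
    "asbestos leads to lung cancer",
]
diseases, causes, relations = bootstrap(sentences, {"lung cancer"}, {"smoking"})
```

Starting from one seed disease and one seed cause, the loop first learns the relation word "causes", uses it to pick up a new disease, then learns a second relation word from the enlarged sets, which in turn yields a new cause. The stopping rule mirrors the abstract: the iteration continues only while the number of entities increases.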

[0008] Corresponding to the method, the invention also provides a disease and cause extraction system. The system includes a text input unit, a diseases extraction unit, a relation words extraction unit, a causes extraction unit and a storage unit. The text input unit is used by the system to read unstructured medical text; the diseases extraction unit is used to extract the disease entities through the relation words and the causes; the relation words extraction unit is used to extract the relation words through the diseases and causes; the causes extraction unit is used to extract the cause entities through the relation words and diseases; the storage unit is used for the structured storage of the results.

[0009] Corresponding to the system, the present implementation of the invention provides a device for the extraction of diseases and causes. The device comprises a memory, a processor and a computer program that is stored in the memory and can be run on the processor. When the processor executes the program, it realizes a system for extracting disease inducements and etiologies from medical text. The embodiment of the invention also provides a computer-readable storage medium that stores a computer program; when the program is executed by a processor, a system for extracting diseases and causes from medical texts is realized.

BRIEF DESCRIPTION OF THE DRAWINGS

[0010] Figure 1: System Block Diagram.

[0011] Figure 2: Flow chart of diseases and causes extraction method.

[0012] Figure3: The overall flow chart of the invention.

DETAILED DESCRIPTION OF THE INVENTION

[0013] Figure 1 is the system block diagram of the system for extracting diseases and causes from medical text. Firstly, the unstructured medical text is input to the system through the text input unit in the reading module. Then, in the calculation module, the corresponding entity words are extracted by the disease extraction unit, the relation words extraction unit and the causes extraction unit. Finally, the extracted entities are structured and stored through the storage unit in the display module.

[0014] The diseases and causes extraction device includes an acquisition device, a processor, a memory and a computer program stored in the memory and running on the processor. When the processor executes the computer program, it realizes the steps of the method of extracting the diseases and causes, such as the steps shown in Fig. 2. When the processor executes the computer program, it also realizes the functions of the modules or units in the above device.

[0015] Figure 3 shows a flow chart of a method for extracting diseases and causes from medical texts. In this method, the model is trained several times by iteration, and the threshold parameters are updated to get the optimal model. This method can obtain the entity to be extracted by combining the other n-1 semantic elements in the sentence pattern, which not only improves the accuracy of entity extraction, but also effectively solves the problem that the length of an entity is hard to determine.
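The role of a threshold parameter in paragraph [0015] can be illustrated with a small sketch. The source does not say which quantity is thresholded or how the optimal value is chosen, so everything here is an assumption for illustration: candidates are filtered by how often they were extracted (`min_support`), and the threshold is selected by F1 score against a small hand-labeled set. The names `filter_by_support`, `f1_score` and `best_threshold` are hypothetical.

```python
from collections import Counter


def filter_by_support(candidates, min_support):
    """Keep candidates that were extracted at least min_support times."""
    counts = Counter(candidates)
    return {word for word, n in counts.items() if n >= min_support}


def f1_score(predicted, gold):
    """Harmonic mean of precision and recall between two sets."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def best_threshold(candidates, gold, thresholds):
    """Try each threshold and return the one whose filtered set scores best."""
    return max(thresholds, key=lambda t: f1_score(filter_by_support(candidates, t), gold))


# Toy data: each list entry is one extraction of a candidate relation word.
candidates = ["causes", "causes", "causes", "leads to", "leads to", "in"]
gold = {"causes", "leads to"}
chosen = best_threshold(candidates, gold, thresholds=[1, 2, 3])
```

A low threshold keeps noisy candidates such as "in", while a high one discards valid but rarer relation words; sweeping the threshold is one simple way to trade these off.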

Claims

1. An approach for extracting disease inducements and causes from medical text, consisting of: obtaining and storing medical texts and sentence patterns; extracting relation words and removing the impurities of the relation words, the cleaned relation words being incorporated into the existing relation word set; extracting and verifying the causes, the verified causes being incorporated into the cause set; extracting and verifying the diseases, the verified diseases being incorporated into the disease set.

2. According to claim 1, the impurity removal of relation words is based on the HanLP word segmentation tool for word segmentation and stop-word filtering, after which specific threshold filter conditions are added for further filtering.

3. According to claim 1, medical text acquisition is to obtain Chinese sentences in unstructured text through regular expressions.

4. According to claim 1, the disease and causes extraction model is trained iteratively over multiple rounds. At the same time, a threshold parameter setting is introduced. Finally, the optimal model is obtained by adjusting the parameters.

5. The invention relates to a device for extracting diseases and causes from medical text, which comprises an extractor, a processor, a memory and a computer program that is stored in the memory and can be run on the processor. When the processor executes the computer program, it can implement the steps of the method in claim 1.

6. A system for extracting disease inducements and causes from medical text, consisting of: a text input unit; a diseases extraction unit; a relation words extraction unit; a causes extraction unit; a memory unit; a display unit.
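Claims 2 and 3 mention regular expressions for obtaining Chinese sentences and stop-word filtering after word segmentation. The following is a minimal sketch of both ideas using only the Python standard library. It is not the patented implementation: the pattern, the stop-word list and the function names are assumptions made for illustration, and word segmentation (done with HanLP according to claim 2) is assumed to have already produced the token list.

```python
import re

# Assumed pattern: a sentence starts at a CJK character (basic CJK Unified
# Ideographs block) and runs to the next full-width sentence terminator.
SENTENCE_RE = re.compile(r"[\u4e00-\u9fff][^。！？\n]*[。！？]")

# Illustrative stop words only; a real list would be much longer.
STOP_WORDS = {"的", "了", "是"}


def chinese_sentences(text):
    """Return the Chinese sentences found in unstructured text."""
    return SENTENCE_RE.findall(text)


def drop_stop_words(tokens, stop_words=STOP_WORDS):
    """Remove stop words from an already segmented token list."""
    return [t for t in tokens if t not in stop_words]


text = "吸烟导致肺癌。Hello world! 石棉可引起间皮瘤。"
sentences = chinese_sentences(text)
tokens = drop_stop_words(["吸烟", "是", "肺癌", "的", "病因"])
```

Because each match must begin with a Chinese character, leading non-Chinese fragments such as "Hello world!" are skipped rather than attached to the following sentence.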