CN109524068A - Disease symptom extraction method based on the AC automaton - Google Patents

Info

Publication number
CN109524068A
Authority
CN
China
Prior art keywords
word
health record
electronic health
symptom
automatic machine
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811201375.9A
Other languages
Chinese (zh)
Inventor
李继云
王天磊
孙莉
俞捷
林靖生
乐嘉锦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Donghua University
National Dong Hwa University
Original Assignee
Donghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Donghua University
Priority to CN201811201375.9A
Publication of CN109524068A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G06F40/157 Transformation using dictionaries or tables
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Primary Health Care (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Medicines Containing Plant Substances (AREA)

Abstract

The present invention provides a disease symptom extraction method based on the AC automaton, comprising the following steps. Step 1: construct a dictionary tree (trie) from a symptom word dictionary. Step 2: construct the failure pointers, thereby implementing the AC automaton algorithm. Step 3: convert the electronic health record information into the UTF-8 encoding format. Step 4: match the symptom words in the electronic health record information using the AC automaton algorithm; if a symptom word is matched completely, mark and extract it, and continue reading the electronic health record information until the end is reached. Step 5: if one or more characters have been matched but the match does not complete, follow the failure pointer upward from the parent node of the current position in the symptom dictionary tree, and return to Step 4. The invention extracts symptom words from unstructured electronic health records effectively and quickly, which supports research on the automatic monitoring of adverse drug reactions and helps in designing and optimizing spontaneous adverse drug reaction reporting systems.

Description

Disease symptom extraction method based on the AC automaton
Technical Field
The present invention relates to the technical field of symptom matching in unstructured medical texts such as electronic health records, and more particularly to a disease symptom extraction method for use in adverse drug reaction detection.
Background
Extracting the post-medication symptom information contained in a patient's unstructured electronic health record information is the basis for the automatic monitoring of adverse drug reactions.
The Aho-Corasick automaton algorithm (AC automaton algorithm for short) derives from the dictionary tree (trie) algorithm and is one of the main multi-pattern matching algorithms. Its notable advantages include linear worst-case time complexity, high flexibility, tolerance of short patterns, and resistance to complexity attacks, which make it one of the online matching algorithms preferred by practitioners in the field.
The AC automaton algorithm is mainly applied in pattern matching, intrusion detection, and fast Chinese word segmentation. Its application to the extraction of disease symptom words, however, remains unexplored. Based on an understanding of its advantages, the present invention improves the AC automaton algorithm and applies it to the extraction of disease symptom words.
Summary of the Invention
The technical problem to be solved by the present invention is how to effectively and quickly extract, from long texts such as unstructured electronic health record information, the symptom words caused by adverse drug reactions that are contained in the record.
To solve the above technical problem, the technical solution of the present invention provides a disease symptom extraction method based on the AC automaton, characterized in that the method consists of the following 5 steps:
Step 1: construct a dictionary tree from a symptom word dictionary;
Step 2: construct the failure pointers, thereby implementing the AC automaton algorithm;
Step 3: convert the electronic health record information into the UTF-8 encoding format;
Step 4: use the AC automaton algorithm to match the symptom words caused by adverse drug reactions in the electronic health record information; if a symptom word from the symptom word dictionary is matched completely in the electronic health record information, mark and extract that symptom word, and continue reading the electronic health record information until the end is reached;
Step 5: if one or more characters have been matched but the match does not complete, follow the failure pointer upward from the parent node of the current position in the dictionary tree, and return to Step 4.
Preferably, in Step 1, when the dictionary tree is constructed, the ordered set of edges on the path from the root node to any node represents the corresponding prefix of a symptom word in the symptom word dictionary.
Preferably, the detailed process of Step 2 is as follows: set a pointer whose initial state points to the root node of the symptom word dictionary tree, and traverse the electronic health record information from front to back. For each character in the electronic health record information, if it is identical to the character corresponding to the pointer in the symptom word dictionary, the pointer moves to the child node for that character. Matching continues in a loop until it fails, at which point the same matching continues from the node that the failure pointer points to. Whenever a terminal node is encountered, a counter is incremented by 1.
In a big data environment, the method provided by the present invention can effectively and quickly extract the symptom words caused by adverse drug reactions from unstructured electronic health records, which supports research on the automatic monitoring of adverse drug reactions and helps in designing and optimizing spontaneous adverse drug reaction reporting systems.
Brief Description of the Drawings
Fig. 1 is an example of a dictionary tree constructed from the symptom word dictionary;
Fig. 2 is an example of the construction of failure pointers.
Detailed Description
The present invention is further explained below with reference to a specific embodiment.
This embodiment provides a disease symptom extraction method based on the AC automaton. The AC automaton is built by first constructing a dictionary tree from the symptom word dictionary and then constructing the failure pointers. Once the AC automaton has been built and the character string has been confirmed to be in the UTF-8 encoding format, the symptom words are matched.
The specific implementation process is:
Step 1: construct the dictionary tree from the symptom word dictionary. The ordered set of edges on the path from the root node to any node represents the corresponding prefix of a symptom word in the dictionary. As shown in Fig. 1, "tinnitus" and "earplug" share the common prefix "ear", and "dizziness" and "headache" share the common prefix "head" (in each case the shared first character of the Chinese words).
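The construction in Step 1 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the nested-dictionary layout, the `END` marker, and the function names are assumptions, and the four Chinese words are assumed back-translations of the glosses "tinnitus", "earplug", "dizziness", and "headache" used for Fig. 1.

```python
END = "<end>"  # hypothetical marker key: a symptom word ends at this node


def build_trie(symptom_words):
    """Build a dictionary tree as nested dicts, one level per character."""
    root = {}
    for word in symptom_words:
        node = root
        for ch in word:
            # Reuse the child if the prefix already exists, else create it.
            node = node.setdefault(ch, {})
        node[END] = word
    return root


def has_prefix(root, prefix):
    """Return True if some dictionary word starts with `prefix`."""
    node = root
    for ch in prefix:
        if ch not in node:
            return False
        node = node[ch]
    return True


# Assumed back-translations: tinnitus, earplug, dizziness, headache.
symptoms = ["耳鸣", "耳塞", "头晕", "头痛"]
trie = build_trie(symptoms)
```

Under these assumptions the root has only two children, one per shared first character, which mirrors the common prefixes described for Fig. 1.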
Second step carries out unsuccessfully the construction of pointer.A pointer is set, original state is directed toward the root node of symptom dictionary, Electronic health record information is traversed from front to back, for each of electronic health record information word, if with the pointer in symptom dictionary Corresponding word is identical, then pointer is directed toward the child node of the word, circulation is matched until failure, the node that unsuccessfully pointer is directed toward at this time Continue same matching, when encountering termination node, counter+1.As shown in Fig. 2, in the matching process, failure pointer is from " point Ear " jumps to " tinnitus ", does not return to origin and restarts to match, but from No. 4 position transfers to No. 3 positions, such algorithm Time complexity be it is linear, do not do any duplicate matching.
Step 3: convert the encoding format of the electronic health record information. Before symptom word matching begins, the electronic health record information is converted into the UTF-8 encoding format. Because the code points of Chinese characters fall in the hexadecimal range 0800-FFFF (the range that UTF-8 encodes in three bytes), each Chinese character is split into four hexadecimal digits, so that Chinese characters are represented with ASCII digits and English letters.
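The conversion in Step 3 might look like the sketch below. It assumes that the record arrives as UTF-8 bytes and that "four hexadecimal digits" refers to the character's code point written as four hex digits; both function names are hypothetical.

```python
def decode_record(raw: bytes) -> str:
    """Decode record bytes as UTF-8 (raises UnicodeDecodeError if invalid)."""
    return raw.decode("utf-8")


def to_hex_symbols(text: str) -> str:
    """Replace each character in the range 0800-FFFF by four hex digits.

    Characters outside that range (for example ASCII) are kept unchanged.
    """
    pieces = []
    for ch in text:
        code_point = ord(ch)
        if 0x0800 <= code_point <= 0xFFFF:
            pieces.append(format(code_point, "04X"))
        else:
            pieces.append(ch)
    return "".join(pieces)


# "中" is U+4E2D, so it becomes the four symbols 4, E, 2, D.
encoded = to_hex_symbols(decode_record("中".encode("utf-8")))
```

If this representation is used, the symptom word dictionary would have to be converted the same way before the tree is built. Plain concatenation is also ambiguous when the record itself contains literal text such as "4E2D"; the source does not describe a delimiter or escaping scheme.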
Step 4: match the character string. If a word in the symptom dictionary is matched completely in the electronic health record information, mark and extract the word, and continue reading the electronic health record information until the end is reached.
Step 5: if one or more characters have been matched but the match does not complete, follow the failure pointer upward from the parent node of the current position in the symptom dictionary tree, and return to Step 4.
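Steps 4 and 5 can be sketched together as a single left-to-right scan over the record. The table-based layout (`goto`, `fail`, `out`), the function names, and the sample record text are assumptions made for illustration; the patent does not specify data structures.

```python
from collections import deque


def build(words):
    """Return (goto, fail, out) tables; state 0 is the root."""
    goto, fail, out = [{}], [0], [[]]
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    queue = deque(goto[0].values())  # depth-1 states keep fail = 0
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
            queue.append(nxt)
    return goto, fail, out


def extract(text, automaton):
    """Return (start_index, word) for every dictionary word found in text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        # Step 5: a partial match failed, so follow failure pointers.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        # Step 4: a complete word ends here, so record it and keep reading.
        for word in out[state]:
            hits.append((i - len(word) + 1, word))
    return hits


automaton = build(["耳鸣", "耳塞", "头晕", "头痛"])
# Hypothetical record text: "after medication the patient had dizziness, tinnitus".
found = extract("患者服药后出现头晕、耳鸣", automaton)
```

Each character of the record is read once and the scan never moves backward, which is the behaviour that the description attributes to the failure pointers.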
The method provided in this embodiment uses the AC automaton to match and extract symptom words from imported unstructured electronic health record information. The method combines the ability of the dictionary tree to solve multi-pattern matching of words (short texts) with the ability of the KMP algorithm to solve single-pattern matching in long texts. By combining the advantages of these two pattern matching algorithms, it offers linear worst-case time complexity, good efficiency, high flexibility, and resistance to complexity attacks.
The above is only a preferred embodiment of the present invention and does not limit the invention in any form or substance. It should be noted that those of ordinary skill in the art may make improvements and supplements without departing from the method of the present invention, and these improvements and supplements shall also be regarded as falling within the scope of protection of the present invention. Any minor changes, modifications, and equivalent variations that those skilled in the art make using the technical content disclosed above, without departing from the spirit and scope of the present invention, are equivalent embodiments of the invention; likewise, any equivalent changes, modifications, and developments made to the above embodiment in accordance with the substantive technology of the present invention still fall within the scope of the technical solution of the present invention.

Claims (3)

1. A disease symptom extraction method based on the AC automaton, characterized in that the method consists of the following 5 steps:
Step 1: construct a dictionary tree from a symptom word dictionary;
Step 2: construct the failure pointers, thereby implementing the AC automaton algorithm;
Step 3: convert the electronic health record information into the UTF-8 encoding format;
Step 4: use the AC automaton algorithm to match the symptom words caused by adverse drug reactions in the electronic health record information; if a symptom word from the symptom word dictionary is matched completely in the electronic health record information, mark and extract that symptom word, and continue reading the electronic health record information until the end is reached;
Step 5: if one or more characters have been matched but the match does not complete, follow the failure pointer upward from the parent node of the current position in the dictionary tree, and return to Step 4.
2. The disease symptom extraction method based on the AC automaton according to claim 1, characterized in that: in Step 1, when the dictionary tree is constructed, the ordered set of edges on the path from the root node to any node represents the corresponding prefix of a symptom word in the symptom word dictionary.
3. The disease symptom extraction method based on the AC automaton according to claim 1, characterized in that the detailed process of Step 2 is: set a pointer whose initial state points to the root node of the dictionary tree, and traverse the electronic health record information from front to back; for each character in the electronic health record information, if it is identical to the character corresponding to the pointer in the symptom word dictionary, the pointer moves to the child node for that character; matching continues in a loop until it fails, at which point the same matching continues from the node that the failure pointer points to; whenever a terminal node is encountered, a counter is incremented by 1.
CN201811201375.9A, priority date 2018-10-16, filing date 2018-10-16: Disease symptom extraction method based on the AC automaton. Pending. CN109524068A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201811201375.9A (CN109524068A (en)) | 2018-10-16 | 2018-10-16 | Disease symptom extraction method based on the AC automaton

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201811201375.9A | 2018-10-16 | 2018-10-16 | Disease symptom extraction method based on the AC automaton

Publications (1)

Publication Number | Publication Date
CN109524068A (en) | 2019-03-26

Family

ID=65770865

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201811201375.9A (Pending, CN109524068A (en)) | Disease symptom extraction method based on the AC automaton | 2018-10-16 | 2018-10-16

Country Status (1)

Country Link
CN (1) CN109524068A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191103A (en) * 2019-12-30 2020-05-22 河南拓普计算机网络工程有限公司 Method, device and storage medium for identifying and analyzing enterprise subject information from internet
CN111341458A (en) * 2020-02-27 2020-06-26 国家卫生健康委科学技术研究所 Single-gene disease name recommendation method and system based on multi-level structure similarity
CN113555069A (en) * 2021-07-22 2021-10-26 杭州叙简科技股份有限公司 Chemical name retrieval and extraction method and device based on AC automaton
CN114580414A (en) * 2022-02-24 2022-06-03 医渡云(北京)技术有限公司 Entity identification method and device based on AC automaton and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102193914A (en) * 2011-05-26 2011-09-21 中国科学院计算技术研究所 Computer aided translation method and system
CN105183788A (en) * 2015-08-20 2015-12-23 及时标讯网络信息技术(北京)有限公司 Operation method for Chinese AC automatic machine based on retrieval of keyword dictionary tree
CN107392143A (en) * 2017-07-20 2017-11-24 中国科学院软件研究所 A kind of resume accurate Analysis method based on SVM text classifications
CN108021569A (en) * 2016-11-01 2018-05-11 中国移动通信有限公司研究院 The structure of AC automatic machines and Chinese multi-model matching method and relevant apparatus
CN105260354B (en) * 2015-08-20 2018-08-21 及时标讯网络信息技术(北京)有限公司 A kind of Chinese AC automatic machines working method based on keyword dictionary tree construction
CN108628907A (en) * 2017-03-24 2018-10-09 北京京东尚科信息技术有限公司 A method of being used for the Trie tree multiple-fault diagnosis based on Aho-Corasick

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191103A (en) * 2019-12-30 2020-05-22 河南拓普计算机网络工程有限公司 Method, device and storage medium for identifying and analyzing enterprise subject information from internet
CN111191103B (en) * 2019-12-30 2021-08-24 河南拓普计算机网络工程有限公司 Method, device and storage medium for identifying and analyzing enterprise subject information from internet
CN111341458A (en) * 2020-02-27 2020-06-26 国家卫生健康委科学技术研究所 Single-gene disease name recommendation method and system based on multi-level structure similarity
CN111341458B (en) * 2020-02-27 2020-11-03 国家卫生健康委科学技术研究所 Single-gene disease name recommendation method and system based on multi-level structure similarity
CN113555069A (en) * 2021-07-22 2021-10-26 杭州叙简科技股份有限公司 Chemical name retrieval and extraction method and device based on AC automaton
CN114580414A (en) * 2022-02-24 2022-06-03 医渡云(北京)技术有限公司 Entity identification method and device based on AC automaton and electronic equipment

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190326)