CN114360729A - Medical text information automatic extraction method based on deep neural network - Google Patents

Medical text information automatic extraction method based on deep neural network

Info

Publication number
CN114360729A
Authority
CN
China
Prior art keywords
data
word
model
training
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111413366.8A
Other languages
Chinese (zh)
Inventor
陈运文
纪达麒
唐文瀚
余海东
肖茂
许瑞玲
王俊
蔡冲
夏凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Daguan Data Chengdu Co ltd
Original Assignee
Daguan Data Chengdu Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Daguan Data Chengdu Co ltd
Priority to CN202111413366.8A
Publication of CN114360729A
Legal status: Pending

Landscapes

  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention relates to a method for automatically extracting medical text information. Historically accumulated extraction data is used as a labeled data set to build a deep neural network model that takes unstructured medical insurance text data as input and outputs the entity information and relationships specified by medical insurance auditors. This solves the problems of low efficiency and low accuracy that arise when auditors must manually collate or verify key medical insurance information during the auditing process.

Description

Medical text information automatic extraction method based on deep neural network
Technical Field
The invention relates to the field of artificial intelligence, in particular to a method for automatically extracting medical text information and a system based on the method.
Background
Medical insurance auditing adopts a knowledge-graph-based big data method that takes the full body of medical insurance data into account. The core step in constructing the knowledge graph is the automatic extraction of information. However, medical insurance audit data comes from many sources: the data is collected from medical insurance departments, health departments, centralized procurement agencies, designated medical institutions and external data, and its content varies, covering for example employee medical insurance, fund finances, medicines and materials.
Given such a large and complex volume of data, automatic information extraction is the key technical problem. Information extraction covers entities, entity relationships and entity attributes, and its output can be described as triples in S-P-O (Subject-Predicate-Object) form.
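As an illustration of the S-P-O form, the following minimal Python sketch represents one extracted fact as a triple. The class name `Triple` and its field names are hypothetical and chosen only for this example; the sample values are taken from Example 2 of this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One extracted fact in Subject-Predicate-Object (S-P-O) form."""
    subject: str
    predicate: str
    obj: str

    def as_tuple(self):
        # Plain tuple form, convenient for loading into a knowledge graph.
        return (self.subject, self.predicate, self.obj)


# Sample values borrowed from Example 2 of this document.
t = Triple(subject="Zhang XX", predicate="visit", obj="XXX hospital")
print(t.as_tuple())
```

Making the record immutable (`frozen=True`) lets triples be stored in sets or used as dictionary keys, which is handy when de-duplicating extraction results.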
In the prior art, information extraction for auditing and for building the knowledge graph relies mainly on manually extracting and organizing useful information from mass data, on structured data uploaded by the audited units, and on matching with rules or business logic. These approaches have several drawbacks: the authenticity and completeness of the data remain to be verified, a large amount of labor and time is required, and the result depends on familiarity with the business, while audits are subject to short deadlines and heavy workloads.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a novel method and system for automatically extracting medical text information based on DGCNN + Attention. The method and system read data from different medical insurance data sources, automatically extract the S-P-O entity relationship information required for auditing from complex data, and assist in building a medical insurance audit knowledge graph, with the advantages of high extraction speed and high accuracy.
In order to achieve the purpose of the invention, the technical scheme provided by the invention patent is as follows:
A method for automatically extracting medical text information based on a deep neural network, characterized in that historically accumulated extraction data is used as a labeled data set, a deep neural network model is built, unstructured medical insurance text data is input, and the entity information and relationships specified by medical insurance auditors are output.
In the automatic extraction method of medical text information based on the deep neural network, the method specifically comprises the following implementation steps:
in the training data preparation stage, collecting as much labeled corpus data as possible to form a data set, using the information used in audits of previous years' medical insurance data as standard data, labeling an unstructured text data set with a multi-pattern matching algorithm, and dividing the labeled data set into a training set and a test set at a ratio of 8:2;
in the data preprocessing stage, training a word vector model: using a word segmentation tool, including a Chinese word segmenter, to filter stop words from the training set and segment the text into words, training a Word2Vec word vector model, traversing the input text to obtain word IDs, assigning a randomly initialized word vector to each word ID, combining it with the trained word vector, and obtaining a mixed word vector through matrix transformation;
in the model training stage, using the mixed word vectors as input and the labeled relationships as output, training the deep neural network model iteratively for multiple rounds, and saving the trained model;
in the data prediction stage, inputting the text to be extracted into the trained model and outputting entity relationships of the form subject-predicate-object.
In the automatic extraction method of medical text information based on the deep neural network, the multi-pattern matching algorithm is an Aho-Corasick (AC) automaton.
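To show how an AC automaton finds many dictionary terms in one pass over a text, the following is a minimal pure-Python sketch of the Aho-Corasick algorithm. It is not the patent's implementation: the function names `build_automaton` and `find_all` are hypothetical, and how the matched terms are turned into S-P-O labels is not specified in the source and is left out here.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links and output lists for the patterns."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first traversal: depth-1 states keep failure link 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the matches that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every occurrence of any pattern in text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits


# Hypothetical term dictionary; the sentence is adapted from Example 1.
terms = ["emphysema", "bronchitis", "lymph node"]
automaton = build_automaton(terms)
print(find_all("bronchitis, emphysema; lymph node enlargement", automaton))
```

The text is scanned once regardless of how many terms are in the dictionary, which is why this family of algorithms suits labeling large corpora against a fixed list of known entities.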
In the automatic extraction method of medical text information based on the deep neural network, in the model training stage, position coding is combined to form the model input, denoted E. E is input into a 12-layer deep neural network model structure, and the computed output is denoted H1. H1 is passed into a self-attention layer and then through a convolutional layer and a fully connected layer to predict the head and tail positions of S. A labeled S is randomly sampled, and the corresponding sub-vector of H1 is input into a bidirectional sequence model to obtain the coding vector of S, which has the same length as the input sequence. H1 is then passed into another self-attention layer, and the output vectors are concatenated and denoted H2. H2 is passed through a convolutional layer and a fully connected layer, and finally a double Sigmoid structure is used as the activation function to predict the positions of O and P. The trained model is saved locally.
Based on the technical scheme, compared with the prior art, the medical text information automatic extraction method based on the deep neural network and the system based on the method have the following technical effects:
1. The model architecture uses only a convolutional network structure, an attention mechanism and a short LSTM structure, so the model runs quickly and efficiently.
2. The algorithm framework is end-to-end: relation extraction is completed in a single step, with end-to-end model training and prediction, which is greatly superior to the existing two-step extraction mode of first extracting the entities and then obtaining their relationships.
3. Double Sigmoid function output is adopted, which enables S-P-O extraction tasks involving multiple kinds of relationships.
Drawings
Fig. 1 is a schematic flow chart of an implementation of the method for automatically extracting medical text information based on a deep neural network according to the present invention.
Detailed Description
The method for automatically extracting medical text information and the system based on the method according to the present invention will be described in further detail with reference to the accompanying drawings and specific embodiments, so as to clearly understand the operation process and the processing manner thereof, but the protection scope of the present invention is not limited thereby.
According to the method, historical accumulated extracted data is used as a labeled data set, a deep neural network model based on DGCNN + Attention is built, unstructured text data of the medical insurance are input, entity information and relationships set by a specific medical insurance auditor are output, and therefore the problem that key medical insurance information needs to be manually sorted or verified by the auditor in the auditing process is solved.
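The source names DGCNN but does not define its layers. Assuming the name refers to a dilated gated convolutional network, the sketch below shows one gated block on a scalar sequence in pure Python. The gating form `x * (1 - g) + conv(x) * g`, the kernel size of 3, the zero padding and all function names are assumptions made for illustration, not details stated in this document.

```python
import math


def dilated_conv1d(x, kernel, dilation):
    """Zero-padded 1-D convolution with a size-3 kernel; output length equals input length."""
    n = len(x)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            # k = 0, 1, 2 maps to offsets -dilation, 0, +dilation.
            j = i + (k - 1) * dilation
            if 0 <= j < n:
                acc += w * x[j]
        out.append(acc)
    return out


def gated_block(x, kernel_value, kernel_gate, dilation):
    """Mix the input with its convolution, weighted by a sigmoid gate per position."""
    value = dilated_conv1d(x, kernel_value, dilation)
    gate = [1.0 / (1.0 + math.exp(-g))
            for g in dilated_conv1d(x, kernel_gate, dilation)]
    return [xi * (1.0 - g) + v * g for xi, v, g in zip(x, value, gate)]


# Hypothetical input and kernels.
print(gated_block([1.0, 2.0, 3.0, 4.0], [0.5, 0.0, 0.5], [0.1, 0.2, 0.1], dilation=2))
```

Increasing the dilation from layer to layer widens the receptive field without adding parameters, which is one plausible reason a convolution-only encoder can stand in for a long recurrent network.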
A medical text information automatic extraction method based on a deep neural network comprises a training data preparation stage, a data preprocessing stage, a model training stage and a data prediction stage.
In the automatic extraction method of medical text information based on the deep neural network, the method specifically comprises the following implementation steps:
In the training data preparation stage, as much labeled corpus data as possible is collected to form a data set. The information used in audits of previous years' medical insurance data serves as the standard data, an unstructured text data set is labeled with a multi-pattern matching algorithm, and the labeled data set is divided into a training set and a test set at a ratio of 8:2. In this embodiment, the multi-pattern matching algorithm is an AC automaton, a typical multi-pattern matching algorithm.
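The 8:2 division described above can be sketched as follows. The shuffling step and the fixed seed are assumptions added for reproducibility; the source states only the ratio. The function name `split_dataset` is hypothetical.

```python
import random


def split_dataset(samples, train_ratio=0.8, seed=42):
    """Shuffle the labeled samples and split them into training and test sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # local RNG, does not touch global state
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]


# Hypothetical labeled samples, identified here only by index.
train_set, test_set = split_dataset(range(10))
print(len(train_set), len(test_set))
```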
In the data preprocessing stage, a word vector model is trained. A word segmentation tool, including a Chinese word segmenter, is used to filter stop words from the training set and segment the text into words, and a Word2Vec word vector model is trained. The input text is traversed to obtain word IDs, a randomly initialized word vector is assigned to each word ID and combined with the trained word vector, and a mixed word vector is obtained through matrix transformation. Concretely, the word ID sequence is loaded and passed through a randomly initialized word vector layer to obtain word vectors of a specified dimension.
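One possible reading of the mixed word vector is sketched below: the pretrained Word2Vec vector is projected by a matrix to the model dimension and added to the randomly initialized vector of the same word ID. The source says only that the two are combined "through matrix transformation", so the use of a sum, the projection matrix `proj` and the function name `mixed_word_vectors` are all assumptions.

```python
import random


def mixed_word_vectors(word_ids, pretrained, rand_embed, proj):
    """For each word ID, return rand_embed[id] + proj @ pretrained[id]."""
    mixed = []
    for wid in word_ids:
        w2v = pretrained[wid]
        # Matrix-vector product: one output value per row of proj.
        projected = [sum(p * x for p, x in zip(row, w2v)) for row in proj]
        mixed.append([r + q for r, q in zip(rand_embed[wid], projected)])
    return mixed


# Hypothetical sizes: 4 words, 3-dimensional Word2Vec vectors, 2-dimensional model.
rng = random.Random(0)
vocab_size, w2v_dim, model_dim = 4, 3, 2
pretrained = [[rng.uniform(-1, 1) for _ in range(w2v_dim)] for _ in range(vocab_size)]
rand_embed = [[rng.uniform(-1, 1) for _ in range(model_dim)] for _ in range(vocab_size)]
proj = [[rng.uniform(-1, 1) for _ in range(w2v_dim)] for _ in range(model_dim)]

mixed = mixed_word_vectors([2, 0, 3], pretrained, rand_embed, proj)
print(len(mixed), len(mixed[0]))
```

In a trainable model the random embedding and the projection matrix would be learned while the pretrained vectors stay fixed, so the prior knowledge in Word2Vec is kept and only adjusted by the task.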
In the model training stage, the mixed word vectors are used as input and the labeled relationships as output, the deep neural network model is trained iteratively for multiple rounds, and the trained model is saved. Specifically, position coding in the form of Position Embedding is combined to form the model input, denoted E. E is input into a 12-layer deep neural network model structure, and the computed output is denoted H1. H1 is passed into a Self-Attention layer and then through a convolutional layer (CNN) and a fully connected layer (Dense) to predict the head and tail positions of S. A labeled S is randomly sampled, and the corresponding sub-vector of H1 is input into a bidirectional LSTM sequence model to obtain the coding vector of S, which has the same length as the input sequence. H1 is then passed into another Self-Attention layer, and the output vectors are concatenated and denoted H2. H2 is passed through a convolutional layer (CNN) and a fully connected layer (Dense), and finally a double Sigmoid structure is used as the activation function to predict the positions of O and P. The trained model is saved locally. The double Sigmoid structure is built on the Sigmoid function, a commonly used activation function.
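The source does not spell out the double Sigmoid structure. A common reading, assumed here, is a pair of independent sigmoid outputs per token and per predicate: one scoring the start of an object and one scoring its end. The sketch below decodes such outputs into (predicate, start, end) spans; the 0.5 threshold, the nearest-end pairing rule and the function names are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_objects(start_logits, end_logits, threshold=0.5):
    """start_logits[i][p] and end_logits[i][p] score token i for predicate p.

    Returns (predicate, start, end) spans, pairing each start with the
    nearest end at or after it for the same predicate.
    """
    n = len(start_logits)
    spans = []
    for i in range(n):
        for p, s in enumerate(start_logits[i]):
            if sigmoid(s) < threshold:
                continue
            for j in range(i, n):
                if sigmoid(end_logits[j][p]) >= threshold:
                    spans.append((p, i, j))
                    break
    return spans


# Hypothetical logits for 3 tokens and 2 predicates.
start_logits = [[-5.0, 5.0], [-5.0, -5.0], [-5.0, -5.0]]
end_logits = [[-5.0, -5.0], [-5.0, 5.0], [-5.0, -5.0]]
print(decode_objects(start_logits, end_logits))
```

Because every position and predicate is scored independently rather than through a softmax over one label set, a single subject can yield several objects under several predicates, which matches the stated goal of handling multiple kinds of relationships.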
In the data prediction stage, the text to be extracted is input into the trained model, which outputs entity relationships of the form subject-predicate-object.
As shown in fig. 1, in practical application, the automatic extraction method of medical text information based on deep neural network includes the following operation steps:
Step 1: a requirement for automatic extraction of medical text information is raised, and the extraction process starts;
Step 2: medical data sets from previous years are collected;
Step 3: the relation entities are labeled, namely subjects, predicates and objects;
Step 4: word segmentation is performed and a word vector model is trained;
Step 5: mixed word vectors are obtained;
Step 6: the sequence neural network entity relationship model is built and trained;
Step 7: text is input and the entity relationships present in the text are predicted;
Step 8: prediction is completed and the medical text information extraction operation ends.
Example 1
After model training is complete, we take the following medical text information input as a test:
First, the input content: 1. Bronchitis, emphysema; 2. Mass in the left superior lung lobe, peripheral lung cancer considered; enlarged left pulmonary portal lymph node, metastasis considered; 3. Changes in the right lung lobe, hypoplasia considered; 4. Left medial scapular elastofibroma; 5. Tracheal diverticulum; 6. Low-density focus in the right lobe of the thyroid; thickened wall of the antrum; please correlate with clinical findings. A roughly round mass shadow is seen in the left superior lung lobe, about 2.0 x 3.0CM in size with a CT value of about 32HU; the CT values in the three phases of the CT scan are 43HU, 53HU and 75HU respectively, and obstruction and stenosis of some bronchial branches are seen. The volume of the right lung lobe is reduced, with a flaky high-density shadow in which slightly dilated bronchi are seen. The permeability of both lungs is increased, and multiple saccular translucent areas are seen in both lung fields; the markings of both lungs are sparse and disordered. The left pulmonary portal lymph node is slightly enlarged, about 1.4CM in diameter. Multiple small lymph nodes are seen in the mediastinum. No signs of fluid accumulation are seen in either thoracic cavity. The aorta and coronary arteries are calcified. A lamellar soft tissue density shadow of approximately 2.2CM x 5.1CM is seen on the medial infrascapular side of the right shoulder blade. A diverticulum of the trachea is seen. The density of the right lobe of the thyroid is reduced, and its degree of enhancement is lower than that of normal thyroid tissue. The wall of the antrum is thickened.
Second, the extraction procedure:
1. After stop words are filtered out, the text is segmented with a Chinese word segmentation tool. The output is ["bronchus", "tracheitis", "emphysema", "left lung", …]
2. The trained word vector model is read to obtain the word vectors. The output is [[0.001, 0.089, -0.201, …], [0.121, -0.012, -0.314, …], [-0.809, 0.121, 0.214, …], …]
3. Each word of the text is traversed and a word vector is randomly initialized. The output is [[0.121, 0.251, -0.129, …], [-0.901, -0.252, -0.124, …], [0.124, 0.853, 0.982, …], …]
4. The mixed word vectors are obtained according to the preprocessing method. The output is [[0.321, 0.261, -0.156, …], [-0.081, -0.004, -0.094, …], [0.024, -0.813, -0.782, …], …]
5. The mixed word vectors are input into the trained neural network model, which outputs the start position probabilities of the subject [0.002, 0.208, 0.1023, …] and the end position probabilities of the subject [0.001, 0.001, 0.005, …, 0.238, 0.001]. The position with the maximum start probability and the position with the maximum end probability are connected to obtain the subject, which is the left superior lung lobe. The predicate and the object are obtained in the same way.
6. The final output is [left superior lung lobe, left pulmonary portal lymph node, 2.0 x 3.0CM]
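Step 5 of the procedure above, connecting the most probable start position to the most probable end position, can be sketched as follows. The token list and probability values are made up for illustration, and restricting the end search to positions at or after the start is an added assumption that keeps the span well-formed. The function name `decode_span` is hypothetical.

```python
def decode_span(tokens, start_probs, end_probs):
    """Return the tokens between the most probable start and end positions."""
    start = max(range(len(start_probs)), key=start_probs.__getitem__)
    # Assumption: the end is searched only at or after the chosen start.
    end = max(range(start, len(end_probs)), key=end_probs.__getitem__)
    return tokens[start:end + 1]


# Made-up tokens and probabilities, one value per token.
tokens = ["bronchitis", "left", "superior", "lung", "lobe"]
start_probs = [0.002, 0.208, 0.1023, 0.010, 0.010]
end_probs = [0.001, 0.001, 0.005, 0.010, 0.238]
print(" ".join(decode_span(tokens, start_probs, end_probs)))
```

For Chinese text the selected tokens would be joined without spaces; the space-separated join here is only for the English rendering of the example.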
Third, the information extraction result:
left superior lung lobe (S primary tumor site) left pulmonary portal lymph node (P left pulmonary portal lymph node) 2.0 x 3.0CM (O primary lesion size).
Example 2
Medical text input: One month ago, the patient Zhang XX developed blood-stained nasal discharge on sniffing back, without obvious cause and without symptoms such as nasal obstruction, facial numbness, double vision, hearing loss or headache. The patient visited the local XXX hospital for diagnosis and treatment, where nasopharyngoscopy was performed and a biopsy was taken: undifferentiated non-keratinizing carcinoma.
Information extraction result: Zhang XX (S patient name) visit (P patient and hospital relationship) XXX hospital (O visit hospital name)
The extraction procedure for this example is the same as in Example 1.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (4)

1. A medical text information automatic extraction method based on a deep neural network is characterized in that historical accumulated extraction data are used as a labeling data set, a deep neural network model is built, medical insurance unstructured text data are input, and entity information and relationships set by specific medical insurance auditors are output.
2. The automatic extraction method of medical text information based on deep neural network as claimed in claim 1, wherein the method specifically comprises the following implementation steps:
in the training data preparation stage, collecting as much labeled corpus data as possible to form a data set, using the information used in audits of previous years' medical insurance data as standard data, labeling an unstructured text data set with a multi-pattern matching algorithm, and dividing the labeled data set into a training set and a test set at a ratio of 8:2;
in the data preprocessing stage, training a word vector model: using a word segmentation tool, including a Chinese word segmenter, to filter stop words from the training set and segment the text into words, training a Word2Vec word vector model, traversing the input text to obtain word IDs, assigning a randomly initialized word vector to each word ID, combining it with the trained word vector, and obtaining a mixed word vector through matrix transformation;
in the model training stage, using the mixed word vectors as input and the labeled relationships as output, training the deep neural network model iteratively for multiple rounds, and saving the trained model;
in the data prediction stage, inputting the text to be extracted into the trained model and outputting entity relationships of the form subject-predicate-object.
3. The method as claimed in claim 2, wherein the multi-pattern matching algorithm is an Aho-Corasick automaton.
4. The method as claimed in claim 2, wherein in the model training stage, position coding is combined to form the model input, denoted E; E is input into a 12-layer deep neural network model structure, and the computed output is denoted H1; the H1 vector is passed into a self-attention layer, and the head and tail positions of S are then predicted; a labeled S is randomly sampled, and the corresponding sub-vector of H1 is input into a bidirectional sequence model to obtain the coding vector of S, which has the same length as the input sequence; H1 is passed into another self-attention layer, and the output vectors are concatenated and denoted H2; the concatenated H2 is passed through a convolutional layer and a fully connected layer; finally, a double Sigmoid function structure is used to predict the positions of O and P, and the trained model is saved locally.
CN202111413366.8A 2021-11-25 2021-11-25 Medical text information automatic extraction method based on deep neural network Pending CN114360729A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111413366.8A CN114360729A (en) 2021-11-25 2021-11-25 Medical text information automatic extraction method based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111413366.8A CN114360729A (en) 2021-11-25 2021-11-25 Medical text information automatic extraction method based on deep neural network

Publications (1)

Publication Number Publication Date
CN114360729A true CN114360729A (en) 2022-04-15

Family

ID=81096257

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111413366.8A Pending CN114360729A (en) 2021-11-25 2021-11-25 Medical text information automatic extraction method based on deep neural network

Country Status (1)

Country Link
CN (1) CN114360729A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020211275A1 (en) * 2019-04-18 2020-10-22 五邑大学 Pre-trained model and fine-tuning technology-based medical text relationship extraction method
CN111666350A (en) * 2020-05-28 2020-09-15 浙江工业大学 Method for extracting medical text relation based on BERT model
CN112487807A (en) * 2020-12-09 2021-03-12 重庆邮电大学 Text relation extraction method based on expansion gate convolution neural network
CN113360671A (en) * 2021-06-16 2021-09-07 浙江工业大学 Medical insurance medical document auditing method and system based on knowledge graph
CN113486667A (en) * 2021-07-26 2021-10-08 辽宁工程技术大学 Medical entity relationship joint extraction method based on entity type information

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116306589A (en) * 2023-05-10 2023-06-23 之江实验室 Method and device for medical text error correction and intelligent extraction of emergency scene
CN116306589B (en) * 2023-05-10 2024-02-09 之江实验室 Method and device for medical text error correction and intelligent extraction of emergency scene

Similar Documents

Publication Publication Date Title
CN109471895B (en) Electronic medical record phenotype extraction and phenotype name normalization method and system
CN109635280A (en) A kind of event extraction method based on mark
CN110390021A (en) Drug knowledge mapping construction method, device, computer equipment and storage medium
CN110032739A (en) Chinese electronic health record name entity abstracting method and system
CN109885824A (en) A kind of Chinese name entity recognition method, device and the readable storage medium storing program for executing of level
CN105184053B (en) A kind of automatic coding and system of Chinese medical service item information
CN112560478B (en) Chinese address Roberta-BiLSTM-CRF coupling analysis method using semantic annotation
CN111651991B (en) Medical named entity identification method utilizing multi-model fusion strategy
CN108182972A (en) The intelligent coding method and system of Chinese medical diagnosis on disease based on participle network
US11972214B2 (en) Method and apparatus of NER-oriented chinese clinical text data augmentation
CN113051399B (en) Small sample fine-grained entity classification method based on relational graph convolutional network
CN114091450B (en) Judicial domain relation extraction method and system based on graph convolution network
CN114360729A (en) Medical text information automatic extraction method based on deep neural network
CN115510236A (en) Chapter-level event detection method based on information fusion and data enhancement
CN116049459A (en) Cross-modal mutual retrieval method, device, server and storage medium
CN114510928B (en) Universal information extraction method and system based on unified structure generation
CN114638228A (en) Chinese named entity recognition method based on word set self-attention
CN113254602B (en) Knowledge graph construction method and system for science and technology policy field
CN112069825B (en) Entity relation joint extraction method for alert condition record data
CN117708339A (en) ICD automatic coding method based on pre-training language model
Feldman et al. VesselVAE: Recursive Variational Autoencoders for 3D Blood Vessel Synthesis
CN111798324A (en) Medical insurance fraud discovery method based on dynamic hospitalizing behavior alignment
CN110502236A (en) Based on the decoded front-end code generation method of Analysis On Multi-scale Features, system and equipment
CN110364255A (en) A kind of hepatopathy appraisal procedure based on self-encoding encoder
CN113157255B (en) Code generation method for syntax tree decoder

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination