CN114360729A - Medical text information automatic extraction method based on deep neural network

Info

Publication number: CN114360729A
Application number: CN202111413366.8A
Authority: CN
Inventors: 陈运文; 纪达麒; 唐文瀚; 余海东; 肖茂; 许瑞玲; 王俊; 蔡冲; 夏凯
Original assignee: Daguan Data Chengdu Co ltd
Current assignee: Daguan Data Chengdu Co ltd
Priority date: 2021-11-25
Filing date: 2021-11-25
Publication date: 2022-04-15

Abstract

The invention relates to a method for automatically extracting medical text information. Historically accumulated extraction data serve as the labeled data set for building a deep neural network model, which takes unstructured medical insurance text data as input and outputs the entity information and relationships specified by medical insurance auditors. This solves the low efficiency and low accuracy that arise when auditors must manually collate or verify key medical insurance information during an audit.

Description

Medical text information automatic extraction method based on deep neural network

Technical Field

The invention relates to the field of artificial intelligence, in particular to a method for automatically extracting medical text information and a system based on the method.

Background

Insurance auditing adopts a knowledge-graph-based big data method that audits against the full body of medical insurance data. The core step in building the knowledge graph is automatic information extraction. However, medical insurance audit data come from many sources: the data are collected from medical insurance departments, health departments, centralized procurement agencies, designated medical institutions and external data, and their content varies, covering, for example, employee medical insurance, fund finances, medicines and materials.

Faced with data of such volume and complexity, the key technical question is how to extract information automatically. Information extraction covers entities, entity relationships and entity attributes, and its output can be expressed as triples in S-P-O (Subject-Predicate-Object) form.
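As a minimal illustration of the S-P-O form, a triple can be held in a small immutable record. The class name `Triple` and its field names are hypothetical and chosen only for this sketch; the sample values come from the hospital-visit example given later in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One extracted fact in subject-predicate-object (S-P-O) form."""
    subject: str    # S: the entity the statement is about
    predicate: str  # P: the relation or attribute name
    obj: str        # O: the related entity or attribute value


# Sample values taken from the hospital-visit example in this document.
visit = Triple(subject="Zhang XX", predicate="visit", obj="XXX hospital")
print(visit)
```

A knowledge graph can then be assembled from a collection of such triples, with subjects and objects as nodes and predicates as edges.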

In the prior art, information extraction for auditing and knowledge graph construction mainly consists of manually extracting and organizing useful information from massive data, relying on structured data uploaded by the audited units, and matching with rules or business logic. The drawbacks of this approach are that the authenticity and completeness of the data remain to be verified, that it requires a great deal of labor and time, and that it depends on familiarity with the business, while audits have tight deadlines and heavy workloads.

Disclosure of Invention

The invention aims to overcome the shortcomings of the prior art by providing a novel automatic medical text information extraction method and system based on DGCNN + Attention. The method and system read data from different medical insurance data sources, automatically extract the S-P-O entity relation information required for auditing from complex data, and assist in building a medical insurance audit knowledge graph, with the advantages of high extraction speed and high accuracy.

To achieve this purpose, the technical solution provided by the invention is as follows:

A medical text information automatic extraction method based on a deep neural network, characterized in that historically accumulated extraction data are used as the labeled data set, a deep neural network model is built, unstructured medical insurance text data are input, and the entity information and relationships specified by medical insurance auditors are output.

The method specifically comprises the following implementation steps:

In the training data preparation stage, collect as much labeled corpus data as possible to form a data set; take the information used for auditing in previous years' medical insurance data as the reference data; label the unstructured text data set with a multi-pattern matching algorithm; and divide the labeled data set into a training set and a test set at a ratio of 8:2;

In the data preprocessing stage, train a word vector model: use a word segmentation tool, including a Chinese word segmenter, to segment the training set and filter out its stop words; train a Word2Vec word vector model; traverse the input text to obtain word IDs; give each word ID a randomly initialized word vector; and combine it with the trained word vector through a matrix transformation to obtain a mixed word vector;

In the model training stage, use the mixed word vectors as input and the labeled relations as output, train the deep neural network model for multiple iterations, and save the trained model;

In the data prediction stage, input the text to be extracted into the trained model and output the entity relationships in the form subject-predicate-object.

In the above method, the multi-pattern matching algorithm is an Aho-Corasick (AC) automaton.
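To make the labeling step concrete, the following is a minimal pure-Python sketch of an Aho-Corasick automaton that finds every occurrence of a set of known terms in a text in a single pass. It is not the patent's implementation: the function names `build_automaton` and `find_all` and the sample terms are assumptions made for illustration, and the sketch matches character by character.

```python
from collections import deque


def build_automaton(patterns):
    """Build the goto, failure and output tables for a list of patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # patterns recognised on reaching each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first traversal: failure links of shallow states are final
    # before they are used for deeper states.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start, end, pattern) for every match; end is exclusive."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, i + 1, pat))
    return hits


# Hypothetical dictionary of known terms used to label raw text.
terms = ["bronchitis", "emphysema", "lymph node"]
automaton = build_automaton(terms)
for start, end, term in find_all("bronchitis, emphysema", automaton):
    print(start, end, term)
```

The matched spans can then be turned into entity labels for the training corpus. The benefit over checking each term separately is that the text is scanned only once regardless of how many terms the dictionary holds.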

In the above method, in the model training stage, the mixed word vectors are combined with position encoding to form the model input, denoted E. E is fed into a 12-layer deep neural network structure, and the resulting output is denoted H1. H1 is passed through a self-attention layer and then through a convolutional layer and a fully connected layer to predict the start and end positions of S. A labeled S is randomly sampled, and the corresponding sub-vectors of H1 are fed into a bidirectional sequence model to obtain the encoding vector of S, which has the same length as the input sequence. H1 is also passed through another self-attention layer, and the output vectors are concatenated and denoted H2. H2 is passed through a convolutional layer and a fully connected layer, and finally a double Sigmoid structure is used as the activation function to predict the positions of O and P. The trained model is saved locally.

Based on the above technical solution, compared with the prior art, the method and the system based on it have the following technical effects:

1. The model architecture uses only convolutional network structures, an attention mechanism and a short LSTM structure, so the model runs fast.

2. The algorithm framework is end-to-end: relation extraction is completed in a single step, with end-to-end model training and prediction, which is greatly superior to the existing two-step approach of first extracting entities and then obtaining relations.

3. A double Sigmoid function output is adopted, which enables S-P-O extraction tasks involving multiple relations.
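The text does not spell out how the double Sigmoid output handles multiple relations. One plausible reading, sketched here as an assumption, is that every token position gets two independent sigmoid scores per predicate type, one for "an object starts here" and one for "an object ends here". Because the scores are independent rather than normalized across positions, several relations can fire on the same sentence. The function name `decode_objects`, the 0.5 threshold and the toy logits are all hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_objects(start_logits, end_logits, predicates, threshold=0.5):
    """Decode (predicate, start, end) spans from two sigmoid heads.

    start_logits[t][p] and end_logits[t][p] are raw scores for token t
    and predicate p; end is an inclusive token index.
    """
    results = []
    n = len(start_logits)
    for p, name in enumerate(predicates):
        for s in range(n):
            if sigmoid(start_logits[s][p]) < threshold:
                continue
            # Pair each start with the nearest end at or after it.
            for e in range(s, n):
                if sigmoid(end_logits[e][p]) >= threshold:
                    results.append((name, s, e))
                    break
    return results


# Toy scores for a 4-token sentence and two hypothetical predicate types.
predicates = ["lesion size", "visit"]
start_logits = [[-5, -5], [3, -5], [-5, -5], [-5, 4]]
end_logits = [[-5, -5], [-5, -5], [2, -5], [-5, 3]]
print(decode_objects(start_logits, end_logits, predicates))
```

In this reading the predicate is identified by which output column fires, so P and O are predicted together in one pass.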

Drawings

Fig. 1 is a schematic flow chart of an implementation of the method for automatically extracting medical text information based on a deep neural network according to the present invention.

Detailed Description

The method for automatically extracting medical text information and the system based on it are described in further detail below with reference to the accompanying drawing and specific embodiments, so that the operation process and processing approach can be clearly understood; the protection scope of the invention is not limited thereby.

In this method, historically accumulated extraction data are used as the labeled data set, and a deep neural network model based on DGCNN + Attention is built. Unstructured medical insurance text data are input, and the entity information and relationships specified by medical insurance auditors are output, thereby solving the problem that auditors must manually collate or verify key medical insurance information during an audit.

A medical text information automatic extraction method based on a deep neural network comprises a training data preparation stage, a data preprocessing stage, a model training stage and a data prediction stage.

In the training data preparation stage, as much labeled corpus data as possible is collected to form a data set. The information used for auditing in previous years' medical insurance data is taken as the reference data, the unstructured text data set is labeled with a multi-pattern matching algorithm, and the labeled data set is divided into a training set and a test set at a ratio of 8:2. In this embodiment, the multi-pattern matching algorithm is an AC automaton, a typical multi-pattern matching algorithm.
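The 8:2 division can be sketched as a shuffled split. The text does not say whether the data are shuffled or how the boundary is rounded, so the shuffling, the fixed seed and the function name `split_dataset` are assumptions made for this example.

```python
import random


def split_dataset(samples, train_ratio=0.8, seed=42):
    """Shuffle a copy of the samples and cut it into train and test parts."""
    shuffled = list(samples)
    # A dedicated Random instance keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Ten placeholder sample IDs stand in for labeled texts.
train_set, test_set = split_dataset(range(10))
print(len(train_set), len(test_set))
```

Fixing the seed means the same samples land in the test set on every run, which keeps evaluation results comparable between training runs.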

In the data preprocessing stage, a word vector model is trained. A word segmentation tool, including a Chinese word segmenter, is used to segment the training set and filter out its stop words, and a Word2Vec word vector model is trained. The input text is traversed to obtain word IDs, each word ID is given a randomly initialized word vector, and this is combined with the trained word vector through a matrix transformation to obtain the mixed word vector. That is, the word ID sequence is loaded and passed through a randomly initialized word vector layer to obtain word vectors of a specified dimension.
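The text does not give the formula for the mixed word vector. One reading, sketched here as an assumption, is that the pretrained Word2Vec vector is projected by a matrix to the dimension of the randomly initialized vector and the two are added. All names, dimensions and numeric values in this pure-Python sketch are illustrative.

```python
import random


def make_matrix(rows, cols, rng):
    """Create a rows x cols matrix of small random values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def project(vec, matrix):
    """Multiply a length-k vector by a k x d matrix, giving a length-d vector."""
    cols = len(matrix[0])
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec)))
            for j in range(cols)]


def mixed_vectors(tokens, vocab, pretrained, embed, proj):
    """Add each token's random-init vector to its projected pretrained vector."""
    mixed = []
    for tok in tokens:
        base = embed[vocab[tok]]                  # randomly initialized, by ID
        extra = project(pretrained[tok], proj)    # pretrained, after projection
        mixed.append([b + e for b, e in zip(base, extra)])
    return mixed


rng = random.Random(0)
tokens = ["bronchitis", "emphysema"]
vocab = {tok: i for i, tok in enumerate(tokens)}   # token -> ID
# Toy 3-dimensional stand-ins for trained Word2Vec vectors.
pretrained = {"bronchitis": [0.001, 0.089, -0.201],
              "emphysema": [0.121, -0.012, -0.314]}
embed = make_matrix(len(vocab), 4, rng)   # one 4-dimensional row per ID
proj = make_matrix(3, 4, rng)             # maps 3 dimensions to 4
mixed = mixed_vectors(tokens, vocab, pretrained, embed, proj)
print(len(mixed), len(mixed[0]))
```

In a trained model the random-init rows and the projection matrix would be learned parameters, while the pretrained vectors would typically stay fixed.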

In the model training stage, the mixed word vectors are used as input and the labeled relations as output; the deep neural network model is trained for multiple iterations and the trained model is saved. Specifically, the mixed word vectors are combined with a Position Embedding as position encoding to form the model input, denoted E. E is fed into a 12-layer deep neural network structure, and the resulting output is denoted H1. H1 is passed through a Self-Attention layer and then through a convolutional layer (CNN) and a fully connected layer (Dense) to predict the start and end positions of S. A labeled S is randomly sampled, and the corresponding sub-vectors of H1 are fed into a bidirectional LSTM sequence model to obtain the encoding vector of S, which has the same length as the input sequence. H1 is also passed through another Self-Attention layer, and the output vectors are concatenated and denoted H2. H2 is passed through a convolutional layer (CNN) and a fully connected layer (Dense), and finally a double Sigmoid structure is used as the activation function to predict the positions of O and P. The trained model is saved locally. The double Sigmoid structure is a commonly used activation function.
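The step that conditions the O and P prediction on a sampled S can be sketched in plain Python. Two simplifications are made and should be read as assumptions: mean pooling stands in for the bidirectional LSTM, and the concatenation is taken to join each attended token vector with the subject code repeated at every position, which is one way to obtain a subject encoding of the same length as the input sequence. The function names are hypothetical.

```python
def encode_subject(h1, start, end):
    """Stand-in for the bidirectional LSTM: mean-pool the subject's vectors.

    h1 is a list of token vectors; start and end are inclusive indices.
    """
    span = h1[start:end + 1]
    dim = len(h1[0])
    return [sum(vec[d] for vec in span) / len(span) for d in range(dim)]


def build_h2(attended, h1, start, end):
    """Concatenate the subject code onto every attended token vector."""
    subj = encode_subject(h1, start, end)
    # List concatenation widens each token vector by the subject code.
    return [tok + subj for tok in attended]


# Toy H1 for a 3-token input with 2-dimensional features.
h1 = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
# For brevity H1 itself stands in for the second self-attention output.
h2 = build_h2(h1, h1, start=1, end=2)
print(h2)
```

Sampling one labeled subject per training example keeps each training step cheap; at prediction time the same conditioning would be applied once for each predicted subject.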

As shown in fig. 1, in practical application, the automatic extraction method of medical text information based on deep neural network includes the following operation steps:

Step 1, a requirement for automatic extraction of medical text information is raised and the extraction process starts;

Step 2, medical data sets from previous years are collected;

Step 3, the relation entities are labeled, namely subject, predicate and object;

Step 4, the text is segmented into words and a word vector model is trained;

Step 5, the mixed word vectors are obtained;

Step 6, the sequence neural network entity relationship model is built;

Step 7, text is input and the entity relationships it contains are predicted;

Step 8, the prediction finishes and the information extraction operation on the medical text ends.

Example 1

After model training is complete, we take the following medical text information input as a test:

First, the input content: 1. bronchitis, emphysema; 2. lump in the left upper lung lobe, peripheral lung cancer considered; enlarged left pulmonary portal lymph node, metastasis considered; 3. changes in the right lung lobes, hypoplasia considered; 4. elastofibroma medial to the left scapula; 5. tracheal diverticulum; 6. low-density foci in the right lobe of the thyroid; thickened wall of the antrum; please correlate clinically. A roughly round lump shadow of about 2.0 x 3.0CM is seen in the left lung upper lobe, with a CT value of about 32HU; the CT values in the three-phase CT scan are 43HU, 53HU and 75HU respectively, and obstruction and stenosis of some bronchial branches can be seen; the volume of the right lung lobes is reduced, with a patchy high-density shadow containing slightly dilated bronchial shadows; the transparency of both lungs is increased, and multiple cystic translucent areas are seen in both lung fields; the markings of both lungs are sparse and disordered. The left pulmonary lymph node is slightly enlarged, about 1.4CM in diameter. There are multiple small lymph nodes in the mediastinum. Neither thoracic cavity shows signs of fluid accumulation. Calcification of the aorta and coronary arteries. A patchy soft tissue density shadow of approximately 2.2CM by 5.1CM is seen on the medial inferior side of the right scapula. A diverticulum of the trachea. The density of the right thyroid lobe is reduced, and its degree of enhancement is lower than that of normal thyroid tissue. The wall of the antrum is thickened.

Second, the extraction procedure:

1. After stop words are filtered out, the text is segmented with a Chinese word segmentation tool. The output is [ "bronchus", "tracheitis", "emphysema", "left lung", … ]

2. The trained word vector model is read to obtain the word vectors. The output is [ [0.001, 0.089, -0.201, … ], [0.121, -0.012, -0.314, … ], [-0.809, 0.121, 0.214, … ], … ]

3. Each word of the text is traversed and given a randomly initialized word vector. The output is [ [0.121, 0.251, -0.129, … ], [-0.901, -0.252, -0.124, … ], [0.124, 0.853, 0.982, … ], … ]

4. The mixed word vectors are obtained according to the preprocessing method. The output is [ [0.321, 0.261, -0.156, … ], [-0.081, -0.004, -0.094, … ], [0.024, -0.813, -0.782, … ], … ]

5. The mixed word vectors are input into the trained neural network model, which outputs the start position probabilities of the subject [0.002, 0.208, 0.1023, … ] and the end position probabilities of the subject [0.001, 0.001, 0.005, …, 0.238, 0.001]. Joining the position with the highest start probability to the position with the highest end probability gives the subject, the upper lobe of the left lung. The predicate and the object are obtained in the same way.

6. The final output is [ left superior lung lobe, left pulmonary portal lymph node, 2.0 x 3.0CM ]
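The subject decoding described in item 5 of the procedure above can be sketched as picking the most probable start and end positions and joining them. The token list and probabilities are toy values that only echo the figures quoted in that item, and restricting the end position to lie at or after the start is an added assumption not stated in the text.

```python
def decode_span(tokens, start_probs, end_probs):
    """Return the tokens between the most probable start and end positions."""
    start = max(range(len(tokens)), key=lambda i: start_probs[i])
    # Assumption: only consider end positions at or after the chosen start.
    end = max(range(start, len(tokens)), key=lambda i: end_probs[i])
    return tokens[start:end + 1]


# Toy tokens and per-position probabilities for illustration only.
tokens = ["the", "left", "lung", "upper", "lobe", "shows"]
start_probs = [0.002, 0.208, 0.1023, 0.01, 0.01, 0.01]
end_probs = [0.001, 0.001, 0.005, 0.01, 0.238, 0.001]
subject = decode_span(tokens, start_probs, end_probs)
print(" ".join(subject))
```

Taking a single maximum yields exactly one subject per text; a threshold on the probabilities would be needed instead if a text may contain several subjects.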

Third, the information extraction result:

left superior lung lobe (S: primary tumor site), left pulmonary portal lymph node (P: left pulmonary portal lymph node), 2.0 x 3.0CM (O: primary lesion size).

Example 2

Medical text input: Patient Zhang XX developed retracted nasal discharge 1 month ago without obvious cause, with no nasal obstruction, facial numbness, double vision, hearing loss, headache or similar symptoms. The patient visited the local XXX hospital for diagnosis and treatment, where nasopharyngoscopy was completed and a biopsy was taken: undifferentiated non-keratinizing carcinoma.

Information extraction result: Zhang XX (S: patient name), visit (P: relationship between patient and hospital), XXX hospital (O: name of the hospital visited).

The extraction procedure for this example follows that of Example 1.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A medical text information automatic extraction method based on a deep neural network is characterized in that historical accumulated extraction data are used as a labeling data set, a deep neural network model is built, medical insurance unstructured text data are input, and entity information and relationships set by specific medical insurance auditors are output.

2. The automatic extraction method of medical text information based on deep neural network as claimed in claim 1, wherein the method specifically comprises the following implementation steps:

3. The method as claimed in claim 2, wherein the multi-pattern matching algorithm is an Aho-Corasick automaton.

4. The method as claimed in claim 2, wherein in the model training stage, position encoding is combined to form the model input, denoted E; E is input into a 12-layer deep neural network model structure and a new output, denoted H1, is obtained through calculation; the H1 vector is passed into a self-attention layer and the start and end positions of S are then predicted; a labeled S is randomly sampled, and the corresponding sub-vectors of H1 are mapped and input into a bidirectional sequence model to obtain the encoding vector of S, which has the same length as the input sequence; after H1 is passed into another self-attention layer, the output vectors are concatenated and denoted H2; the concatenated H2 is passed into a convolutional layer and a fully connected layer; finally a double Sigmoid function structure is used to predict the positions of O and P; and the trained model is saved locally.