CN112685561A - Small sample clinical medical text post-structuring processing method across disease categories - Google Patents
Small sample clinical medical text post-structuring processing method across disease categories
- Publication number
- CN112685561A (application CN202011567629.6A)
- Authority
- CN
- China
- Prior art keywords
- text
- information
- model
- disease
- text information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 201000010099 disease Diseases 0.000 title claims abstract description 55
- 208000037265 diseases, disorders, signs and symptoms Diseases 0.000 title claims abstract description 55
- 238000003672 processing method Methods 0.000 title claims abstract description 9
- 238000012549 training Methods 0.000 claims abstract description 18
- 238000000034 method Methods 0.000 claims abstract description 14
- 238000002372 labelling Methods 0.000 claims abstract description 13
- 238000013528 artificial neural network Methods 0.000 claims abstract description 10
- 238000000605 extraction Methods 0.000 claims abstract description 6
- 238000012545 processing Methods 0.000 claims abstract description 4
- 239000013598 vector Substances 0.000 claims description 14
- 230000006870 function Effects 0.000 claims description 12
- 230000004913 activation Effects 0.000 claims description 11
- 239000011159 matrix material Substances 0.000 claims description 11
- 238000001914 filtration Methods 0.000 claims description 3
- 238000010801 machine learning Methods 0.000 abstract description 5
- 238000003058 natural language processing Methods 0.000 abstract description 5
- 238000010586 diagram Methods 0.000 description 7
- 238000011160 research Methods 0.000 description 5
- 230000007246 mechanism Effects 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000002457 bidirectional effect Effects 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 239000003814 drug Substances 0.000 description 1
- 230000010365 information processing Effects 0.000 description 1
- 230000010354 integration Effects 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 230000007774 longterm Effects 0.000 description 1
- 230000014759 maintenance of location Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000001537 neural effect Effects 0.000 description 1
- 238000007781 pre-processing Methods 0.000 description 1
- 230000000306 recurrent effect Effects 0.000 description 1
- 230000010076 replication Effects 0.000 description 1
- 230000006403 short-term memory Effects 0.000 description 1
- 238000010561 standard procedure Methods 0.000 description 1
Landscapes
- Machine Translation (AREA)
Abstract
The invention discloses a cross-disease-category post-structuring processing method for small-sample clinical medical texts, which comprises the following steps: acquiring small-sample text information of disease category A and large-sample text information of disease category B, obtaining the information to be labeled through text clustering based on text perplexity, and labeling it to obtain labeled text information; under the PyTorch neural network framework, training a category-question information extraction model with a meta-learning model and an LSTM model to obtain a meta-model; training the meta-model with the labeled text information to obtain a post-structuring model for small-sample medical record texts; and recognizing the text information of disease category A with the post-structuring model. Through this scheme, the method has the advantages of simple logic, a small labeling workload, comprehensive coverage and high processing efficiency, and has high practical and popularization value in the fields of Chinese natural language processing and machine learning.
Description
Technical Field
The invention relates to the field of Chinese natural language processing technology and machine learning, in particular to a small sample clinical medical text post-structuring processing method across disease categories.
Background
High-quality clinical medical research cannot proceed without the support of highly available language models; however, such models often require large amounts of high-quality annotated corpora. Clinical researchers therefore spend considerable time organizing patient data, marking out usable data from complex electronic medical record texts through time-consuming and tedious manual annotation, an extremely inefficient research workflow for already busy medical workers. In addition, traditional machine learning does not share knowledge across tasks, and its models are poorly portable.
Therefore, there is an urgent need for a cross-disease-category small-sample clinical medical text post-structuring processing method with a small labeling workload, comprehensive coverage and high efficiency.
Disclosure of Invention
In view of the above problems, an object of the present invention is to provide a cross-disease-category post-structuring processing method for small-sample clinical medical texts, so as to solve the prior-art problems of difficult data acquisition, low labeling efficiency, small coverage and high model-reuse difficulty. The technical scheme adopted by the present invention is as follows:
a small sample clinical medical text post-structuring processing method across disease categories comprises the following steps:
acquiring small-sample text information of disease category A and large-sample text information of disease category B, obtaining the information to be labeled through text clustering based on text perplexity, and labeling it to obtain labeled text information; the labeled text information comprises a standard question list, a target question list and a small-sample labeled corpus;
under the PyTorch neural network framework, training a category-question information extraction model with a meta-learning model and an LSTM model to obtain a meta-model;
training the meta-model with the labeled text information to obtain a post-structuring model for small-sample medical record texts;
and recognizing the text information of disease category A with the post-structuring model.
Further, acquiring the small-sample text information of disease category A and the large-sample text information of disease category B and obtaining the information to be labeled through text clustering based on text perplexity comprises the following steps:
respectively acquiring the small-sample text information of disease category A and the large-sample text information of disease category B;
standardizing the symbols of the disease-category-A and disease-category-B text information, and segmenting by paragraph, sentence and text type to obtain segmented text data;
converting the segmented text data into binary data;
using the BERT model, training on the binary data of disease category A and of disease category B in turn to obtain a BERT language model for each;
computing the perplexity of the disease-category-A and disease-category-B text information using the TensorFlow framework, and filtering out the sentences whose score exceeds a preset threshold to form a difference set (a sketch follows these steps);
computing the sentence vector of each sentence in the difference set with the BERT language model;
and clustering the sentence vectors with a hierarchical clustering algorithm to obtain the information to be labeled.
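To make the perplexity filtering concrete, the following is a minimal sketch assuming the HuggingFace transformers library and a masked-LM pseudo-perplexity score; the patent itself computes perplexity under the TensorFlow framework and uses a normalized threshold of 0.9, so the model name, threshold value and sample sentences below are illustrative assumptions only:

```python
import math
import torch
from transformers import BertTokenizer, BertForMaskedLM

# Stand-in for the per-disease-category BERT language model.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
model.eval()

def pseudo_perplexity(sentence: str) -> float:
    """Mask each token in turn, average the negative log-probability
    of the true token, and exponentiate to get a perplexity-like score."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    nlls = []
    for i in range(1, ids.size(0) - 1):          # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        nlls.append(-log_probs[ids[i]].item())
    return math.exp(sum(nlls) / len(nlls))

# Category-A sentences that the category-B model finds "surprising"
# form the difference set to be clustered and labeled.
THRESHOLD = 20.0                                  # illustrative value
candidates = ["患者主诉胸痛三天。", "患者左膝肿胀疼痛。"]
difference_set = [s for s in candidates if pseudo_perplexity(s) > THRESHOLD]
```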
Furthermore, the LSTM model adopts an input gate, a forgetting gate and an output gate which are connected in sequence.
Further, the forgetting gate satisfies the following relationship:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
wherein h_{t-1} represents the output of the previous cell, x_t represents the input of the current cell, σ represents the activation function, W_f represents the weight matrix of the forgetting gate, and b_f represents the bias term of the forgetting gate.
Still further, the input gate satisfies the following relationship:
C_t = f_t * C_{t-1} + i_t * C̃_t
wherein f_t is the output of the forgetting gate, i.e. the information the model discards from the cell state, C_{t-1} represents the old cell state, i_t is the input-gate gating, i.e. it controls how much of what was previously learned is kept at the current moment, and C̃_t represents what is learned at the current moment;
the expression for i_t is:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
wherein σ represents the activation function, W_i represents the weight matrix of the input-gate gating, h_{t-1} represents the output of the previous cell, x_t represents the input of the current cell, and b_i represents the bias term of the input gate;
the expression for C̃_t is:
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
wherein tanh represents the activation function, W_C represents the weight matrix used when learning new knowledge, h_{t-1} represents the output of the previous cell, x_t represents the input of the current cell, and b_C represents the bias term used when learning new knowledge.
Still further, the output gate satisfies the following relationship:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)
wherein W_o represents the weight matrix of the output gate, b_o represents the bias term of the output gate, and o_t indicates which parts of the cell state are to be output.
Furthermore, labeling the information to be labeled includes recording the question, the question type and a unique identifier.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention cleverly uses the perplexity between the small-sample text information of disease category A and the large-sample text information of disease category B to obtain the sentence vectors for clustering; by reusing historical models, the labeling workload is reduced and working efficiency is improved.
(2) In the invention, under the PyTorch neural network framework, the category-question information extraction model is trained with meta-learning and LSTM models to obtain the meta-model; the advantage is that the model can use prior knowledge and experience to guide the learning of a new task, giving it the ability to learn and improving training efficiency.
(3) The invention arranges an input gate, a forgetting gate and an output gate in the LSTM model; by introducing the concept of the cell state, the LSTM network can delete and add information to the cell state through these gate structures, thereby solving the long-term dependency problem.
(4) The invention uses Chinese natural language processing and machine learning technology, combined with the writing conventions and practical experience of clinical medical texts, to automatically extract structured data from small-sample clinical medical texts of different disease categories. It mainly provides an information processing tool for the integration of clinical research, solves the problems of difficult data acquisition, low labeling efficiency, small coverage and high model-reuse difficulty in current clinical research, and improves the utilization of clinical research data and the efficiency of model training.
In conclusion, the method has the advantages of simple logic, a small labeling workload, comprehensive coverage and high processing efficiency, and has high practical and popularization value in the fields of Chinese natural language processing and machine learning.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and therefore should not be considered as limiting the scope of protection; for those skilled in the art, other related drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a logic flow diagram of the present invention.
FIG. 2 is a flow chart of the Transformer encoder used in the present invention.
FIG. 3 is a schematic diagram of the BERT model structure of the present invention.
FIG. 4 is a schematic diagram of the input components of the BERT model of the present invention.
FIG. 5 is a diagram of a repeating module containing a single neural network layer, as in a standard RNN.
FIG. 6 is a diagram of the LSTM repeating module containing four neural network layers.
FIG. 7 is a schematic diagram of the line notation used in the LSTM network diagrams of the present invention.
FIG. 8 is a schematic diagram of the structure of cells in the LSTM network of the present invention.
Fig. 9 is a schematic view of a gate selectively passing information in the present invention.
Fig. 10 is a schematic view of the forgetting gate of the present invention.
FIG. 11 is a schematic view of an input gate of the present invention.
Fig. 12 is a schematic diagram of an output gate of the present invention.
Detailed Description
To further clarify the objects, technical solutions and advantages of the present application, the present invention will be further described with reference to the accompanying drawings and examples, and embodiments of the present invention include, but are not limited to, the following examples. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Examples
As shown in fig. 1 to fig. 12, the present embodiment provides a post-structured processing method of a small sample clinical medical text across disease categories, which mainly includes the following steps:
the method comprises the steps of firstly, obtaining small sample text information of A disease species and large sample text information of B disease species, adopting text clustering of text confusion to obtain information to be labeled, labeling the information to be labeled, and obtaining labeled text information. Specifically, the method comprises the following steps:
(1) Text preprocessing: standardize the symbols of each medical record text of the disease category, and segment by paragraph, sentence and text type;
(2) Text binarization: binarize the text data of the different disease categories following the google-BERT standard method;
(3) Model training: train a language model for each disease category with the standard BERT model. In this embodiment, BERT stands for Bidirectional Encoder Representations from Transformers; it uses the Transformer encoder as its main model structure, and the Transformer abandons the recurrent network structure of the RNN, modeling a text segment entirely with the attention mechanism. The core idea of the attention mechanism is to compute the correlation between each word in a sentence and all the words in that sentence, which reflects, to some extent, the relevance and relative importance of the different words in the sentence. BERT is an unsupervised, deep, bidirectional NLP pre-training system; "unsupervised" means that it can be trained with only a plain-text corpus. Unlike traditional language models, however, BERT does not predict the most likely current word from its context; instead, it randomly masks some words and predicts them using all the unmasked words, as shown in FIG. 2 and FIG. 3;
The main innovations of the BERT model lie in its pre-training method, which comprises two tasks: the Masked Language Model and Next Sentence Prediction;
The Masked Language Model can be understood as a fill-in-the-blank task: a fixed number of words in a sentence are randomly masked and then predicted from the context;
Because QA and NLI tasks in natural language require understanding the relationship between two sentences, Next Sentence Prediction lets the pre-trained model adapt better to such tasks;
In FIG. 4, the input part of BERT is a linear sequence: two sentences are divided by a separator, with the [CLS] symbol added at the beginning and [SEP] at the end. Each word has three embeddings: Segment Embeddings, Position Embeddings and Token Embeddings;
Token Embeddings represent the word vector of each word;
Segment Embeddings indicate which sentence each word of the sequence belongs to;
Position Embeddings encode the position information of each word;
(4) Perplexity: compute the text perplexity using the TensorFlow framework. The specific labeling steps are as follows:
(41) Cross-compare the small-sample text information of disease category A and the large-sample text information of disease category B by computing the PPL (perplexity) of each sentence under the language model of the other disease category;
(42) Set a threshold and filter out the sentences that differ most between disease categories; typically, the threshold is set to 0.9.
(5) Sentence vectors: call the BERT language model of the disease category to obtain the BERT sentence vector of each sentence.
(6) Hierarchical clustering: cluster the sentence vectors with a hierarchical clustering method, so that sentences expressing the same meaning are packed together for the physicians to label; a sketch of steps (5) and (6) follows.
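A minimal sketch of steps (5) and (6) follows, assuming the HuggingFace transformers library and SciPy; using the [CLS] hidden state as the sentence vector and the clustering cut threshold are assumptions, since the patent does not specify either:

```python
import torch
from transformers import BertTokenizer, BertModel
from scipy.cluster.hierarchy import linkage, fcluster

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")
encoder.eval()

def sentence_vector(sentence: str) -> torch.Tensor:
    # BERT sums the token, segment and position embeddings internally;
    # the [CLS] hidden state is taken here as the sentence vector.
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**enc)
    return out.last_hidden_state[0, 0]

sentences = ["患者主诉胸痛三天。", "胸痛持续三日。", "患者左膝肿胀。"]
vectors = torch.stack([sentence_vector(s) for s in sentences]).numpy()

# Hierarchical (agglomerative) clustering packs sentences expressing
# the same meaning together for the annotating physicians.
tree = linkage(vectors, method="average", metric="cosine")
clusters = fcluster(tree, t=0.3, criterion="distance")  # cut threshold illustrative
for sent, cid in zip(sentences, clusters):
    print(cid, sent)
```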
In this embodiment, the label of the electronic medical record includes the following fields: qid, query_type, context, query, ans and ans_span, wherein qid represents the unique identifier of each custom question; query_type represents the question type, of which there are two: class and text; context represents the text to be labeled; query represents the question posed about the text to be labeled; ans represents the answer of the text to be labeled to the question, where the answer to a class question is of Boolean type and the answer to a text question is of string type; and ans_span represents the coordinates of the answer to a text question within the text to be labeled.
In this embodiment, the data to be structured is imported, and the labeled data is stored in Excel;
when labeling data, the format requirements are as follows:
(1) Each sentence must correspond to at least one standard question, comprising a query, a query_type and a qid;
(2) Each question corresponds to a unique answer, where "0" means "no" and "1" means "yes";
(3) For text questions, the position in the original text corresponding to the answer must be marked, written in square brackets as [start position, end position]; an example record follows.
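Combining the field schema and the format requirements above, two hypothetical annotation records might look as follows; every concrete value is invented for illustration:

```python
# A class-type record: the answer is Boolean ("0" = no, "1" = yes).
record_class = {
    "qid": "Q001",                      # unique identifier of the question
    "query_type": "class",
    "context": "患者否认吸烟史。",        # text to be labeled
    "query": "Does the patient smoke?",
    "ans": "0",
}

# A text-type record: the answer is a string located in the context.
record_text = {
    "qid": "Q002",
    "query_type": "text",
    "context": "患者主诉胸痛三天。",
    "query": "How long has the chest pain lasted?",
    "ans": "三天",
    "ans_span": [6, 8],                 # [start, end) offsets; convention assumed
}
```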
In the second step, under the PyTorch neural network framework, the category-question information extraction model is trained with a meta-learning model and an LSTM model to obtain the meta-model.
In this embodiment, LSTM stands for Long Short-Term Memory, a design aimed specifically at the long-term dependency problem. All RNNs have the form of a chain of repeating neural network modules. In a standard RNN, this repeating module has a very simple structure, such as a single tanh layer, as shown in FIG. 5;
The LSTM has the same chain structure, but its repeating module is different: instead of a single neural network layer there are four, interacting in a very specific way; see FIG. 6 and FIG. 7;
In FIG. 7, each line carries an entire vector from the output of one node to the input of another. Circles represent pointwise operations, such as vector addition, while rectangles are learned neural network layers; lines merging together denote concatenation of vectors, and a line forking denotes that its content is copied and distributed to different locations;
LSTM core idea
The key to the LSTM is the cell state (FIG. 8 shows one cell) and the horizontal line that runs through the cell;
The cell state is similar to a conveyor belt: it runs directly along the entire chain with only a few minor linear interactions, so it is easy for information to flow along it unchanged.
With only that upper horizontal line, however, there is no way to add or delete information; this is done instead through structures called gates.
A gate selectively lets information pass; it consists mainly of a sigmoid neural layer and a pointwise multiplication operation, see FIG. 9.
Each element of the sigmoid layer's output (a vector) is a real number between 0 and 1, representing the weight (or share) of the corresponding information to let through: 0 means "let nothing through" and 1 means "let everything through".
The LSTM protects and controls information through three such structures: the input gate, the forgetting gate and the output gate.
Forgetting gate
The first step in the LSTM is to decide what information to discard from the cell state. This decision is made by the so-called forgetting gate layer:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
The gate reads h_{t-1} and x_t and outputs a value between 0 and 1 for each number in the cell state C_{t-1}: 1 means "keep completely" and 0 means "discard completely".
In the language-model example, the cell state may encode the gender of the current subject so that the correct pronouns can be chosen; when a new subject appears, the old one should be forgotten, as shown in FIG. 10;
wherein h_{t-1} is the output of the previous cell, x_t is the input of the current cell, and σ denotes the sigmoid function.
Input gate
The next step is to decide how much new information to add to the cell state. This is done in two steps: first, a sigmoid layer called the "input gate layer" decides which information to update; then a tanh layer generates a vector of candidate content to be added, C̃_t. The new cell state is:
C_t = f_t * C_{t-1} + i_t * C̃_t
wherein f_t is the output of the forgetting gate, i.e. the information the model discards from the cell state, C_{t-1} is the old cell state, i_t is the input-gate gating, i.e. it controls how much of what was previously learned is kept at the current moment, and C̃_t is what is learned at the current moment;
The expression for i_t is:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
wherein σ represents the activation function, W_i represents the weight matrix of the input-gate gating, h_{t-1} represents the output of the previous cell, x_t represents the input of the current cell, and b_i represents the bias term of the input gate;
The expression for C̃_t is:
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
wherein tanh represents the activation function, W_C represents the weight matrix used when learning new knowledge, h_{t-1} represents the output of the previous cell, x_t represents the input of the current cell, and b_C represents the bias term used when learning new knowledge.
To update the cell state, the old state C_{t-1} is multiplied by f_t, discarding the information chosen to be forgotten, and i_t * C̃_t is then added; this candidate content is scaled by how much we decided to update each state component. The previous steps decided what to do; this step actually carries it out, updating the old cell state C_{t-1} to the new state C_t.
In the language-model example, this is where the gender information of the old pronoun is actually discarded and the new information is added, according to the targets determined earlier; see FIG. 11.
Output gate
Finally, the output value must be determined. The output is based on the cell state, but in a filtered form: first, a sigmoid layer decides which parts of the cell state will be output; the cell state is then passed through tanh (yielding values between -1 and 1) and multiplied by the output of the sigmoid gate, so that only the selected parts are output.
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)
wherein W_o represents the weight matrix of the output gate, b_o represents the bias term of the output gate, and o_t indicates which parts of the cell state are to be output.
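The gate equations above combine into one cell update per time step. The sketch below transcribes them directly in PyTorch, written out explicitly instead of calling torch.nn.LSTM so that each line matches one formula in the text; the dimensions and random initialization are illustrative only:

```python
import torch

def lstm_cell(x_t, h_prev, c_prev,
              W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """One LSTM step following the gate equations in the text; each
    weight matrix acts on the concatenation [h_{t-1}, x_t]."""
    z = torch.cat([h_prev, x_t], dim=-1)
    f_t = torch.sigmoid(z @ W_f + b_f)        # forgetting gate
    i_t = torch.sigmoid(z @ W_i + b_i)        # input gate
    c_tilde = torch.tanh(z @ W_c + b_c)       # candidate content C̃_t
    c_t = f_t * c_prev + i_t * c_tilde        # cell-state update
    o_t = torch.sigmoid(z @ W_o + b_o)        # output gate
    h_t = o_t * torch.tanh(c_t)               # new hidden state
    return h_t, c_t

# Quick check with toy dimensions.
hidden, inp = 4, 3
weights = [torch.randn(hidden + inp, hidden) for _ in range(4)]
biases = [torch.zeros(hidden) for _ in range(4)]
h, c = torch.zeros(hidden), torch.zeros(hidden)
x = torch.randn(inp)
h, c = lstm_cell(x, h, c,
                 weights[0], biases[0], weights[1], biases[1],
                 weights[2], biases[2], weights[3], biases[3])
```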
In the third step, the meta-model is trained with the labeled text information to obtain the post-structuring model for small-sample medical records.
In the fourth step, the text information of disease category A is recognized with the post-structuring model. In this way, training on a small sample suffices to recognize the text information of disease category A.
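The patent does not name a specific meta-learning algorithm, so the sketch below uses a Reptile-style update purely as one plausible illustration of how a meta-model could be trained across disease categories under PyTorch; the function names, hyperparameters and task interface are all assumptions:

```python
import copy
import torch

def reptile_meta_train(model, tasks, inner_steps=5, inner_lr=1e-3,
                       meta_lr=0.1, epochs=10):
    """Reptile-style meta-training: adapt a copy of the model to each
    task (e.g. one disease category) and nudge the meta-parameters
    toward the adapted parameters. `tasks` yields data loaders of
    (inputs, targets) batches."""
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for task_loader in tasks:
            learner = copy.deepcopy(model)
            opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
            for _ in range(inner_steps):
                for x, y in task_loader:
                    opt.zero_grad()
                    loss_fn(learner(x), y).backward()
                    opt.step()
            # Meta-update: move meta-parameters toward the task-adapted ones.
            with torch.no_grad():
                for p, q in zip(model.parameters(), learner.parameters()):
                    p += meta_lr * (q - p)
    return model

# The returned meta-model is then fine-tuned by ordinary supervised
# training on the small labeled corpus of disease category A (step 3).
```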
The above embodiments are only preferred embodiments of the present invention and do not limit its scope of protection; all modifications made according to the principles of the present invention, and all non-inventive improvements based on the above embodiments, shall fall within the scope of protection of the present application.
Claims (7)
1. A cross-disease-category post-structuring processing method for small-sample clinical medical texts, characterized by comprising the following steps:
acquiring small-sample text information of disease category A and large-sample text information of disease category B, obtaining the information to be labeled through text clustering based on text perplexity, and labeling it to obtain labeled text information, the labeled text information comprising a standard question list, a target question list and a small-sample labeled corpus;
under the PyTorch neural network framework, training a category-question information extraction model with a meta-learning model and an LSTM model to obtain a meta-model;
training the meta-model with the labeled text information to obtain a post-structuring model for small-sample medical record texts;
and recognizing the text information of disease category A with the post-structuring model.
2. The cross-disease-category small-sample clinical medical text post-structuring processing method according to claim 1, wherein acquiring the small-sample text information of disease category A and the large-sample text information of disease category B and obtaining the information to be labeled through text clustering based on text perplexity comprises the following steps:
respectively acquiring the small-sample text information of disease category A and the large-sample text information of disease category B;
standardizing the symbols of the disease-category-A and disease-category-B text information, and segmenting by paragraph, sentence and text type to obtain segmented text data;
converting the segmented text data into binary data;
using the BERT model, training on the binary data of disease category A and of disease category B in turn to obtain a BERT language model for each;
computing the perplexity of the disease-category-A and disease-category-B text information using the TensorFlow framework, and filtering out the sentences whose score exceeds a preset threshold to form a difference set;
computing the sentence vector of each sentence in the difference set with the BERT language model;
and clustering the sentence vectors with a hierarchical clustering algorithm to obtain the information to be labeled.
3. The method for post-structured processing of small sample clinical medical text across disease categories according to claim 1 or 2, wherein the LSTM model employs an input gate, a forgetting gate and an output gate connected in sequence.
4. The method of claim 3, wherein the forgetting gate satisfies the following relationship:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
wherein h_{t-1} represents the output of the previous cell, x_t represents the input of the current cell, σ represents the activation function, W_f represents the weight matrix of the forgetting gate, and b_f represents the bias term of the forgetting gate.
5. The method of claim 3, wherein the input gate satisfies the following relationship:
C_t = f_t * C_{t-1} + i_t * C̃_t
wherein f_t is the output of the forgetting gate, i.e. the information the model discards from the cell state, C_{t-1} represents the old cell state, i_t is the input-gate gating, i.e. it controls how much of what was previously learned is kept at the current moment, and C̃_t represents what is learned at the current moment;
the expression for i_t being:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
wherein σ represents the activation function, W_i represents the weight matrix of the input-gate gating, h_{t-1} represents the output of the previous cell, x_t represents the input of the current cell, and b_i represents the bias term of the input-gate gating;
and the expression for C̃_t being:
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
wherein tanh represents the activation function, W_C represents the weight matrix used when learning new knowledge, h_{t-1} represents the output of the previous cell, x_t represents the input of the current cell, and b_C represents the bias term used when learning new knowledge.
6. The method of claim 4, wherein the output gate satisfies the following relationship:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)
wherein W_o represents the weight matrix of the output gate, b_o represents the bias term of the output gate, and o_t indicates which parts of the cell state are to be output.
7. The method of claim 1, wherein labeling the information to be labeled includes recording the question, the question type and a unique identifier.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011567629.6A CN112685561A (en) | 2020-12-26 | 2020-12-26 | Small sample clinical medical text post-structuring processing method across disease categories |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011567629.6A CN112685561A (en) | 2020-12-26 | 2020-12-26 | Small sample clinical medical text post-structuring processing method across disease categories |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112685561A (en) | 2021-04-20
Family
ID=75451821
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011567629.6A Pending CN112685561A (en) | 2020-12-26 | 2020-12-26 | Small sample clinical medical text post-structuring processing method across disease categories |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112685561A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114357144A (en) * | 2022-03-09 | 2022-04-15 | 北京大学 | Medical numerical extraction and understanding method and device based on small samples |
CN115660871A (en) * | 2022-11-08 | 2023-01-31 | 上海栈略数据技术有限公司 | Medical clinical process unsupervised modeling method, computer device, and storage medium |
CN117809792A (en) * | 2024-02-28 | 2024-04-02 | 神州医疗科技股份有限公司 | Method and system for structuring disease seed data during cross-disease seed migration |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1462950A1 (en) * | 2003-03-27 | 2004-09-29 | Sony International (Europe) GmbH | Method of analysis of a text corpus |
US20190267113A1 (en) * | 2016-10-31 | 2019-08-29 | Preferred Networks, Inc. | Disease affection determination device, disease affection determination method, and disease affection determination program |
CN109783604A (en) * | 2018-12-14 | 2019-05-21 | 平安科技(深圳)有限公司 | Information extracting method, device and computer equipment based on a small amount of sample |
CN109686445A (en) * | 2018-12-29 | 2019-04-26 | 成都睿码科技有限责任公司 | A kind of intelligent hospital guide's algorithm merged based on automated tag and multi-model |
CN110175329A (en) * | 2019-05-28 | 2019-08-27 | 上海优扬新媒信息技术有限公司 | A kind of method, apparatus, electronic equipment and storage medium that sample expands |
CN111783451A (en) * | 2020-06-30 | 2020-10-16 | 北京百度网讯科技有限公司 | Method and apparatus for enhancing text samples |
CN112116957A (en) * | 2020-08-20 | 2020-12-22 | 澳门科技大学 | Disease subtype prediction method, system, device and medium based on small sample |
Non-Patent Citations (1)
Title |
---|
Xi Sheryl Zhang et al., "MetaPred: Meta-Learning for Clinical Risk Prediction with Limited Patient Electronic Health Records", arXiv *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114357144A (en) * | 2022-03-09 | 2022-04-15 | 北京大学 | Medical numerical extraction and understanding method and device based on small samples |
CN114357144B (en) * | 2022-03-09 | 2022-08-09 | 北京大学 | Medical numerical extraction and understanding method and device based on small samples |
CN115660871A (en) * | 2022-11-08 | 2023-01-31 | 上海栈略数据技术有限公司 | Medical clinical process unsupervised modeling method, computer device, and storage medium |
CN117809792A (en) * | 2024-02-28 | 2024-04-02 | 神州医疗科技股份有限公司 | Method and system for structuring disease seed data during cross-disease seed migration |
CN117809792B (en) * | 2024-02-28 | 2024-05-03 | 神州医疗科技股份有限公司 | Method and system for structuring disease seed data during cross-disease seed migration |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111738003B (en) | Named entity recognition model training method, named entity recognition method and medium | |
CN110032648B (en) | Medical record structured analysis method based on medical field entity | |
CN109858041B (en) | Named entity recognition method combining semi-supervised learning with user-defined dictionary | |
KR102008845B1 (en) | Automatic classification method of unstructured data | |
CN112685561A (en) | Small sample clinical medical text post-structuring processing method across disease categories | |
US20050027664A1 (en) | Interactive machine learning system for automated annotation of information in text | |
CN107729309A (en) | A kind of method and device of the Chinese semantic analysis based on deep learning | |
Li et al. | UD_BBC: Named entity recognition in social network combined BERT-BiLSTM-CRF with active learning | |
CN109189862A (en) | A kind of construction of knowledge base method towards scientific and technological information analysis | |
CN112115721A (en) | Named entity identification method and device | |
CN111222318B (en) | Trigger word recognition method based on double-channel bidirectional LSTM-CRF network | |
CN112541337B (en) | Document template automatic generation method and system based on recurrent neural network language model | |
CN115587594B (en) | Unstructured text data extraction model training method and system for network security | |
CN111914556A (en) | Emotion guiding method and system based on emotion semantic transfer map | |
US20220156489A1 (en) | Machine learning techniques for identifying logical sections in unstructured data | |
CN113221569A (en) | Method for extracting text information of damage test | |
Zhang et al. | Effective character-augmented word embedding for machine reading comprehension | |
CN114756681A (en) | Evaluation text fine-grained suggestion mining method based on multi-attention fusion | |
CN111191439A (en) | Natural sentence generation method and device, computer equipment and storage medium | |
CN114091406A (en) | Intelligent text labeling method and system for knowledge extraction | |
AU2019101147A4 (en) | A sentimental analysis system for film review based on deep learning | |
CN116108840A (en) | Text fine granularity emotion analysis method, system, medium and computing device | |
CN115659981A (en) | Named entity recognition method based on neural network model | |
CN114510943A (en) | Incremental named entity identification method based on pseudo sample playback | |
CN112347784A (en) | Cross-document entity identification method combined with multi-task learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20210420