CN116737924B

CN116737924B - Medical text data processing method and device

Info

Publication number: CN116737924B
Application number: CN202310478699.1A
Authority: CN
Inventors: 李琴; 杨斌; 文治中; 宋黎晓
Original assignee: Baiyang Intelligent Technology Group Co ltd
Current assignee: Baiyang Intelligent Technology Group Co ltd
Priority date: 2023-04-27
Filing date: 2023-04-27
Publication date: 2024-06-25
Anticipated expiration: 2043-04-27
Also published as: CN116737924A

Abstract

The invention relates to a medical text data processing method and device. The method comprises the following steps: fine-tuning the Chinese medical pre-trained model MC-BERT on collected public medical information extraction datasets to obtain a relatively robust language model; segmenting an input text into a token set of length N using character-granularity tokenization, constructing an N×N token span matrix, predicting the head and tail positions of medical entities from the matrix, and identifying the text range corresponding to each entity; and feeding entity pairs that have a medical relation into a multi-relation classifier that fuses distance awareness, determining the medical entity relations, and outputting a structured result. The invention uses deep-learning-based natural language understanding technology to let a machine read and understand medical texts and automatically extract a large number of professional medical entities and relations, thereby markedly improving the efficiency and quality of clinical medical research, which is of great significance for building specialized hospital databases.

Description

Medical text data processing method and device

Technical Field

The invention belongs to the technical field of information processing, and particularly relates to a method and a device for processing medical texts by using an artificial intelligence technology.

Background

Artificial intelligence (AI) refers to the intelligence exhibited by machines made by humans, and generally means intelligence implemented on a general-purpose computer. Artificial intelligence includes weak artificial intelligence and strong artificial intelligence. Weak artificial intelligence (also known as narrow artificial intelligence) is generally understood to mean artificial intelligence technology focused on solving a problem in a particular field, and may also be regarded as a technical tool applied to that field.

Natural language processing is an important branch of narrow artificial intelligence. It focuses on the processing and application of natural language and has been widely applied in human-machine interaction. Natural language processing covers information retrieval, information extraction, machine translation, text reading, word segmentation, part-of-speech tagging, automatic summarization, and similar fields.

In practical healthcare big data applications, medical records written by doctors in natural language can be analyzed using the word segmentation and tagging techniques of natural language processing, and information such as a patient's symptoms, diagnosis and treatment information, and events can be extracted from them. Acquiring and standardizing this information plays an important role in doctors' clinical research, in building AI-assisted diagnosis and treatment systems, and in other applications.

Medical text data contains abundant medical information. Structuring medical text means performing structured analysis on irregular medical texts, represented by electronic medical records and test reports, so that, in combination with clinical medical entity concepts, a machine automatically extracts the key information a user wants from the text. This information helps support application scenarios such as clinical academic research, medical knowledge graph construction, and clinical decision support. However, the vast amount of medical text is neither understandable nor computable by machines, and because of its complexity and specialized nature, medical researchers must expend significant effort to extract valid information from it. To use these data more efficiently and extract information from medical text accurately, a technology for structuring medical text is urgently needed.

In existing schemes, entity and relation recognition for medical text mainly uses joint entity-relation extraction models: the entity recognition task and the relation extraction task are modeled jointly, model parameters are shared through a shared encoder, and entity triples with relations are obtained directly. Such schemes generally encode the text with BiLSTM or a general Chinese pre-trained BERT, and overlook the importance of domain transfer of the pre-trained model using medical text; a language model fine-tuned on a large medical corpus contains abundant medical prior knowledge and has better feature expression capability than a pre-trained model trained on general corpora. Second, such schemes often ignore nested medical entities. For example, "right lung occupation" denotes a lesion type, while "right lung" within it denotes a body part; the two entities of different types are nested, and existing schemes fail in this nested case. As for medical relation recognition, existing schemes have poor flexibility: the relation classifier cannot be quickly customized for different relation schemas, which limits the extensibility of the model.

Disclosure of Invention

To address the problems in the prior art, the invention provides a medical text structuring method and device that use natural language understanding technology, combining a medical pre-trained model with a distance-aware relation classifier, to accurately extract key information from medical texts and form structured data.

In order to achieve the above object, the present invention provides a medical text structuring method, comprising the steps of:

Constructing a training set from the collected public medical information extraction datasets, and fine-tuning the Chinese medical pre-trained model MC-BERT to complete the domain transfer of its parameters;

Segmenting a clinical medical text into tokens based on the fine-tuned MC-BERT to obtain a token set of length N, where N is a natural number, and constructing an N×N span matrix; feeding the segmented medical text into MC-BERT to obtain encoding vectors, determining the text range corresponding to each medical entity from the start and end positions in the matrix, and extracting the medical entities;

Performing relation discrimination on entity pairs that have a medical relation, using a multi-class classifier based on a fully connected layer, and extracting the medical entity relations;

And fusing the extracted medical entity and the medical entity relationship.

Preferably, the public medical information extraction datasets are the CHIP2020 Chinese medical text named entity recognition dataset and Chinese medical entity relation extraction dataset, and the CCKS2020 medical named entity recognition dataset and medical entity and attribute extraction dataset.

Preferably, the method for fine-tuning the Chinese medical pre-trained model comprises the following: all the public medical information extraction datasets are sequence-labeled using the BIOES encoding scheme, in which B-Type denotes the beginning of an entity, I-Type the middle of an entity, O a non-entity part, E-Type the end of an entity, S-Type a single-character entity, and Type the corresponding medical entity type. When an entity of another type, Type-b, is nested within a medical entity of a certain type, Type-a, the two entity types having the nesting relation are combined pairwise by merging label layers, generating a new entity type label Type-a|Type-b. MC-BERT is then fine-tuned on the uniformly sequence-labeled data with a named entity recognition task as the learning objective, yielding a new language model after domain transfer.

Preferably, the clinical medical text data is preprocessed, cleaned, and long texts are cut; tokenization is performed using the dictionary file of the BERT model to obtain a token set of length N, and an N×N token span matrix is constructed for encoding entity labels, where an entry of the matrix is span[start][end] = C, in which [start][end] denotes the start-stop range of the text corresponding to a medical entity, C denotes the entity category, and C = 0 denotes non-entity text; using the fine-tuned MC-BERT as the embedding layer, the entity type logits score of the text segment corresponding to span[start][end] is obtained, and a segment whose score is greater than the threshold α is regarded as a valid entity.

Preferably, the relations between the labeled valid entities are determined by the following formula:

where M denotes the total number of entity relation categories, p_i denotes the context vector represented by the i-th entity pair, d_i denotes the relative distance feature vector of the i-th entity pair, and the symbol ° denotes the vector concatenation operation.

Preferably, the context vector represented by the entity pair is:

where the head and tail feature vectors of the head entity in the i-th entity pair, and the head and tail feature vectors of the tail entity in the i-th entity pair, are obtained from the token-set encoding vector X_N. The method further comprises: constructing positive and negative samples to guide the model to learn the implicit relations between medical entity pairs, ensuring that the model judges only entity pairs with a factual medical relation.

Preferably, the relative distance feature vector between the entity pairs is:

d_i = Linear(|s_i2 - e_i1|)    (3)

where s_i2 and e_i1 denote the feature vectors, in the BERT position encoding (position embedding), of the tail entity and the head entity of the i-th entity pair respectively; the absolute value of their difference represents the relative position of the two medical entities in the pair, and Linear(·) denotes a further nonlinear mapping of the pair's position vector through a fully connected layer.

Preferably, the extracted medical entities and medical entity relations are traversed and medical entities whose text is overly long are removed; entity pairs with a medical relation are visualized and stored in the format {head entity-medical relation, tail entity}, and independently existing medical entities are visualized and stored in the format {entity type, entity value}.

The invention also provides a medical text structuring device, which comprises:

The data preprocessing module is used for cleaning and processing the input medical text;

the medical entity extraction module is used for inputting the cleaned medical text into the fine-tuned natural language recognition model and extracting the text segments corresponding to medical entities;

the medical entity relation extraction module is used for extracting the factual relations between medical entity pairs using a distance-aware relation classifier;

the two-stage result fusion module is used for fusing the results of the medical entities and the medical entity relations and displaying them.

Compared with the prior art, the invention has the following advantages and positive effects:

The invention provides a medical text structuring method that focuses on the ability of a pre-trained language model to extract features from text. In view of the characteristics of the medical text structuring task, it uses medical information extraction datasets, with named entity recognition as the entry point, to fine-tune a Chinese medical pre-trained model and thereby adapt the language model to the domain. After the fine-tuned pre-trained model is obtained, entity labels are encoded using a token span matrix, which ensures that nested entities can be recognized; a distance-aware entity relation classifier learns the contextual relations among entities, and constructing positive and negative samples ensures that the model judges only entity pairs with a factual medical relation; and the structured content is output by fusing the results of the two stages, which improves the data utilization efficiency of clinical medical texts.

Drawings

FIG. 1 is a flow chart of a medical text structuring method according to an embodiment of the present invention;

FIG. 2 is a block diagram of a medical text structuring device according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the BIOES encoding scheme according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of token span matrix entity labels according to an embodiment of the present invention.

Detailed Description

The various aspects of the invention are described in detail below with reference to the drawings and detailed description. It will be apparent that the described embodiments are some, but not all, examples of embodiments of the invention. Elements, structures, and features of one embodiment may be beneficially incorporated in other embodiments without further recitation.

The medical text structuring method of the embodiment of the invention, as shown in fig. 1, comprises the following steps:

Step S1: using the collected public medical information extraction datasets, fine-tuning the Chinese medical pre-trained model MC-BERT with a named entity recognition task to obtain a domain-adapted pre-trained language model. Specifically, before the Chinese medical pre-trained model MC-BERT is fine-tuned, the method includes the following:

The public medical information extraction datasets are the CHIP2020 Chinese medical text named entity recognition dataset and Chinese medical entity relation extraction dataset, and the CCKS2020 medical named entity recognition dataset and medical entity and attribute extraction dataset.

All collected public medical information extraction datasets are sequence-labeled using the BIOES encoding scheme, in which B-Type denotes the beginning of an entity, I-Type the middle of an entity, O a non-entity part, E-Type the end of an entity, S-Type a single-character entity, and Type the corresponding medical entity type. The entity type labels mainly comprise: specific part of the affected area (Body part), presence or absence of obvious patient indicators (Symptom), growth and development indicators (BMI), specific position of the affected area (Direction), disease name (Disease), presence or absence of sampling data (Sample), disease progress (Change), attribute feature (Feature), stimulus element (Incentive), time (Time), and disease stage (Degree). A symptom entity type may be prefixed with a "-" sign to indicate that the patient does not have the symptom or sign, and relations between entities are expressed as ordered pairs. The steps for obtaining symptoms and attributes using BIOES are as follows:

Extracting the entities in the medical information using the named entity recognition and relation extraction annotations of the collected public medical information, and marking negative symptoms;

Determining the attribute corresponding to the entity by taking the specific part of the affected part, obvious patient indexes, growth and development indexes and sampling data as the entity;

extracting specific position and attribute characteristics of an affected part based on the presence or absence of obvious affected part indexes;

Based on the presence or absence of obvious patient indexes, extracting time, sampling data, the stage of the disease, the progress condition of the disease and stimulation elements;

Extracting the progress of the disease and the stimulation factors based on the presence or absence of obvious disease indexes;

extracting attribute features and stimulus elements based on whether sampling data exists or not;

and merging and de-duplication processing is carried out on the extracted entities and attributes.

Specifically, in the actual labeling process, when an entity of another type, Type-b, is nested within a medical entity of a certain type, Type-a, the two entity types having the nesting relation are combined pairwise by merging label layers, generating a new entity type label Type-a|Type-b. For example, as shown in fig. 3, in the text "the patient has double lung nodules", "double lung nodules" is of the lesion entity type and "double lung" is of the site entity type, so "double lung" is labeled with the merged labels "B-site|B-lesion, E-site|I-lesion".
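The label-layer merging described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names `bioes_layer` and `merge_layers`, the inclusive span convention, and the rule that the inner (shorter) entity is written first are assumptions chosen so that the output matches the "B-site|B-lesion, E-site|I-lesion" example. The character sequence 双肺结节 is an assumed Chinese original of "double lung nodules".

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end inclusive, entity type)


def bioes_layer(n_tokens: int, span: Span) -> List[str]:
    """Return one BIOES tag layer that marks a single entity span."""
    start, end, etype = span
    tags = ["O"] * n_tokens
    if start == end:
        tags[start] = f"S-{etype}"
    else:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
        tags[end] = f"E-{etype}"
    return tags


def merge_layers(n_tokens: int, spans: List[Span]) -> List[str]:
    """Merge per-entity BIOES layers token by token with '|'.

    Assumption: shorter (inner) entities are written first in the merged label.
    """
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], s[0]))
    layers = [bioes_layer(n_tokens, s) for s in ordered]
    merged = []
    for i in range(n_tokens):
        labels = [layer[i] for layer in layers if layer[i] != "O"]
        merged.append("|".join(labels) if labels else "O")
    return merged


# Assumed character sequence for "double lung nodules": 4 characters.
tokens = list("双肺结节")
spans = [(0, 3, "lesion"), (0, 1, "site")]  # lesion covers all 4, site the first 2
tags = merge_layers(len(tokens), spans)
for tok, tag in zip(tokens, tags):
    print(tok, tag)
```

The first two characters receive the merged labels, while the remaining characters keep only the lesion layer. The text combines nested types pairwise; the sketch would also merge deeper nesting, which goes beyond what the text states.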

MC-BERT is a BERT natural language understanding model trained on large-scale Chinese medical corpora such as Chinese medical question answering data, Chinese medical encyclopedias, and Chinese electronic medical records, so that a great deal of medical knowledge has been injected into the model. MC-BERT is then fine-tuned on the uniformly sequence-labeled data with a named entity recognition task as the learning objective, yielding a new language model after domain transfer that is better suited to the information extraction task.

Step S2: preprocessing the clinical medical text data, cleaning it and cutting long texts; performing tokenization using the vocabulary dictionary shipped with the BERT model to obtain a token set of length N, and constructing an N×N span matrix for encoding entity labels; and, using the fine-tuned MC-BERT as the embedding layer, obtaining the entity type logits score of the text segment corresponding to each span matrix entry, where a segment whose score is greater than the threshold α is regarded as a valid entity.
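The cleaning, cutting, and segmentation part of step S2 might look roughly like the following Python sketch. The helper names `clean`, `cut`, and `tokenize` are hypothetical, and the tokenizer is a deliberate simplification: it splits Chinese text into single characters but keeps runs of Latin letters and digits whole, whereas a real BERT tokenizer would split such runs into subwords by looking them up in vocab.txt. Room for special tokens such as [CLS] and [SEP] is not modelled.

```python
import re
import unicodedata
from typing import List

MAX_LEN = 512  # upper limit quoted in the text


def clean(text: str) -> str:
    """Drop control/format characters and the U+FFFD replacement character."""
    return "".join(
        ch for ch in text
        if ch != "\ufffd" and not unicodedata.category(ch).startswith("C")
    )


def cut(text: str, max_len: int = MAX_LEN) -> List[str]:
    """Cut text into consecutive paragraphs of at most max_len characters."""
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]


# One CJK character, or a run of letters, or a run of digits, or any other
# single non-space character (punctuation and the like).
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z]+|\d+|\S")


def tokenize(text: str) -> List[str]:
    """Character-level tokens for Chinese; whole runs for letters and digits."""
    return _TOKEN.findall(text)


raw = "患者2020年1月CT检查\x00示双肺结节\ufffd"  # illustrative input with debris
paragraphs = cut(clean(raw))
tokens = tokenize(paragraphs[0])
print(len(tokens), tokens)
```

The length N of the resulting token list is what determines the size of the N×N span matrix built next.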

Specifically, the clinical medical text data is preprocessed by removing illegal garbled characters; if the text length exceeds the upper limit of 512 supported by BERT, the long text is cut into segments of length 512 to obtain several data paragraphs. Based on the vocab.txt file shipped with BERT, Chinese characters appearing in the medical text are segmented at character granularity, while medical English characters and numerals are segmented into subwords. The token set of length N obtained after segmentation is used to construct an N×N span matrix; the span matrix covers every possible segment of the input text, so that nested entities can be represented and are no longer missed. For example, the text "right lung occupation" shown in fig. 4 is segmented to construct a 4×4 token span matrix. In span[0][1] = bod, [0][1] denotes the start-stop range of the corresponding text, namely "right lung", whose actual type is "body"; in span[0][3] = dis, [0][3] denotes the start-stop range of the corresponding text, namely "right lung occupation", whose actual type is "dis"; all other non-entity positions are set to 0. Using the fine-tuned MC-BERT as the embedding layer, the token-set encoding vector X_N is obtained, and two vectors are derived from it through nonlinear transformation. The inner product of the two is taken as the logits value of the span matrix to evaluate the entity type score of the text segment corresponding to span[start][end]; a segment whose score is greater than the threshold α, which is empirically set to 0.5, is regarded as a valid entity.
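As a rough illustration of the span matrix and the thresholded inner-product scoring, the sketch below encodes the "right lung occupation" example and decodes spans whose score exceeds α = 0.5. Several things here are assumptions rather than the patented implementation: the integer ids for "bod" and "dis", the use of one pair of start and end vectors per entity type, the names `build_span_matrix`, `span_logits`, and `decode`, and the hand-set toy vectors that stand in for the transformed MC-BERT outputs. The characters 右肺占位 are an assumed Chinese original of "right lung occupation".

```python
from typing import Dict, List, Tuple

Vector = List[float]


def build_span_matrix(n: int, entities: List[Tuple[int, int, str]],
                      label_ids: Dict[str, int]) -> List[List[int]]:
    """span[start][end] = C for an entity of category C; 0 means non-entity."""
    span = [[0] * n for _ in range(n)]
    for start, end, etype in entities:
        span[start][end] = label_ids[etype]
    return span


def span_logits(q: List[Vector], k: List[Vector]) -> List[List[float]]:
    """logits[i][j] = inner product of start vector q[i] and end vector k[j].

    Positions with end < start are not valid spans and get -inf.
    """
    n = len(q)
    return [[sum(a * b for a, b in zip(q[i], k[j])) if j >= i else float("-inf")
             for j in range(n)] for i in range(n)]


def decode(per_type: Dict[str, Tuple[List[Vector], List[Vector]]],
           alpha: float = 0.5) -> List[Tuple[int, int, str]]:
    """Keep every (start, end, type) whose logits score exceeds alpha."""
    found = []
    for etype, (q, k) in per_type.items():
        for i, row in enumerate(span_logits(q, k)):
            for j, score in enumerate(row):
                if score > alpha:
                    found.append((i, j, etype))
    return sorted(found)


tokens = list("右肺占位")  # assumed 4-character original of the example
label_ids = {"bod": 1, "dis": 2}  # hypothetical category ids
span = build_span_matrix(4, [(0, 1, "bod"), (0, 3, "dis")], label_ids)

# Toy start/end vectors, hand-set for illustration (not model outputs).
zero, one = [0.0, 0.0], [1.0, 0.0]
per_type = {
    "bod": ([one, zero, zero, zero], [zero, one, zero, zero]),
    "dis": ([one, zero, zero, zero], [zero, zero, zero, one]),
}
found = decode(per_type)
texts = ["".join(tokens[i:j + 1]) for i, j, _ in found]
```

Because each (start, end) cell is scored independently, the nested pair "right lung" and "right lung occupation" can both be recovered, which a single flat tag sequence cannot express without merged labels.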

Step S3: performing relation discrimination on entity pairs that have a medical relation, using a multi-class classifier based on a fully connected layer, and extracting the medical entity relations.

Specifically, the labeled medical entities are paired to construct a training set: entity pairs with a factual medical relation are defined as positive samples, and entity pairs without a medical relation, obtained by random sampling, are defined as negative samples, which ensures that the model judges only entity pairs with a factual medical relation. The relation between the entities of a pair is determined by the following formula:

The context vector represented by the entity pair is:

where the head and tail feature vectors of the head entity in the i-th entity pair, and the head and tail feature vectors of the tail entity in the i-th entity pair, are obtained from the token-set encoding vector X_N.
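The construction of positive and negative training pairs described at the start of step S3 can be sketched as follows. The function name `build_pairs`, the negatives-per-positive ratio, the "none" label for negative pairs, and the fixed random seed are all assumptions for illustration; the text only states that pairs without a medical relation are randomly sampled as negatives.

```python
import random
from itertools import permutations
from typing import Dict, List, Tuple

Entity = Tuple[str, str]         # (entity type, entity text)
Pair = Tuple[int, int, str]      # (head index, tail index, relation label)


def build_pairs(entities: List[Entity],
                gold: Dict[Tuple[int, int], str],
                neg_per_pos: int = 1,
                seed: int = 0,
                no_relation: str = "none") -> Tuple[List[Pair], List[Pair]]:
    """Return (positives, negatives) for training the relation classifier.

    Positives are the annotated (head, tail) pairs; negatives are randomly
    sampled ordered pairs that carry no annotated relation.
    """
    positives = [(h, t, rel) for (h, t), rel in sorted(gold.items())]
    candidates = [(h, t) for h, t in permutations(range(len(entities)), 2)
                  if (h, t) not in gold]
    rng = random.Random(seed)
    k = min(len(candidates), neg_per_pos * len(positives))
    negatives = [(h, t, no_relation) for h, t in rng.sample(candidates, k)]
    return positives, negatives


entities = [("date", "January 2020"), ("examination means", "CT"),
            ("lesion", "double lung nodules")]
gold = {(1, 0): "examination date"}  # head: CT, tail: January 2020
positives, negatives = build_pairs(entities, gold)
```

Sampling ordered pairs means that (CT, date) and (date, CT) are treated as distinct candidates, which fits the text's statement that relations are expressed as ordered pairs.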

The relative distance feature vector between the entity pairs is:

d_i = Linear(|s_i2 - e_i1|)    (3)

where s_i2 and e_i1 denote the feature vectors, in the BERT position encoding (position embedding), of the tail entity and the head entity of the i-th entity pair respectively; the absolute value of their difference represents the relative position of the two medical entities in the pair, and Linear(·) denotes a further nonlinear mapping of the pair's position vector through a fully connected layer. The mapped position vector keeps the same dimension as the entity vector, and feature fusion is completed by concatenation.
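To make the data flow of the distance-aware classifier concrete, here is a small pure-Python sketch. Formula (3) is implemented as written. Two parts are assumptions, because the classification formula and the context vector formula are not reproduced in this text: the classifier is taken to be a fully connected layer followed by softmax over the M relation categories applied to the concatenation p_i ° d_i, and p_i is taken to be the concatenation of the four entity boundary vectors. All weights and input vectors are random stand-ins, not trained parameters or MC-BERT outputs.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def linear(w: Matrix, b: Vector, x: Vector) -> Vector:
    """Fully connected layer: y = W x + b."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bj for row, bj in zip(w, b)]


def softmax(z: Vector) -> Vector:
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def distance_feature(s_i2: Vector, e_i1: Vector, w_d: Matrix, b_d: Vector) -> Vector:
    """d_i = Linear(|s_i2 - e_i1|), formula (3)."""
    diff = [abs(a - b) for a, b in zip(s_i2, e_i1)]
    return linear(w_d, b_d, diff)


def classify(p_i: Vector, d_i: Vector, w_c: Matrix, b_c: Vector) -> Vector:
    """Assumed classifier: softmax over M classes from the concatenation p_i ° d_i."""
    return softmax(linear(w_c, b_c, p_i + d_i))


def rand_matrix(rng: random.Random, rows: int, cols: int) -> Matrix:
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
DIM, M = 2, 3  # toy feature size and number of relation categories

# Stand-ins for encoder outputs: boundary vectors of the head and tail entity.
head_start, head_end, tail_start, tail_end = (
    [rng.random() for _ in range(DIM)] for _ in range(4))
p_i = head_start + head_end + tail_start + tail_end  # assumed form of p_i

# Stand-ins for position embeddings of the tail start and the head end.
s_i2 = [rng.random() for _ in range(DIM)]
e_i1 = [rng.random() for _ in range(DIM)]
d_i = distance_feature(s_i2, e_i1, rand_matrix(rng, DIM, DIM), [0.0] * DIM)

probs = classify(p_i, d_i, rand_matrix(rng, M, len(p_i) + len(d_i)), [0.0] * M)
predicted = max(range(M), key=probs.__getitem__)
```

Mapping the distance feature to the same dimension as an entity vector, as the text requires, keeps the position signal on a comparable scale to the entity features before concatenation.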

Step S4: traversing the extracted medical entities and medical entity relations, removing medical entities whose text is overly long, visualizing and storing entity pairs with a medical relation in the format {head entity-medical relation, tail entity}, and visualizing and storing independently existing medical entities in the format {entity type, entity value}. For example, from the text "the patient's CT examination in January 2020 showed double lung nodules", steps S2 and S3 extract (date, January 2020), (examination means, CT), and (lesion, double lung nodules). The relation between "date" and "examination means" is "examination date", which is formatted as {CT-examination date, January 2020}; the entity "lesion" exists independently and has no medical relation with other entities, so it is formatted as {lesion, double lung nodules}.
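The two-stage result fusion of step S4 can be sketched as below, using the worked example from the text. The function name `fuse`, the index-based relation tuples, and the length limit of 50 characters are hypothetical; the text does not state what counts as overly long.

```python
from typing import List, Tuple

Entity = Tuple[str, str]          # (entity type, entity value)
Relation = Tuple[int, str, int]   # (head index, relation, tail index)


def fuse(entities: List[Entity], relations: List[Relation],
         max_len: int = 50) -> List[str]:
    """Format related pairs and stand-alone entities as structured records."""
    # Drop entities whose text is overly long (max_len is an assumed limit).
    kept = {i for i, (_, value) in enumerate(entities) if len(value) <= max_len}
    records = []
    linked = set()
    for head, rel, tail in relations:
        if head in kept and tail in kept:
            records.append(f"{{{entities[head][1]}-{rel}, {entities[tail][1]}}}")
            linked.update((head, tail))
    # Entities that take part in no relation are stored on their own.
    for i in sorted(kept - linked):
        etype, value = entities[i]
        records.append(f"{{{etype}, {value}}}")
    return records


entities = [("date", "January 2020"), ("examination means", "CT"),
            ("lesion", "double lung nodules")]
relations = [(1, "examination date", 0)]
records = fuse(entities, relations)
for record in records:
    print(record)
```

Entities that already appear inside a relation record are not repeated as stand-alone records, which mirrors the example where only the lesion is stored independently.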

In summary, the invention provides a medical text structuring method, which can automatically perform structuring extraction on an input medical text to obtain a large number of professional medical entities and relations, and remarkably improve the efficiency and quality of medical clinical scientific research.

Embodiment 2: referring to fig. 2, this embodiment provides a medical text structuring device. The functional modules are described in detail as follows:

Specifically, the medical entity extraction module uses the medical pre-trained model MC-BERT after domain transfer as the embedding layer to judge whether the text range corresponding to each token span matrix index is a predefined medical entity;

Specifically, the medical entity relation extraction module trains the model on the constructed positive and negative sample pairs, fuses entity position feature vectors during learning, and recognizes the relations between entities using a multi-class classifier.

Further, the medical text structuring device also comprises a labeling module, which is used for labeling the entities and relations of clinical medical text data.

The above-described embodiments are intended to illustrate the present invention, not to limit it, and any modifications and variations made to the present invention within the spirit of the present invention and the scope of the appended claims should be construed as falling within the scope of the present invention.

Claims

1. A medical text data processing method, the method comprising:

constructing a training set from the collected public medical information extraction datasets, and fine-tuning a Chinese medical pre-trained model MC-BERT to complete the domain transfer of its parameters;

segmenting a clinical medical text based on the fine-tuned MC-BERT to obtain a token set of length N, where N is a natural number, and constructing an N×N matrix; then feeding the segmented medical text into MC-BERT to obtain encoding vectors, inferring the text range corresponding to each medical entity from the position coordinates of the matrix, and extracting the medical entities;

performing relation discrimination on entity pairs that have a medical relation, using a multi-class classifier based on a fully connected layer, and extracting the medical entity relations;

performing result fusion on the extracted medical entity and the medical entity relationship;

wherein the method for fine-tuning the Chinese medical pre-trained model comprises: performing sequence labeling on all collected public medical information extraction datasets using the BIOES encoding scheme, in which B-Type denotes the beginning of an entity, I-Type the middle of an entity, O a non-entity part, E-Type the end of an entity, S-Type a single-character entity, and Type the corresponding medical entity type; when an entity of another type, Type-b, is nested within a medical entity of a certain type, Type-a, combining the two entity types having the nesting relation pairwise by merging label layers to generate a new entity type label Type-a|Type-b; and fine-tuning MC-BERT on the uniformly sequence-labeled data with a named entity recognition task as the learning objective, so as to obtain a new language model after domain transfer;

wherein extracting the medical entities specifically comprises: preprocessing the clinical medical text data, cleaning it and cutting long texts; performing tokenization using the dictionary file of the BERT model to obtain a token set of length N, and constructing an N×N span matrix for encoding entity labels, where an entry of the matrix is span[start][end] = C, in which [start][end] denotes the start-stop range of the text corresponding to a medical entity, C denotes the entity category, and C = 0 denotes non-entity text; and, using the fine-tuned MC-BERT as the embedding layer, obtaining the entity type logits score of the text segment corresponding to span[start][end], where a segment whose score is greater than a threshold α is regarded as a valid entity;

and determining the relations between the labeled valid entities by the following formula:

wherein M denotes the total number of entity relation categories, p_i denotes the context vector represented by the i-th entity pair, d_i denotes the relative distance feature vector of the i-th entity pair, and the symbol ° denotes the vector concatenation operation;

The context vector represented by the entity pair is:

wherein the head and tail feature vectors of the head entity in the i-th entity pair, and the head and tail feature vectors of the tail entity in the i-th entity pair, are obtained from the token-set encoding vector X_N; and the model is guided to learn the implicit relations between medical entity pairs by constructing positive and negative samples, ensuring that the model judges only entity pairs with a factual medical relation.

2. The method of claim 1, wherein the public medical information extraction datasets are the CHIP2020 Chinese medical text named entity recognition dataset and Chinese medical entity relation extraction dataset, and the CCKS2020 medical named entity recognition dataset and medical entity and attribute extraction dataset.

3. The medical text data processing method according to claim 1, wherein the entity type labels mainly comprise: specific part of the affected area (Body part), presence or absence of obvious patient indicators (Symptom), growth and development indicators (BMI), specific position of the affected area (Direction), disease name (Disease), presence or absence of sampling data (Sample), disease progress (Change), attribute feature (Feature), stimulus element (Incentive), time (Time), and disease stage (Degree), wherein a symptom entity type is prefixed with a "-" sign to indicate that the patient does not have the symptom or sign, the relations between entities are expressed as ordered pairs, and the specific labeling method comprises the following steps:

4. The method of claim 1, wherein the relative distance feature vector between the pair of entities is:

d_i = Linear(|s_i2 - e_i1|)

wherein s_i2 and e_i1 denote the feature vectors, in the BERT position encoding, of the tail entity and the head entity of the i-th entity pair respectively; the absolute value of their difference represents the relative position of the two medical entities in the pair, and Linear(·) denotes a further nonlinear mapping of the pair's position vector through a fully connected layer.

5. The medical text data processing method according to claim 1, wherein the extracted medical entities and medical entity relations are traversed, medical entities whose text is overly long are removed, entity pairs with a medical relation are visualized and stored in the format {head entity-medical relation, tail entity}, and independently existing medical entities are visualized and stored in the format {entity type, entity value}.

6. A medical text data processing apparatus, wherein the apparatus performs and implements the medical text data processing method according to any one of claims 1 to 5.