CN116737924B - Medical text data processing method and device - Google Patents
- Publication number
- CN116737924B (application number CN202310478699.1A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/60—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention relates to a medical text data processing method and device. The method comprises: fine-tuning the Chinese medical pre-training model MC-BERT on collected public medical information extraction data sets to obtain a more robust language model; segmenting the input text at character granularity into a token set of length N, constructing an N×N token span matrix, predicting the start and end positions of medical entities from the matrix, and identifying the text range corresponding to each entity; and feeding entity pairs that have a medical relation into a multi-relation classifier fused with distance perception to determine the medical entity relations and output a structured result. The invention uses deep-learning-based natural language understanding so that a machine reads medical text and automatically extracts large numbers of professional medical entities and relations, which improves the efficiency and quality of clinical research and supports the construction of specialized hospital databases.
Description
Technical Field
The invention belongs to the technical field of information processing, and particularly relates to a method and a device for processing medical texts by using an artificial intelligence technology.
Background
Artificial intelligence (AI) refers to the intelligence exhibited by machines made by humans, and generally means intelligence implemented on a general-purpose computer. Artificial intelligence includes weak artificial intelligence and strong artificial intelligence. Weak artificial intelligence (also known as narrow artificial intelligence) is generally understood as artificial intelligence technology focused on solving a problem in a particular field, and may also be regarded as a technical tool applied to that field.
Natural language processing is an important branch of narrow artificial intelligence. It focuses on the processing and application of natural language and has been widely applied in human-computer interaction. Natural language processing covers information retrieval, information extraction, machine translation, text reading, word segmentation, part-of-speech tagging, automatic summarization, and related fields.
In practical applications of healthcare big data, word segmentation and labeling techniques from natural language processing can be used to analyze medical records that doctors write in natural language and to extract information such as a patient's symptoms, diagnosis and treatment information, and events. Acquiring and standardizing this information plays an important role in clinical research, in the construction of AI-assisted diagnosis and treatment systems, and in other applications.
Medical text data contain abundant medical information. Structuring medical text means performing structured analysis on irregular medical texts, typified by electronic medical records and test reports, so that a machine, guided by clinical medical entity concepts, automatically extracts the key information a user wants from the text. This information supports application scenarios such as clinical academic research, medical knowledge graph construction, and clinical decision support. However, vast amounts of medical text can be neither understood nor computed on by machines, and because of the complexity and specialization of such data, medical researchers must expend significant effort to extract valid information from it. To use these data more efficiently and to extract information from medical text accurately, a technology for structuring medical text is urgently needed.
In existing schemes, entity and relation recognition in medical text is mainly performed with joint entity-relation extraction models. These models typically model the entity recognition task and the relation extraction task jointly, share parameters through a shared encoder, and directly output the entity triples that carry a relation. Such schemes have three shortcomings. First, they generally encode the text with BiLSTM or a general-domain Chinese pre-trained BERT and overlook the value of domain transfer of a pre-trained model using medical text: a language model fine-tuned on a large medical corpus contains rich medical prior knowledge and has better feature representation capability than a pre-trained model trained on general corpora. Second, they often ignore nested medical entities. For example, "right lung occupation" is a lesion type, while the "right lung" inside it is a body part; the two entities of different types are nested, and the existing schemes fail in this case. Third, for medical relation recognition, existing schemes are inflexible: the relation classifier cannot be quickly customized for different relation schemas, which limits the extensibility of the model.
Disclosure of Invention
In view of the problems in the prior art, the invention provides a medical text structuring method and device that use natural language understanding technology, combining a medical pre-training model with a distance-perception-based relation classifier, to accurately extract key information from medical text and form structured data.
To achieve the above object, the invention provides a medical text structuring method comprising the following steps:
Extracting data from the collected public medical information extraction data sets to construct a training set, and fine-tuning the Chinese medical pre-training model MC-BERT to complete the domain transfer of its parameters;
Segmenting clinical medical text based on the fine-tuned MC-BERT to obtain a token set of length N and constructing an N×N span matrix, where N is a natural number; feeding the segmented medical text into MC-BERT to obtain encoding vectors; determining the text range corresponding to each medical entity from the start and end positions in the matrix; and extracting the medical entities;
Performing relation discrimination on entity pairs that have a medical relation, based on a multi-class classifier built on a fully connected layer, and extracting the medical entity relations;
Fusing the extracted medical entities and medical entity relations.
Preferably, the public medical information extraction data sets are the CHIP2020 Chinese medical text named entity recognition and Chinese medical entity relation extraction data sets, and the CCKS2020 medical named entity recognition and medical entity and attribute extraction data sets.
Preferably, the method for fine-tuning the Chinese medical pre-training model is as follows. All public medical information extraction data sets are sequence-labeled using the BIOES encoding scheme, where B-Type marks the beginning of an entity, I-Type the middle of an entity, O a non-entity part, E-Type the end of an entity, and S-Type a single-token entity; Type denotes the corresponding medical entity type. When an entity of another type, Type-b, is nested inside a medical entity of type Type-a, the two entity types in the nesting relation are combined pairwise by merging their label layers, producing a new entity type label Type-a|Type-b. MC-BERT is then fine-tuned on the uniformly sequence-labeled data with named entity recognition as the learning objective, yielding a new language model after domain transfer.
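The BIOES scheme described above can be illustrated with a minimal sketch. The function name `bioes_tags`, the `(start, end, type)` tuple layout with inclusive indices, and the example type names are assumptions made for illustration; they are not taken from the patent.

```python
from typing import List, Tuple


def bioes_tags(n_tokens: int, entities: List[Tuple[int, int, str]]) -> List[str]:
    """Assign BIOES tags to a sequence of n_tokens tokens.

    Each entity is (start, end, type) with inclusive token indices.
    Entities are assumed not to overlap in this single-layer sketch.
    """
    tags = ["O"] * n_tokens  # O marks non-entity tokens
    for start, end, etype in entities:
        if start == end:
            tags[start] = f"S-{etype}"  # single-token entity
        else:
            tags[start] = f"B-{etype}"  # beginning of the entity
            for i in range(start + 1, end):
                tags[i] = f"I-{etype}"  # middle of the entity
            tags[end] = f"E-{etype}"  # end of the entity
    return tags


# Hypothetical 6-token sentence with a 4-token lesion entity at positions 2..5.
example = bioes_tags(6, [(2, 5, "lesion")])
```

Nested entities need more than one tag layer; this sketch only covers the flat case.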
Preferably, the clinical medical text data are preprocessed by cleaning them and cutting long texts. The text is segmented using the dictionary file of the BERT model to obtain a token set of length N, and an N×N token matrix span is constructed to encode the entity labels. An entry of the matrix is span[start][end] = C, where [start][end] is the start-end range of the text corresponding to a medical entity and C is the entity category; C = 0 denotes non-entity text. With the fine-tuned MC-BERT used as the embedding layer, an entity type logit score is obtained for the text fragment corresponding to span[start][end], and a fragment whose score exceeds the threshold α is regarded as a valid entity.
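As a minimal sketch of the label encoding just described, the snippet below builds the N×N matrix with span[start][end] = C and reads the entities back from it. The helper names and the `label2id` mapping are hypothetical; the source only fixes C = 0 for non-entity text.

```python
from typing import Dict, List, Tuple


def build_span_matrix(
    n: int, entities: List[Tuple[int, int, str]], label2id: Dict[str, int]
) -> List[List[int]]:
    """Encode entities as span[start][end] = category id; 0 means non-entity."""
    span = [[0] * n for _ in range(n)]
    for start, end, etype in entities:
        if not 0 <= start <= end < n:
            raise ValueError(f"invalid span ({start}, {end}) for length {n}")
        span[start][end] = label2id[etype]
    return span


def spans_from_matrix(
    span: List[List[int]], id2label: Dict[int, str]
) -> List[Tuple[int, int, str]]:
    """Recover (start, end, type) triples from the non-zero matrix entries."""
    return [
        (s, e, id2label[c])
        for s, row in enumerate(span)
        for e, c in enumerate(row)
        if c != 0
    ]


# Hypothetical ids; a 4-token text with a body-part span nested in a lesion span.
label2id = {"bod": 1, "dis": 2}
matrix = build_span_matrix(4, [(0, 1, "bod"), (0, 3, "dis")], label2id)
```

Because each (start, end) pair has its own cell, two entities that share a start position but differ in length do not collide, which is how nested entities stay distinguishable in this encoding.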
Preferably, the relations between the labeled valid entities are determined by the following formula:
[Formula (1): relation classifier over the concatenation $p_i \circ d_i$ with $M$ output classes; the formula image is not recoverable from the source]
where $M$ is the total number of entity relation categories, $p_i$ is the context vector of the $i$-th entity pair, $d_i$ is the relative distance feature vector of the $i$-th entity pair, and the symbol $\circ$ denotes vector concatenation.
Preferably, the context vector of the entity pair is:
[Formula (2): context vector $p_i$ built from four entity boundary feature vectors; the formula image is not recoverable from the source]
In the formula, the first two vectors are the start and end feature vectors of the head entity in the $i$-th entity pair, and the last two are the start and end feature vectors of the tail entity in the $i$-th entity pair; all are obtained from the token-set encoding vector $X_N$. The method further comprises: constructing positive and negative samples to guide the model to learn the implicit relations between medical entity pairs, ensuring that the model only judges entity pairs that have a factual medical relation.
Preferably, the relative distance feature vector between the entities of a pair is:
$$ d_i = \mathrm{Linear}\left(\lvert s_{i2} - e_{i1} \rvert\right) \tag{3} $$
In the formula, $s_{i2}$ and $e_{i1}$ are the feature vectors, in the BERT position encoding (position embedding), of the tail entity and the head entity of the $i$-th entity pair respectively. The absolute value of their difference represents the relative position of the two medical entities in the pair, and the $\mathrm{Linear}(\cdot)$ function denotes a further nonlinear mapping of the entity pair's position vector through a fully connected layer.
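To make formula (3) concrete, here is a minimal pure-Python sketch of the distance feature: an element-wise absolute difference of two position embeddings followed by one fully connected layer. The weight matrix, bias, and vector sizes are made-up illustration values, and reading $s_{i2}$ and $e_{i1}$ as plain lists of floats is an assumption.

```python
from typing import List


def distance_feature(
    s_tail: List[float],
    e_head: List[float],
    weight: List[List[float]],
    bias: List[float],
) -> List[float]:
    """Compute d = Linear(|s_tail - e_head|) with d = W @ diff + b."""
    # Element-wise absolute difference of the two position embeddings.
    diff = [abs(a - b) for a, b in zip(s_tail, e_head)]
    # One fully connected layer: each output is a weighted sum plus a bias.
    return [sum(w * x for w, x in zip(row, diff)) + b for row, b in zip(weight, bias)]


# Hypothetical 2-dimensional position embeddings mapped to 3 output dimensions.
d = distance_feature(
    s_tail=[1.0, -2.0],
    e_head=[3.0, 1.0],
    weight=[[1.0, 0.0], [0.5, 0.5], [0.0, -1.0]],
    bias=[0.0, 1.0, 0.0],
)
```

In a real model the weights would be learned and the output width chosen to match the entity vectors, as the description of feature fusion requires.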
Preferably, the extracted medical entities and medical entity relations are traversed and medical entities whose text is too long are removed; entity pairs that have a medical relation are visualized and stored in the format {head entity-medical relation, tail entity}, and medical entities that exist independently are visualized and stored in the format {entity type, entity value}.
The invention also provides a medical text structuring device, which comprises:
A data preprocessing module, used for cleaning and processing the input medical text;
A medical entity extraction module, used for feeding the cleaned medical text into the fine-tuned natural language recognition model and extracting the text fragments corresponding to medical entities;
A medical entity relation extraction module, used for extracting the factual relations between medical entity pairs with a distance-perception relation classifier;
A two-stage result fusion module, used for fusing the medical entity results with the medical entity relation results and displaying them.
Compared with the prior art, the invention has the following advantages and positive effects:
The invention provides a medical text structuring method that focuses on the ability of a pre-trained language model to extract features from text. In view of the characteristics of the medical text structuring task, it fine-tunes a Chinese medical pre-training model on medical information extraction data sets, with named entity recognition as the entry point, to adapt the language model to the domain. Once the fine-tuned pre-training model is obtained, entity labels are encoded with a token span matrix, so that nested entities can be recognized. The distance-perception entity relation classifier learns the contextual relations among entities, and the construction of positive and negative samples ensures that the model only judges entity pairs that have a factual medical relation. The structured content is output by fusing the results of the two stages, which improves the data utilization efficiency of clinical medical text.
Drawings
FIG. 1 is a flow chart of a method of structuring medical text in accordance with an embodiment of the present invention;
FIG. 2 is a block diagram of a medical text structuring device according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating the BIOES encoding scheme according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of entity labels in the token matrix according to an embodiment of the present invention.
Detailed Description
The various aspects of the invention are described in detail below with reference to the drawings and detailed description. It will be apparent that the described embodiments are some, but not all, examples of embodiments of the invention. Elements, structures, and features of one embodiment may be beneficially incorporated in other embodiments without further recitation.
The medical text structuring method of the embodiment of the invention, as shown in fig. 1, comprises the following steps:
Step S1: fine-tuning the Chinese medical pre-training model MC-BERT on the collected public medical information extraction data sets, with named entity recognition as the task, to obtain a domain-adapted pre-trained language model. Specifically, before "fine-tuning the Chinese medical pre-training model MC-BERT", the step includes the following.
The public medical information extraction data sets are the CHIP2020 Chinese medical text named entity recognition and Chinese medical entity relation extraction data sets, and the CCKS2020 medical named entity recognition and medical entity and attribute extraction data sets.
All collected public medical information extraction data sets are sequence-labeled using the BIOES encoding scheme, where B-Type marks the beginning of an entity, I-Type the middle of an entity, O a non-entity part, E-Type the end of an entity, and S-Type a single-token entity; Type denotes the corresponding medical entity type. The labeled entity type tags mainly include: the specific affected body part (Body part), whether there is an obvious patient indicator (Symptom), growth and development index (BMI), the specific position of the affected part (Direction), disease name (Disease), whether sampling data exist (Sample), disease progress (Change), attribute feature (Feature), stimulus element (Incentive), time (Time), and disease stage (device). A "-" sign may be prefixed to a symptom entity type to indicate that the patient does not have that symptom or sign, and relations between entities are expressed as ordered pairs. The steps for obtaining symptoms and attributes using BIOES are as follows:
Extracting the entities of the medical information using named entity recognition and relation extraction on the collected public medical information, and marking negative symptoms;
Taking the specific affected body part, obvious patient indicators, growth and development indexes, and sampling data as entities, and determining the attributes corresponding to each entity;
Extracting the specific position and attribute features of the affected part based on the presence or absence of obvious affected-part indicators;
Extracting time, sampling data, disease stage, disease progress, and stimulus elements based on the presence or absence of obvious patient indicators;
Extracting disease progress and stimulus elements based on the presence or absence of obvious disease indicators;
Extracting attribute features and stimulus elements based on whether sampling data exist;
Merging and de-duplicating the extracted entities and attributes.
Specifically, in the actual labeling process, when an entity of another type, Type-b, is nested inside a medical entity of type Type-a, the two entity types in the nesting relation are combined pairwise by merging their label layers, producing a new entity type label Type-a|Type-b. For example, as shown in fig. 3, in the text "patient double lung nodules", "double lung nodules" is a lesion entity and "double lung" is a site entity, so the two tokens of "double lung" are labeled with the combined tags "B-site|B-lesion, E-site|I-lesion".
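The label-layer merge in this example can be sketched as follows. The function `merge_tag_layers` is a hypothetical helper: it assumes each entity type has already been tagged in its own BIOES layer and joins the layers token by token, placing the nested entity's tag first to match the order shown in the example.

```python
from typing import List


def merge_tag_layers(inner: List[str], outer: List[str]) -> List[str]:
    """Merge two BIOES tag layers of equal length into combined labels.

    A token tagged in both layers receives "inner|outer"; a token tagged in
    only one layer keeps that tag; a token tagged in neither stays "O".
    """
    if len(inner) != len(outer):
        raise ValueError("tag layers must have the same length")
    merged = []
    for a, b in zip(inner, outer):
        if a == "O":
            merged.append(b)
        elif b == "O":
            merged.append(a)
        else:
            merged.append(f"{a}|{b}")
    return merged


# Four tokens: the first two form a site entity nested in a four-token lesion.
site_layer = ["B-site", "E-site", "O", "O"]
lesion_layer = ["B-lesion", "I-lesion", "I-lesion", "E-lesion"]
combined = merge_tag_layers(site_layer, lesion_layer)
```

Each distinct combined label becomes one more class for the sequence labeling model, so the label set grows with the number of nesting combinations observed in the data.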
MC-BERT is a BERT natural language understanding model trained on large-scale Chinese medical corpora such as Chinese medical question answering, Chinese medical encyclopedias, and Chinese electronic medical records, so a great deal of medical knowledge has been explicitly injected into the model. MC-BERT is then fine-tuned on the uniformly sequence-labeled data with named entity recognition as the learning objective, yielding a new language model after domain transfer that is better suited to information extraction tasks.
Step S2: preprocessing the clinical medical text data by cleaning them and cutting long texts; segmenting the text with the vocabulary file that ships with the BERT model to obtain a token set of length N, and constructing an N×N span matrix to encode the entity labels; and, with the fine-tuned MC-BERT used as the embedding layer, obtaining the entity type logit score of the text fragment corresponding to each span matrix entry, where a fragment whose score exceeds the threshold α is regarded as a valid entity.
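The cleaning and cutting part of this step might look like the sketch below. The set of characters treated as garbage (ASCII control characters and the Unicode replacement character) and the function names are assumptions; the default length of 512 is the usual maximum sequence length of BERT models.

```python
import re
from typing import List

# Control characters and U+FFFD (the replacement character left by bad decoding).
_GARBAGE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f\ufffd]")


def clean_text(text: str) -> str:
    """Remove garbled and control characters and trim surrounding whitespace."""
    return _GARBAGE.sub("", text).strip()


def split_long_text(text: str, max_len: int = 512) -> List[str]:
    """Cut text into consecutive paragraphs of at most max_len characters."""
    return [text[i : i + max_len] for i in range(0, len(text), max_len)]


# A hypothetical record that is longer than one model input.
record = clean_text("x" * 1030 + "\x00")
paragraphs = split_long_text(record)
```

A fixed-length cut can split an entity across two paragraphs, and a real BERT input also reserves positions for special tokens, so a production pipeline would likely cut at sentence boundaries and use a slightly smaller limit.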
Specifically, the clinical medical text data are preprocessed by removing illegal garbled characters. If the text length exceeds 512, the upper limit supported by BERT, the long text is cut into segments of length 512 to obtain several data paragraphs. Using the vocab.txt file that ships with BERT, Chinese characters in the medical text are segmented at character granularity, while medical English characters and numerals are segmented into subwords. The resulting token set of length N is used to construct an N×N span matrix. The span matrix covers every possible fragment of the input text, which guarantees that nested entities are not missed. For example, as shown in fig. 4, segmenting the text "right lung occupation" yields a 4×4 token span matrix. In span[0][1] = bod, [0][1] is the start-end range of the corresponding text, namely "right lung", and the entity type is "body"; in span[0][3] = dis, [0][3] is the start-end range of the corresponding text, namely "right lung occupation", and the entity type is "dis"; all other non-entity entries are set to 0. The fine-tuned MC-BERT is used as the embedding layer to obtain the token-set encoding vector $X_N$. Two vectors are then derived from $X_N$ through nonlinear transformations, and their inner product is taken as the logits value of the span matrix, which evaluates the entity type score of the text fragment corresponding to span[start][end]. A fragment whose score exceeds the threshold α, set empirically to 0.5, is regarded as a valid entity.
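The inner-product scoring and thresholding can be sketched in pure Python. The two input lists stand in for the two transformed versions of $X_N$; the names `start_vecs` and `end_vecs`, the use of a sigmoid to turn logits into scores, and scoring a single entity type are all assumptions, since the source does not spell these details out.

```python
import math
from typing import List, Tuple


def span_scores(
    start_vecs: List[List[float]], end_vecs: List[List[float]]
) -> List[List[float]]:
    """Score every span (s, e) with s <= e as sigmoid(start_vecs[s] . end_vecs[e])."""
    n = len(start_vecs)
    scores = [[0.0] * n for _ in range(n)]
    for s in range(n):
        for e in range(s, n):  # only spans whose start does not exceed the end
            logit = sum(a * b for a, b in zip(start_vecs[s], end_vecs[e]))
            scores[s][e] = 1.0 / (1.0 + math.exp(-logit))
    return scores


def decode_spans(scores: List[List[float]], alpha: float = 0.5) -> List[Tuple[int, int]]:
    """Keep the spans whose score is strictly greater than the threshold alpha."""
    return [
        (s, e)
        for s, row in enumerate(scores)
        for e, v in enumerate(row)
        if e >= s and v > alpha
    ]


# Hypothetical 2-token input with 2-dimensional transformed vectors.
scores = span_scores([[2.0, 0.0], [0.0, -3.0]], [[-1.0, 0.0], [1.0, 0.0]])
valid = decode_spans(scores)
```

With several entity types, one such score matrix per type would be computed and decoded independently, which lets overlapping spans of different types both pass the threshold.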
Step S3: performing relation discrimination on entity pairs that have a medical relation, based on a multi-class classifier built on a fully connected layer, and extracting the medical entity relations.
Specifically, the labeled medical entities are paired to construct a training set. Entity pairs that have a factual medical relation are defined as positive samples, and randomly sampled entity pairs without a medical relation are defined as negative samples, ensuring that the model only judges entity pairs that have a factual medical relation. The relation between the entities of a pair is determined by the following formula:
[Formula (1): relation classifier over the concatenation $p_i \circ d_i$ with $M$ output classes; the formula image is not recoverable from the source]
where $M$ is the total number of entity relation categories, $p_i$ is the context vector of the $i$-th entity pair, $d_i$ is the relative distance feature vector of the $i$-th entity pair, and the symbol $\circ$ denotes vector concatenation.
The context vector of the entity pair is:
[Formula (2): context vector $p_i$ built from four entity boundary feature vectors; the formula image is not recoverable from the source]
In the formula, the first two vectors are the start and end feature vectors of the head entity in the $i$-th entity pair, and the last two are the start and end feature vectors of the tail entity in the $i$-th entity pair; all are obtained from the token-set encoding vector $X_N$.
The relative distance feature vector between the entities of a pair is:
$$ d_i = \mathrm{Linear}\left(\lvert s_{i2} - e_{i1} \rvert\right) \tag{3} $$
In the formula, $s_{i2}$ and $e_{i1}$ are the feature vectors, in the BERT position encoding (position embedding), of the tail entity and the head entity of the $i$-th entity pair respectively. The absolute value of their difference represents the relative position of the two medical entities in the pair, and the $\mathrm{Linear}(\cdot)$ function denotes a further nonlinear mapping of the entity pair's position vector through a fully connected layer. The mapped position vector keeps the same dimension as the entity vectors, and feature fusion is completed by concatenation.
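A minimal sketch of how the pieces could fit together is given below: the four boundary vectors are concatenated into $p_i$, $p_i$ is concatenated with $d_i$, and one fully connected layer produces $M$ scores. The source does not reproduce the classifier formula, so the softmax output, the function names, and all numeric values here are assumptions for illustration only.

```python
import math
from typing import List, Tuple


def concat(*vectors: List[float]) -> List[float]:
    """Concatenate vectors end to end."""
    return [x for v in vectors for x in v]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_relation(
    head_start: List[float],
    head_end: List[float],
    tail_start: List[float],
    tail_end: List[float],
    d: List[float],
    weight: List[List[float]],
    bias: List[float],
) -> Tuple[int, List[float]]:
    """Return (predicted class index, class probabilities) for one entity pair."""
    p = concat(head_start, head_end, tail_start, tail_end)  # context vector p_i
    x = concat(p, d)  # fused feature: p_i concatenated with d_i
    logits = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__), probs


# Hypothetical 1-dimensional vectors and M = 3 relation classes.
W = [[1.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0, 0.0]]
pred, probs = classify_relation([1.0], [0.0], [0.0], [0.0], [2.0], W, [0.0, 0.0, 0.0])
```

Keeping the classifier as a single layer over a concatenated feature means that supporting a new relation schema only requires changing the number of output rows, which fits the flexibility goal stated in the background section.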
Step S4: traversing the extracted medical entities and medical entity relations, removing medical entities whose text is too long, visualizing and storing entity pairs that have a medical relation in the format {head entity-medical relation, tail entity}, and visualizing and storing independently existing medical entities in the format {entity type, entity value}. For example, from the text "patient shows double lung nodules on CT examination in January 2020", steps S2 and S3 extract (date, January 2020), (examination means, CT), and (lesion, double lung nodules). The relation between "date" and "examination means" is "examination date", which is formatted as {CT-examination date, January 2020}. The entity "lesion" exists independently and has no medical relation to other entities, so it is formatted as {lesion, double lung nodules}.
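The fusion step can be sketched as a small formatting function. The function name, the tuple layouts, and the `max_len` cutoff of 50 characters are hypothetical; the source says only that overly long entities are removed, without giving a limit.

```python
from typing import List, Tuple


def fuse_results(
    entities: List[Tuple[str, str]],
    relations: List[Tuple[str, str, str]],
    max_len: int = 50,
) -> List[str]:
    """Format related pairs and stand-alone entities as structured strings.

    entities: (entity type, entity value); relations: (head, relation, tail),
    where head and tail are entity values.
    """
    kept = [(t, v) for t, v in entities if len(v) <= max_len]  # drop overlong text
    kept_values = {v for _, v in kept}
    output, linked = [], set()
    for head, relation, tail in relations:
        if head in kept_values and tail in kept_values:
            output.append(f"{{{head}-{relation}, {tail}}}")
            linked.update((head, tail))
    for etype, value in kept:
        if value not in linked:  # entity takes part in no relation
            output.append(f"{{{etype}, {value}}}")
    return output


result = fuse_results(
    [("date", "January 2020"), ("examination means", "CT"), ("lesion", "double lung nodules")],
    [("CT", "examination date", "January 2020")],
)
```

For the example sentence this yields the relation record for CT followed by the stand-alone lesion record.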
In summary, the invention provides a medical text structuring method, which can automatically perform structuring extraction on an input medical text to obtain a large number of professional medical entities and relations, and remarkably improve the efficiency and quality of medical clinical scientific research.
Example 2: referring to fig. 2, the present embodiment provides a medical text structuring device. Its functional modules are described in detail as follows:
The data preprocessing module is used for cleaning and processing the input medical text;
the medical entity extraction module is used for inputting the medical text after the cleaning treatment into the natural language recognition model after the fine adjustment and extracting text fragments corresponding to the medical entity;
Specifically, the medical entity extraction module uses a medical pre-training model MC-BERT after domain migration as embedding to judge whether the text range corresponding to the token span matrix index is a predefined medical entity or not;
the medical entity relation extracting module is used for extracting the fact relation between the medical entity pairs by using a relation classifier of distance perception;
Specifically, the medical entity relation extraction module is used for training the model by constructed positive and negative sample pairs, integrating entity position feature vectors in the learning process, and carrying out relation identification among entities by using a multi-classifier.
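The construction of positive and negative sample pairs might be sketched as follows. The helper name, the `no_relation` label for negatives, the ratio of negatives to positives, and the fixed random seed are assumptions chosen for illustration.

```python
import random
from itertools import permutations
from typing import Dict, List, Tuple


def build_pair_samples(
    entities: List[str],
    gold_relations: Dict[Tuple[str, str], str],
    neg_per_pos: int = 1,
    seed: int = 0,
) -> List[Tuple[str, str, str]]:
    """Build (head, tail, label) training samples from annotated entities.

    Pairs listed in gold_relations are positives; negatives are randomly
    sampled ordered pairs that carry no annotated relation.
    """
    positives = [(h, t, rel) for (h, t), rel in gold_relations.items()]
    candidates = [
        (h, t) for h, t in permutations(entities, 2) if (h, t) not in gold_relations
    ]
    rng = random.Random(seed)  # seeded for reproducible sampling
    k = min(len(candidates), neg_per_pos * len(positives))
    negatives = [(h, t, "no_relation") for h, t in rng.sample(candidates, k)]
    return positives + negatives


samples = build_pair_samples(
    ["CT", "January 2020", "double lung nodules"],
    {("CT", "January 2020"): "examination date"},
    neg_per_pos=2,
)
```

Because pairs are ordered, the reversed pair of a positive sample can be drawn as a negative, which teaches the classifier the direction of a relation as well as its existence.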
The double-stage result fusion module is used for fusing results of the medical entity and the medical entity relationship and displaying the results;
further, the medical text structuring apparatus further comprises: and the labeling module is used for labeling entities and relations of the clinical medical text data.
The above-described embodiments are intended to illustrate the present invention, not to limit it, and any modifications and variations made to the present invention within the spirit of the present invention and the scope of the appended claims should be construed as falling within the scope of the present invention.
Claims (6)
1. A medical text data processing method, the method comprising:
extracting a data set according to the acquired public medical information to construct a training set, and fine-tuning a Chinese medical pre-training model MC-BERT to finish domain migration of parameters;
Segmenting clinical medical text based on the fine-tuned MC-BERT to obtain a token set of length N and constructing an N×N matrix, where N is a natural number; then feeding the segmented medical text into MC-BERT to obtain encoding vectors, inferring the text range corresponding to each medical entity from the position coordinates of the matrix, and extracting the medical entities;
Based on the multi-classifier of the full-connection layer, carrying out relationship discrimination on entity pairs with medical relationships, and extracting medical entity relationships;
performing result fusion on the extracted medical entity and the medical entity relationship;
The method for fine tuning the Chinese medical pre-training model comprises the following steps: performing sequence labeling on all collected public medical information extraction data sets based on BIOES coding modes, wherein B-Type represents the beginning of an entity, I-Type represents the middle of the entity, O represents a non-entity part, E-Type represents the tail of the entity, S-Type represents a single-word entity, and Type represents the corresponding medical entity Type; when nesting other types of entity types-b in a certain Type of medical entity Type-a, combining two types of entity types with nesting relationship in pairs by adopting a mode of combining label layers to generate a new entity Type label Type-a|type-b; the MC-BERT is finely tuned by taking a named entity recognition task as a learning target through the data marked by the unified sequence, so that a new language model after the field migration is obtained;
The extraction medical entity comprises the following specific steps: preprocessing clinical medical text data, cleaning and cutting long text; the method comprises the steps of performing word segmentation by adopting a dictionary file of a BERT model, obtaining a word element set with the length of N, constructing a span matrix with the length of N for encoding an entity tag, wherein a subscript value span [ start ] [ end ] =C of the matrix, wherein [ start ] [ end ] represents a start-stop range of a text corresponding to a medical entity, C represents an entity category, and C=0 represents a non-entity text; obtaining an entity type logic score of a text fragment corresponding to span [ start ] [ end ] by taking the fine-tuned MC-Bert as embedding, wherein the score is larger than a threshold alpha and is regarded as an effective entity;
and determining the relations between the marked valid entities through the following formula:
where M denotes the total number of entity relation categories, $p_i$ denotes the context vector of the i-th entity pair, $d_i$ denotes the relative distance feature vector of the i-th entity pair, and the symbol ∘ denotes the vector concatenation operation;
the context vector of the entity pair is:
where the first pair of symbols denotes the head and tail feature vectors of the head entity in the i-th entity pair, and the second pair denotes the head and tail feature vectors of the tail entity in the i-th entity pair; these feature vectors are taken from the token set encoding vectors $X_N$; and by constructing positive and negative samples, the model is guided to learn the implicit relations between medical entity pairs, so that the model judges only entity pairs having a factual medical relation.
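The BIOES labeling with merged nested labels recited in claim 1 can be illustrated with a short Python sketch. The claim does not spell out how the merged label Type-a|Type-b is placed on individual tokens, so the sketch assumes one possible reading: the position prefix follows the outer entity, and tokens shared with the nested entity carry the merged type. The function name `bioes_tags` and the example spans are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Span = Tuple[int, int, str]  # (start, end inclusive, entity type)


def bioes_tags(n: int, entities: Sequence[Span]) -> List[str]:
    """Tag n tokens with BIOES labels, merging nested types as Type-a|Type-b."""
    types: List[Optional[str]] = [None] * n
    prefix = ["O"] * n
    # Longest spans first, so an outer entity is placed before anything nested in it.
    for start, end, etype in sorted(entities, key=lambda e: e[0] - e[1]):
        covered = [t is not None for t in types[start:end + 1]]
        if not any(covered):
            # Flat entity: ordinary BIOES tagging.
            for i in range(start, end + 1):
                types[i] = etype
                prefix[i] = "I"
            if start == end:
                prefix[start] = "S"
            else:
                prefix[start], prefix[end] = "B", "E"
        elif all(covered):
            # Nested entity (assumed reading): keep the outer prefix, merge the types.
            for i in range(start, end + 1):
                types[i] = f"{types[i]}|{etype}"
        else:
            raise ValueError("partially overlapping spans are not handled in this sketch")
    return [p if t is None else f"{p}-{t}" for p, t in zip(prefix, types)]


# A Body entity (tokens 0..1) nested inside a Disease entity (tokens 0..4).
print(bioes_tags(6, [(0, 4, "Disease"), (0, 1, "Body")]))
```

Because the merged label is an ordinary tag string, a standard sequence-labeling head can be trained on it without a separate nested-entity decoder, which appears to be the motivation for merging label layers.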
2. The method of claim 1, wherein the public medical information extraction datasets are the CHIP2020 Chinese medical text named entity recognition and Chinese medical entity relation extraction datasets, and the CCKS2020 medical named entity recognition and medical entity and attribute extraction datasets.
3. The medical text data processing method according to claim 1, wherein the labeled entity type tags mainly comprise: specific part of the affected area (Body part), obvious patient indicator (Symptom), growth and development indicator (BMI), specific position of the affected area (direction), disease name (Disease), sampling data (Sample), disease progress (Change), attribute (Feature), stimulus element (INCENTIVE), time, and disease stage (Degree); wherein, when a symptom is labeled, a "-" sign is prefixed to the entity type to indicate that the patient does not have the symptom or sign; the relations between entities are expressed as ordered pairs; and the specific labeling method comprises:
extracting the entities of the medical information using named entity recognition and relation extraction techniques on the collected public medical information, and marking negative symptoms;
taking the specific part of the affected area, the obvious patient indicators, the growth and development indicators, and the sampling data as entities, and determining the attributes corresponding to each entity;
extracting the specific position and attribute features of the affected part according to whether obvious affected-part indicators are present;
extracting the time, sampling data, disease stage, disease progress, and stimulus elements according to whether obvious patient indicators are present;
extracting the disease progress and stimulus elements according to whether obvious disease indicators are present;
extracting the attribute features and stimulus elements according to whether sampling data are present;
and merging and de-duplicating the extracted entities and attributes.
4. The method of claim 1, wherein the relative distance feature vector of the entity pair is:
$d_i = \mathrm{Linear}(|s_{i2} - e_{i1}|)$
where $s_{i2}$ and $e_{i1}$ denote the feature vectors, in the BERT position encoding, of the tail entity and the head entity of the i-th entity pair, respectively; the absolute value of the difference of the two vectors represents the relative position of the two medical entities in the entity pair; and the Linear(·) function denotes a further nonlinear mapping of the entity pair's position vector through a fully connected layer.
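The distance feature of claim 4 and its concatenation with the context vector can be illustrated with a dependency-free Python sketch. The vector sizes, the random weights, and all function names are assumptions made for illustration. The scoring formula of claim 1 is not reproduced in this text, so the softmax over a fully connected layer used in `classify_relation` is likewise an assumption rather than the claimed formula.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Fully connected layer: weight @ x + bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def distance_feature(s_i2: Vector, e_i1: Vector, weight: Matrix, bias: Vector) -> Vector:
    """d_i = Linear(|s_i2 - e_i1|), with an element-wise absolute difference."""
    diff = [abs(a - b) for a, b in zip(s_i2, e_i1)]
    return linear(diff, weight, bias)


def softmax(z: Vector) -> Vector:
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def classify_relation(p_i: Vector, d_i: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Concatenate p_i and d_i, then score M relation classes (assumed softmax head)."""
    return softmax(linear(p_i + d_i, weight, bias))


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
POS_DIM, DIST_DIM, CTX_DIM, M = 4, 3, 8, 5  # hypothetical sizes
s_i2 = [rng.random() for _ in range(POS_DIM)]  # tail-entity position vector
e_i1 = [rng.random() for _ in range(POS_DIM)]  # head-entity position vector
p_i = [rng.random() for _ in range(CTX_DIM)]   # context vector of the pair

d_i = distance_feature(s_i2, e_i1, random_matrix(DIST_DIM, POS_DIM, rng), [0.0] * DIST_DIM)
probs = classify_relation(p_i, d_i, random_matrix(M, CTX_DIM + DIST_DIM, rng), [0.0] * M)
print(len(probs), round(sum(probs), 6))
```

In a trained model the weights would be learned jointly with the encoder; random values are used here only so that the sketch runs on its own.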
5. The medical text data processing method according to claim 1, wherein the extracted medical entities and medical entity relations are traversed; medical entities with over-long text are removed; entity pairs having a medical relation are visualized and stored in the format {head entity-medical relation, tail entity}; and medical entities that exist independently are visualized and stored in the format {entity type, entity value}.
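The result fusion of claim 5 can be sketched in Python as follows. The length cutoff `MAX_ENTITY_CHARS`, the relation names, and the choice to represent each stored record as a one-entry dictionary are all assumptions; the claim only names the two output formats.

```python
from typing import Dict, List, Sequence, Tuple

MAX_ENTITY_CHARS = 20  # hypothetical cutoff for "over-long" entity text

Entity = Tuple[str, str]         # (entity type, entity value)
Relation = Tuple[str, str, str]  # (head value, relation, tail value)


def fuse_results(
    entities: Sequence[Entity],
    relations: Sequence[Relation],
    max_chars: int = MAX_ENTITY_CHARS,
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Drop over-long entities, then emit relation records and standalone records."""
    kept = [(etype, value) for etype, value in entities if len(value) <= max_chars]
    kept_values = {value for _, value in kept}

    relation_records: List[Dict[str, str]] = []
    paired = set()
    for head, relation, tail in relations:
        # A relation survives only if both of its entities survived the filter.
        if head in kept_values and tail in kept_values:
            relation_records.append({f"{head}-{relation}": tail})
            paired.update((head, tail))

    # Entities that take part in no surviving relation are stored on their own.
    standalone_records = [{etype: value} for etype, value in kept if value not in paired]
    return relation_records, standalone_records


entities = [
    ("Disease", "pneumonia"),
    ("Symptom", "cough"),
    ("Body", "x" * 30),  # over-long text, removed by the filter
    ("Time", "three days"),
]
relations = [
    ("pneumonia", "has_symptom", "cough"),
    ("pneumonia", "located_in", "x" * 30),
]
print(fuse_results(entities, relations))
```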
6. A medical text data processing apparatus, comprising:
a data preprocessing module for cleaning and processing the input medical text;
a medical entity extraction module for inputting the cleaned medical text into the fine-tuned natural language recognition model and extracting the text fragments corresponding to medical entities;
a medical entity relation extraction module for extracting the factual relations between medical entity pairs using a distance-aware relation classifier;
a two-stage result fusion module for fusing the medical entity results and the medical entity relation results and displaying them;
wherein the apparatus executes and implements the medical text data processing method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310478699.1A CN116737924B (en) | 2023-04-27 | 2023-04-27 | Medical text data processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116737924A (en) | 2023-09-12 |
CN116737924B (en) | 2024-06-25 |
Family
ID=87912216
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310478699.1A Active CN116737924B (en) | 2023-04-27 | 2023-04-27 | Medical text data processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116737924B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |