CN116737924B - Medical text data processing method and device - Google Patents


Info

Publication number
CN116737924B
Authority
CN
China
Prior art keywords
entity
medical
text
type
extracting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310478699.1A
Other languages
Chinese (zh)
Other versions
CN116737924A (en)
Inventor
李琴
杨斌
文治中
宋黎晓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baiyang Intelligent Technology Group Co ltd
Original Assignee
Baiyang Intelligent Technology Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baiyang Intelligent Technology Group Co ltd filed Critical Baiyang Intelligent Technology Group Co ltd
Priority to CN202310478699.1A priority Critical patent/CN116737924B/en
Publication of CN116737924A publication Critical patent/CN116737924A/en
Application granted granted Critical
Publication of CN116737924B publication Critical patent/CN116737924B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention relates to a medical text data processing method and device. The method comprises: fine-tuning the Chinese medical pre-trained model MC-BERT on collected public medical information extraction datasets to obtain a more robust language model; segmenting an input text at character granularity into a token set of length N, constructing an N × N token span matrix, predicting the start and end positions of medical entities from the matrix, and identifying the text range corresponding to each entity; and feeding entity pairs that have a medical relation into a distance-aware multi-relation classifier to determine the medical entity relations and output a structured result. The invention uses deep-learning-based natural language understanding so that a machine reads and understands medical texts and automatically extracts a large number of specialized medical entities and relations, which improves the efficiency and quality of clinical research and is of significance for building specialized hospital databases.

Description

Medical text data processing method and device
Technical Field
The invention belongs to the technical field of information processing, and particularly relates to a method and a device for processing medical texts by using an artificial intelligence technology.
Background
Artificial intelligence (AI) refers to the intelligence exhibited by machines made by humans, and generally means intelligence implemented on general-purpose computers. It is divided into weak artificial intelligence and strong artificial intelligence. Weak artificial intelligence (also called narrow artificial intelligence) is generally understood as artificial intelligence technology focused on solving problems in a particular field, and can also be regarded as a technical tool applied to that field.
Natural language processing is an important branch of narrow artificial intelligence. It focuses on the processing and application of natural language and is widely used in human-computer interaction. Its scope includes information retrieval, information extraction, machine translation, text reading, word segmentation, part-of-speech tagging, automatic summarization, and the like.
In practical healthcare big-data applications, word segmentation and tagging techniques from natural language processing can be used to analyze medical records that doctors write in natural language and to extract information such as a patient's symptoms, diagnosis and treatment information, and events. Acquiring and standardizing this information plays an important role in clinical research, in building AI-assisted diagnosis and treatment systems, and in other applications.
Medical text data contains abundant medical information. Structuring medical text means performing structured analysis on irregular medical texts, typified by electronic medical records and test reports, so that a machine, guided by clinical medical entity concepts, automatically extracts the key information a user wants from the text. This information supports application scenarios such as clinical academic research, medical knowledge graph construction, and clinical decision support. However, the vast amount of medical text is neither understandable nor computable by machines, and because of its complexity and specialized nature, medical researchers must spend considerable effort to extract valid information from it. To use these data more efficiently and extract information from medical text accurately, a technology for structuring medical text is urgently needed.
In existing schemes, entities and relations in medical text are mainly recognized with joint entity-relation extraction models: the entity recognition task and the relation extraction task are modeled jointly, parameters are shared through a shared encoder, and entity triples with relations are obtained directly. Such schemes generally encode the text with BiLSTM or a general Chinese pre-trained BERT and overlook the importance of domain transfer, that is, adapting the pre-trained model with medical text. A language model fine-tuned on a large medical corpus contains rich medical prior knowledge and has better feature representation capability than a pre-trained model trained on general corpora. Second, such schemes often ignore nested medical entities. For example, "right lung occupation" denotes a lesion type, while "right lung" inside it denotes a body part; the two entities of different types are nested, and existing schemes fail on such nested entities. As for medical relation identification, existing schemes are inflexible: the relation classifier cannot be quickly customized for different relation schemas, which limits the extensibility of the model.
Disclosure of Invention
To address the problems in the prior art, the invention provides a medical text structuring method and device that use natural language understanding technology, combining a medical pre-trained model with a distance-aware relation classifier, to accurately extract key information from medical texts and form structured data.
In order to achieve the above object, the present invention provides a medical text structuring method, comprising the steps of:
constructing a training set from collected public medical information extraction datasets, and fine-tuning the Chinese medical pre-trained model MC-BERT to complete the domain transfer of its parameters;
segmenting a clinical medical text based on the fine-tuned MC-BERT to obtain a token set of length N, where N is a natural number, constructing an N × N span matrix, feeding the segmented medical text into MC-BERT to obtain encoding vectors, determining the text range corresponding to each medical entity from the start and end positions in the matrix, and extracting the medical entities;
performing relation classification on entity pairs that have medical relations, using a multi-class classifier based on a fully connected layer, and extracting the medical entity relations; and
fusing the extracted medical entities and medical entity relations.
Preferably, the public medical information extraction datasets are the CHIP2020 Chinese medical text named entity recognition and Chinese medical entity relation extraction datasets, and the CCKS2020 medical named entity recognition and medical entity and attribute extraction datasets.
Preferably, the Chinese medical pre-trained model is fine-tuned as follows: all the public medical information extraction datasets are sequence-labeled with the BIOES scheme, where B-Type marks the beginning of an entity, I-Type the middle of an entity, O a non-entity part, E-Type the end of an entity, and S-Type a single-token entity, and Type denotes the corresponding medical entity type. When an entity of another type, Type-b, is nested inside a medical entity of type Type-a, the two nested entity types are combined pairwise by merging label layers, producing a new entity type label Type-a|Type-b. MC-BERT is then fine-tuned on the uniformly sequence-labeled data, with named entity recognition as the learning objective, to obtain a new language model after domain transfer.
Preferably, the clinical medical text data is preprocessed: it is cleaned and long texts are cut. The text is segmented with the vocabulary file of the BERT model to obtain a token set of length N, and an N × N token matrix span is constructed to encode entity labels. An element span[start][end] = C of the matrix means that [start][end] is the start-end range of the text corresponding to a medical entity and C is the entity category, with C = 0 denoting non-entity text. With the fine-tuned MC-BERT used as the embedding, a logits score of the entity type is obtained for the text fragment corresponding to span[start][end]; a fragment whose score exceeds the threshold α is regarded as a valid entity.
Preferably, the relation between the labeled valid entities is determined by formula (1) [formula not reproduced], where M denotes the total number of entity relation categories, p_i denotes the context vector of the i-th entity pair, d_i denotes the relative distance feature vector of the i-th entity pair, and the symbol ∘ denotes vector concatenation.
Preferably, the context vector of an entity pair is given by formula (2) [formula not reproduced]. Its terms are the head and tail (start and end) feature vectors of the head entity in the i-th entity pair and the head and tail feature vectors of the tail entity in the i-th entity pair, all of which are taken from the token-set encoding vector X_N. The method further comprises constructing positive and negative samples to guide the model to learn the implicit relations between medical entity pairs, so that the model only classifies entity pairs that have a factual medical relation.
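The construction of positive and negative samples can be illustrated with a minimal sketch. It assumes that annotated relations are given as ordered (head, tail) pairs and that negatives are drawn at random from the remaining ordered pairs; the function name `build_relation_samples`, the `neg_ratio` parameter, and the `"no_relation"` label are hypothetical and do not come from the patent.

```python
import random
from itertools import permutations


def build_relation_samples(entities, gold_relations, neg_ratio=1, seed=0):
    """Build training samples for the relation classifier.

    entities:       list of entity identifiers found in one text
    gold_relations: dict mapping an ordered pair (head, tail) to its relation
    neg_ratio:      assumed number of negatives drawn per positive sample
    """
    # Positive samples: pairs with an annotated (factual) medical relation.
    positives = [(h, t, rel) for (h, t), rel in gold_relations.items()]

    # Candidate negatives: every ordered pair without an annotated relation.
    candidates = [
        (h, t) for h, t in permutations(entities, 2) if (h, t) not in gold_relations
    ]

    # Randomly sample negatives, never more than the candidates available.
    rng = random.Random(seed)
    k = min(len(candidates), neg_ratio * len(positives))
    negatives = [(h, t, "no_relation") for h, t in rng.sample(candidates, k)]
    return positives + negatives


entities = ["CT", "January 2020", "double lung nodules"]
gold = {("CT", "January 2020"): "examination date"}
samples = build_relation_samples(entities, gold, neg_ratio=2, seed=1)
```

Treating pairs as ordered matches the statement that relations between entities are expressed as ordered pairs; the ratio of negatives to positives is a tuning choice that the text does not specify.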
Preferably, the relative distance feature vector of an entity pair is:
$$d_i = \mathrm{Linear}\left(\lvert s_{i2} - e_{i1} \rvert\right) \tag{3}$$
where $s_{i2}$ and $e_{i1}$ denote the BERT position-embedding feature vectors of the tail entity and the head entity of the i-th entity pair, respectively. The absolute value of their difference represents the relative position of the two medical entities in the pair, and Linear(·) denotes a further nonlinear mapping of the pair's position vector through a fully connected layer.
Preferably, the extracted medical entities and medical entity relations are traversed, medical entities whose text is too long are removed, entity pairs that have a medical relation are displayed and stored in the format {head entity-medical relation, tail entity}, and medical entities that exist independently are displayed and stored in the format {entity type, entity value}.
The invention also provides a medical text structuring device, comprising:
a data preprocessing module, used for cleaning the input medical text;
a medical entity extraction module, used for feeding the cleaned medical text into the fine-tuned natural language recognition model and extracting the text fragments corresponding to medical entities;
a medical entity relation extraction module, used for extracting the factual relations between medical entity pairs with a distance-aware relation classifier; and
a two-stage result fusion module, used for fusing and displaying the medical entity and medical entity relation results.
Compared with the prior art, the invention has the following advantages and positive effects:
The invention provides a medical text structuring method that focuses on the feature extraction capability of a pre-trained language model. In view of the characteristics of the medical text structuring task, it fine-tunes a Chinese medical pre-trained model on medical information extraction datasets, with named entity recognition as the entry point, to adapt the language model to the domain. With the fine-tuned model, entity labels are encoded as a token span matrix, which ensures that nested entities can be identified. A distance-aware entity relation classifier learns the contextual relations between entities, and constructed positive and negative samples ensure that the model only classifies entity pairs that have a factual medical relation. The structured content is output by fusing the results of the two stages, which improves the data utilization efficiency of clinical medical texts.
Drawings
FIG. 1 is a flow chart of a method of structuring medical text in accordance with an embodiment of the present invention;
FIG. 2 is a block diagram of a medical text structuring device according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a BIOES coding mode according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a word element matrix entity tag according to an embodiment of the present invention;
Detailed Description
The various aspects of the invention are described in detail below with reference to the drawings and detailed description. It will be apparent that the described embodiments are some, but not all, examples of embodiments of the invention. Elements, structures, and features of one embodiment may be beneficially incorporated in other embodiments without further recitation.
The medical text structuring method of the embodiment of the invention, as shown in fig. 1, comprises the following steps:
Step S1: fine-tune the Chinese medical pre-trained model MC-BERT on the collected public medical information extraction datasets, using a named entity recognition task, to obtain a domain-adapted pre-trained language model. Specifically, before MC-BERT is fine-tuned, the following is carried out:
The public medical information extraction datasets are the CHIP2020 Chinese medical text named entity recognition and Chinese medical entity relation extraction datasets, and the CCKS2020 medical named entity recognition and medical entity and attribute extraction datasets.
All collected public medical information extraction datasets are sequence-labeled with the BIOES scheme, where B-Type marks the beginning of an entity, I-Type the middle of an entity, O a non-entity part, E-Type the end of an entity, and S-Type a single-token entity, and Type denotes the corresponding medical entity type. The labeled entity types mainly include: the specific part of the affected area (Body part), the presence or absence of obvious patient indicators (Symptom), growth and development indicators (BMI), the specific position of the affected area (Direction), the disease name (Disease), whether sampling data exist (Sample), the progress of the disease (Change), attribute features (Feature), stimulus factors (Incentive), time (Time), and the stage of the disease (Degree). A minus sign (-) may be placed before the Symptom entity type to indicate that the patient does not have the symptom or sign. Relations between entities are expressed as ordered pairs. The steps for obtaining symptoms and attributes with BIOES are as follows:
extracting the entities in the medical information using the named entity recognition and relation extraction techniques of the collected public medical information, and marking negative symptoms;
taking the specific part of the affected area, obvious patient indicators, growth and development indicators, and sampling data as entities, and determining the attributes corresponding to each entity;
extracting the specific position of the affected area and attribute features, based on the presence or absence of obvious affected-area indicators;
extracting time, sampling data, the stage of the disease, the progress of the disease, and stimulus factors, based on the presence or absence of obvious patient indicators;
extracting the progress of the disease and stimulus factors, based on the presence or absence of obvious disease indicators;
extracting attribute features and stimulus factors, based on whether sampling data exist; and
merging and de-duplicating the extracted entities and attributes.
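Two of the steps above, marking negative symptoms and merging with de-duplication, can be illustrated with a small sketch. It assumes each extraction pass yields (entity type, value) tuples; the function names and the sample values are made up for illustration and are not taken from the patent.

```python
def symptom_tag(present):
    """Return the Symptom type, prefixed with '-' when the symptom is absent."""
    return "Symptom" if present else "-Symptom"


def merge_and_dedupe(*batches):
    """Merge several extraction passes and drop exact duplicates.

    dict.fromkeys keeps the first occurrence of each item, so the
    original extraction order is preserved.
    """
    merged = []
    for batch in batches:
        merged.extend(batch)
    return list(dict.fromkeys(merged))


# Illustrative, made-up extraction results from two passes over one record.
pass_one = [(symptom_tag(True), "cough"), ("Time", "3 days")]
pass_two = [(symptom_tag(True), "cough"), (symptom_tag(False), "fever")]
records = merge_and_dedupe(pass_one, pass_two)
```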
Specifically, in the actual labeling process, when an entity of another type, Type-b, is nested inside a medical entity of type Type-a, the two nested entity types are combined pairwise by merging label layers, producing a new entity type label Type-a|Type-b. For example, as shown in FIG. 3, in the text "patient two lung nodules", "two lung nodules" is a lesion entity and "two lung" is a site entity, so the tokens of "two lung" carry the combined labels "B-site|B-lesion" and "E-site|I-lesion".
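The label-layer merging can be sketched as follows. This is a minimal illustration, assuming that each entity is given as an inclusive (start, end, type) token range and that the phrase in FIG. 3 is tokenized into four characters; the function names are hypothetical.

```python
def bioes_tags(length, start, end, etype):
    """Return the BIOES tag layer for one entity covering tokens start..end (inclusive)."""
    tags = ["O"] * length
    if start == end:
        tags[start] = f"S-{etype}"
    else:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
        tags[end] = f"E-{etype}"
    return tags


def merge_nested(length, entities):
    """Merge per-entity tag layers; positions covered by several entities are joined with '|'."""
    merged = [[] for _ in range(length)]
    for start, end, etype in entities:
        for i, tag in enumerate(bioes_tags(length, start, end, etype)):
            if tag != "O":
                merged[i].append(tag)
    return ["|".join(layer) if layer else "O" for layer in merged]


# Assumed four-token phrase: "two lung" (site) nested inside "two lung nodules" (lesion).
tags = merge_nested(4, [(0, 1, "site"), (0, 3, "lesion")])
# tags == ["B-site|B-lesion", "E-site|I-lesion", "I-lesion", "E-lesion"]
```

The first two merged labels match the combined labels given in the example above; the inner entity is listed first so that its tag appears on the left of the "|" separator.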
MC-BERT is a BERT natural language understanding model trained on large-scale Chinese medical corpora, such as Chinese medical question answering, Chinese medical encyclopedias, and Chinese electronic medical records, so a large amount of medical knowledge has already been explicitly injected into the model. MC-BERT is then fine-tuned on the uniformly sequence-labeled data, with named entity recognition as the learning objective, to obtain a new language model after domain transfer that is better suited to information extraction tasks.
Step S2: preprocess the clinical medical text data by cleaning it and cutting long texts; segment the text with the vocabulary file that ships with the BERT model to obtain a token set of length N, and construct an N × N span matrix to encode entity labels; with the fine-tuned MC-BERT used as the embedding, obtain the entity type logits score of the text fragment corresponding to each element of the span matrix, and regard a fragment whose score exceeds the threshold α as a valid entity.
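The cleaning, cutting, and segmentation in step S2 can be sketched in plain Python. This is a simplified stand-in, not the BERT tokenizer: it keeps runs of Latin letters and runs of digits together instead of applying WordPiece with a vocabulary file, and it ignores the special tokens that a real BERT input also counts toward its length limit. All function names and the default `max_len=512` are assumptions for illustration.

```python
import re


def clean(text):
    """Drop non-printable characters and the Unicode replacement character."""
    return "".join(ch for ch in text if ch.isprintable() and ch != "\ufffd")


def split_long_text(text, max_len=512):
    """Cut a long text into consecutive passages of at most max_len characters."""
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]


def simple_tokenize(text):
    """Character-level tokens for CJK text; letter runs and digit runs stay whole."""
    return re.findall(r"[A-Za-z]+|\d+|\S", text)


passages = split_long_text(clean("右肺占位\x00"))
tokens = simple_tokenize(passages[0])
# tokens == ["右", "肺", "占", "位"]
```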
Specifically, the clinical medical text data is preprocessed by removing illegal garbled characters; if the text is longer than 512, the upper limit supported by BERT, the long text is cut into several passages of length 512. Using the vocab.txt file that ships with BERT, Chinese characters in the medical text are segmented at character granularity, while English characters and digits are segmented into sub-words. The resulting token set of length N is used to construct an N × N span matrix, which covers every possible fragment of the input text and therefore ensures that nested entities are no longer missed. For example, the text "right lung occupation" shown in FIG. 4 is segmented into four tokens, giving a 4 × 4 token span matrix. In span[0][1] = bod, the index [0][1] is the start-end range of the corresponding text, namely "right lung", whose entity type is body part ("bod"); in span[0][3] = dis, the index [0][3] is the start-end range of "right lung occupation", whose entity type is "dis"; all other, non-entity elements are set to 0. With the fine-tuned MC-BERT used as the embedding, the token-set encoding vector X_N is obtained, and two representations are derived from it through nonlinear transformations [their symbols are not reproduced]. The inner product of the two is taken as the logits of the span matrix and is used to evaluate the entity type score of the text fragment corresponding to span[start][end]; a fragment whose score exceeds the threshold α, set empirically to 0.5, is regarded as a valid entity.
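The span matrix encoding and the threshold-based decoding can be sketched as follows. The sketch assumes a single score per span rather than one score per entity type, and it takes the start and end token representations as plain lists of floats; the function names, the label ids, and the hand-written scores are illustrative assumptions, not the patent's implementation.

```python
def build_span_matrix(n, entities, label_ids):
    """Encode gold entities as an n x n matrix; 0 marks non-entity spans."""
    span = [[0] * n for _ in range(n)]
    for start, end, label in entities:
        span[start][end] = label_ids[label]
    return span


def span_scores(start_vecs, end_vecs):
    """Score span (s, e) as the inner product of a start and an end representation."""
    return [[sum(a * b for a, b in zip(s, e)) for e in end_vecs] for s in start_vecs]


def decode_spans(tokens, scores, alpha=0.5):
    """Keep spans with start <= end whose score exceeds the threshold alpha."""
    found = []
    for start in range(len(tokens)):
        for end in range(start, len(tokens)):
            if scores[start][end] > alpha:
                found.append(("".join(tokens[start:end + 1]), start, end))
    return found


tokens = ["右", "肺", "占", "位"]  # assumed four-character form of "right lung occupation"
label_ids = {"bod": 1, "dis": 2}   # hypothetical ids; 0 is reserved for non-entity
gold = build_span_matrix(4, [(0, 1, "bod"), (0, 3, "dis")], label_ids)

# Hand-written scores standing in for model output.
scores = [[0.0] * 4 for _ in range(4)]
scores[0][1], scores[0][3], scores[2][0] = 0.9, 0.8, 0.99
entities = decode_spans(tokens, scores)
# entities == [("右肺", 0, 1), ("右肺占位", 0, 3)]
```

Because both the inner span and the enclosing span have their own matrix cell, the nested pair is recovered without any label merging; the cell at [2][0] is ignored since its start lies after its end.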
Step S3: perform relation classification on entity pairs that have medical relations, using a multi-class classifier based on a fully connected layer, and extract the medical entity relations.
Specifically, the labeled medical entities are paired to build a training set: entity pairs with a factual medical relation are defined as positive samples, and randomly sampled entity pairs without a medical relation are defined as negative samples, which ensures that the model only classifies entity pairs that have a factual medical relation. The relation between the entities of a pair is determined by formula (1) [formula not reproduced], where M denotes the total number of entity relation categories, p_i denotes the context vector of the i-th entity pair, d_i denotes the relative distance feature vector of the i-th entity pair, and the symbol ∘ denotes vector concatenation.
The context vector of an entity pair is given by formula (2) [formula not reproduced]. Its terms are the head and tail (start and end) feature vectors of the head entity in the i-th entity pair and the head and tail feature vectors of the tail entity in the i-th entity pair, all of which are taken from the token-set encoding vector X_N.
The relative distance feature vector of an entity pair is:
$$d_i = \mathrm{Linear}\left(\lvert s_{i2} - e_{i1} \rvert\right) \tag{3}$$
where $s_{i2}$ and $e_{i1}$ denote the BERT position-embedding feature vectors of the tail entity and the head entity of the i-th entity pair, respectively. The absolute value of their difference represents the relative position of the two medical entities in the pair, and Linear(·) denotes a further nonlinear mapping of the pair's position vector through a fully connected layer. The mapped position vector keeps the same dimension as the entity vector, and feature fusion is completed by concatenation.
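A distance-aware classification step of this kind can be sketched in plain Python. The distance feature follows formula (3); the final scoring, a fully connected layer followed by softmax over the M relation categories, is an assumption, since the text names a fully connected multi-class classifier but the exact form of formula (1) is not available here. All function names and the toy weights are illustrative.

```python
import math


def linear(x, weight, bias):
    """Fully connected layer y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def distance_feature(s_tail, e_head, w_d, b_d):
    """d_i = Linear(|s_i2 - e_i1|), as in formula (3)."""
    return linear([abs(a - b) for a, b in zip(s_tail, e_head)], w_d, b_d)


def classify_pair(p, d, w_c, b_c):
    """Concatenate p_i with d_i and score M relation categories (assumed softmax head)."""
    return softmax(linear(p + d, w_c, b_c))


# Toy values: 2-dimensional vectors and M = 3 relation categories.
p = [1.0, 0.0]
d = distance_feature([3.0, 1.0], [1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
w_c = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
probs = classify_pair(p, d, w_c, [0.0, 0.0, 0.0])
best = probs.index(max(probs))
```

In the toy setup the distance vector has the same dimension as the context vector, which mirrors the requirement that the mapped position vector and the entity vector keep consistent dimensions before concatenation.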
Step S4: traverse the extracted medical entities and medical entity relations, remove medical entities whose text is too long, display and store entity pairs that have a medical relation in the format {head entity-medical relation, tail entity}, and display and store independently existing medical entities in the format {entity type, entity value}. For example, from the text "CT examination of the patient in January 2020 showed double lung nodules", steps S2 and S3 extract (date, January 2020), (examination means, CT), and (lesion, double lung nodules). The relation between "examination means" and "date" is "examination date", so the pair is formatted as {CT-examination date, January 2020}; the entity "lesion" exists independently and has no medical relation with other entities, so it is formatted as {lesion, double lung nodules}.
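The fusion and formatting in step S4 can be sketched as follows. The sketch assumes entities arrive as (type, value) tuples and relations as (head value, relation, tail value) triples; the function name `fuse_results` and the length limit `max_len=30` are hypothetical, since the text does not say how long is "too long".

```python
def fuse_results(entities, relations, max_len=30):
    """Format related pairs and stand-alone entities as described in step S4."""
    # Remove entities whose text exceeds the (assumed) length limit.
    kept = [(etype, value) for etype, value in entities if len(value) <= max_len]
    kept_values = {value for _, value in kept}

    lines, related = [], set()
    for head, relation, tail in relations:
        if head in kept_values and tail in kept_values:
            lines.append(f"{{{head}-{relation}, {tail}}}")
            related.update((head, tail))

    # Entities that take part in no relation are listed on their own.
    for etype, value in kept:
        if value not in related:
            lines.append(f"{{{etype}, {value}}}")
    return lines


entities = [
    ("date", "January 2020"),
    ("examination means", "CT"),
    ("lesion", "double lung nodules"),
]
relations = [("CT", "examination date", "January 2020")]
result = fuse_results(entities, relations)
# result == ["{CT-examination date, January 2020}", "{lesion, double lung nodules}"]
```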
In summary, the invention provides a medical text structuring method, which can automatically perform structuring extraction on an input medical text to obtain a large number of professional medical entities and relations, and remarkably improve the efficiency and quality of medical clinical scientific research.
Example 2: referring to FIG. 2, this embodiment provides a medical text structuring device. Its functional modules are described in detail as follows:
a data preprocessing module, used for cleaning the input medical text;
a medical entity extraction module, used for feeding the cleaned medical text into the fine-tuned natural language recognition model and extracting the text fragments corresponding to medical entities;
specifically, the medical entity extraction module uses the domain-transferred medical pre-trained model MC-BERT as the embedding to judge whether the text range corresponding to each index of the token span matrix is a predefined medical entity;
a medical entity relation extraction module, used for extracting the factual relations between medical entity pairs with a distance-aware relation classifier;
specifically, the medical entity relation extraction module trains the model on the constructed positive and negative sample pairs, incorporates entity position feature vectors during learning, and identifies relations between entities with a multi-class classifier; and
a two-stage result fusion module, used for fusing and displaying the medical entity and medical entity relation results.
Further, the medical text structuring device also comprises a labeling module, used for labeling entities and relations in the clinical medical text data.
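The way the four modules hand data to one another can be sketched as a small pipeline. This only illustrates the wiring: each module is represented by a placeholder callable, and the class name, field names, and type aliases are hypothetical rather than part of the described device.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Entity = Tuple[str, str]          # (entity type, entity value)
Relation = Tuple[str, str, str]   # (head value, relation, tail value)


@dataclass
class StructuringDevice:
    """Chains preprocessing, entity extraction, relation extraction, and fusion."""

    preprocess: Callable[[str], str]
    extract_entities: Callable[[str], List[Entity]]
    extract_relations: Callable[[str, List[Entity]], List[Relation]]
    fuse: Callable[[List[Entity], List[Relation]], List[str]]

    def run(self, text: str) -> List[str]:
        cleaned = self.preprocess(text)
        entities = self.extract_entities(cleaned)
        relations = self.extract_relations(cleaned, entities)
        return self.fuse(entities, relations)


# Stub modules standing in for the trained models.
device = StructuringDevice(
    preprocess=lambda t: t.strip(),
    extract_entities=lambda t: [("lesion", t)],
    extract_relations=lambda t, ents: [],
    fuse=lambda ents, rels: [f"{{{k}, {v}}}" for k, v in ents],
)
output = device.run("  double lung nodules ")
```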
The above-described embodiments are intended to illustrate the present invention, not to limit it, and any modifications and variations made to the present invention within the spirit of the present invention and the scope of the appended claims should be construed as falling within the scope of the present invention.

Claims (6)

1. A medical text data processing method, the method comprising:
constructing a training set from acquired public medical information extraction datasets, and fine-tuning the Chinese medical pre-trained model MC-BERT to complete the domain transfer of its parameters;
segmenting a clinical medical text based on the fine-tuned MC-BERT to obtain a token set of length N, and constructing an N × N matrix, where N is a natural number; then feeding the segmented medical text into MC-BERT to obtain encoding vectors, inferring the text range corresponding to each medical entity from the position coordinates of the matrix, and extracting the medical entities;
performing relation classification on entity pairs that have medical relations, using a multi-class classifier based on a fully connected layer, and extracting the medical entity relations;
fusing the results of the extracted medical entities and medical entity relations;
wherein the Chinese medical pre-trained model is fine-tuned as follows: all collected public medical information extraction datasets are sequence-labeled with the BIOES scheme, where B-Type marks the beginning of an entity, I-Type the middle of an entity, O a non-entity part, E-Type the end of an entity, and S-Type a single-token entity, and Type denotes the corresponding medical entity type; when an entity of another type, Type-b, is nested inside a medical entity of type Type-a, the two nested entity types are combined pairwise by merging label layers, producing a new entity type label Type-a|Type-b; MC-BERT is fine-tuned on the uniformly sequence-labeled data, with named entity recognition as the learning objective, to obtain a new language model after domain transfer;
wherein extracting the medical entities specifically comprises: preprocessing the clinical medical text data by cleaning it and cutting long texts; segmenting the text with the vocabulary file of the BERT model to obtain a token set of length N, and constructing an N × N span matrix to encode entity labels, where an element span[start][end] = C of the matrix means that [start][end] is the start-end range of the text corresponding to a medical entity and C is the entity category, with C = 0 denoting non-entity text; with the fine-tuned MC-BERT used as the embedding, obtaining the entity type logits score of the text fragment corresponding to span[start][end], a fragment whose score exceeds the threshold α being regarded as a valid entity;
and determining the relationship between the labeled valid entities through the following formula:
wherein M denotes the total number of entity relationship categories, $p_i$ denotes the context vector of the i-th entity pair, $d_i$ denotes the relative distance feature vector of the i-th entity pair, and the concatenation operator denotes the vector concatenation (cascading) operation;
the context vector of the entity pair is:
wherein $s_{i1}$ and $e_{i1}$ denote the start and end feature vectors of the head entity in the i-th entity pair, and $s_{i2}$ and $e_{i2}$ denote the start and end feature vectors of the tail entity in the i-th entity pair; the feature vectors are obtained from the token set encoding vectors $X_N$; and by constructing positive and negative samples, the model is guided to learn the implicit relationships between medical entity pairs, so that the model judges only entity pairs that have a factual medical relationship.
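The entity-side steps of claim 1 (BIOES labeling with merged labels for nested types, and the span matrix with threshold α) can be sketched roughly as follows. This is only an illustrative sketch, not the patented implementation: the function names are hypothetical, the rule that a nested entity overwrites the tokens it shares with its outer entity is one possible reading of the label-merging step, and the per-span scores are assumed to be supplied by the fine-tuned model rather than computed here.

```python
def bioes_prefixes(start, end):
    """BIOES position prefixes for an entity covering tokens start..end (inclusive)."""
    if start == end:
        return ["S"]
    return ["B"] + ["I"] * (end - start - 1) + ["E"]


def tag_sequence(n_tokens, entities):
    """Assign one BIOES tag per token.

    entities: (start, end, type) triples with inclusive token indices.
    Assumption: a nested entity overwrites the tokens it shares with its
    outer entity, using the merged type "Type-a|Type-b" (outer type first).
    """
    tags = ["O"] * n_tokens
    outer_type = [None] * n_tokens
    # Longest spans first, so an outer entity is written before anything nested in it.
    for start, end, etype in sorted(entities, key=lambda e: e[0] - e[1]):
        for offset, prefix in enumerate(bioes_prefixes(start, end)):
            pos = start + offset
            merged = etype if outer_type[pos] is None else f"{outer_type[pos]}|{etype}"
            tags[pos] = f"{prefix}-{merged}"
            if outer_type[pos] is None:
                outer_type[pos] = etype
    return tags


def build_span_matrix(n_tokens, entities):
    """N x N label matrix: span[start][end] = C, with C = 0 for non-entity text."""
    span = [[0] * n_tokens for _ in range(n_tokens)]
    for start, end, category in entities:
        span[start][end] = category
    return span


def decode_spans(scores, alpha):
    """Keep spans whose best entity-category score exceeds the threshold alpha.

    scores[start][end] is a list of per-category scores (index 0 = non-entity);
    in practice these would come from the fine-tuned model (assumed input).
    """
    found = []
    n = len(scores)
    for start in range(n):
        for end in range(start, n):
            cat_scores = scores[start][end]
            best = max(range(1, len(cat_scores)), key=lambda c: cat_scores[c])
            if cat_scores[best] > alpha:
                found.append((start, end, best))
    return found
```

For a six-token text with a Symptom entity over tokens 0 to 5 and a Body entity nested over tokens 0 to 3, `tag_sequence` labels the first four tokens with the merged type `Symptom|Body` and the last two with `Symptom`.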
2. The method of claim 1, wherein the public medical information extraction datasets are the CHIP2020 Chinese medical text named entity recognition and Chinese medical entity relationship extraction datasets, and the CCKS2020 medical named entity recognition and medical entity and attribute extraction datasets.
3. The medical text data processing method according to claim 1, wherein the labeled entity type tags mainly comprise: specific part of the affected area (Body part), obvious patient indicators (Symptom), growth and development indicators (BMI), specific position of the affected area (Direction), disease name (Disease), sampling data (Sample), disease progress (Change), attribute (Feature), stimulus element (INCENTIVE), time (Time) and disease stage (Degree); wherein a "-" sign is prefixed to the entity type of a labeled symptom to indicate that the patient does not have that symptom or sign, the relationships between entities are expressed as ordered pairs, and the specific labeling method comprises the following steps:
extracting entities from the medical information using named entity recognition and relation extraction techniques on the collected public medical information, and marking negative symptoms;
taking the specific part of the affected area, obvious patient indicators, growth and development indicators and sampling data as entities, and determining the attributes corresponding to each entity;
extracting the specific position and attribute features of the affected area based on whether obvious affected-area indicators are present;
extracting the time, sampling data, disease stage, disease progress and stimulus elements based on whether obvious patient indicators are present;
extracting the disease progress and stimulus elements based on whether obvious disease indicators are present;
extracting attribute features and stimulus elements based on whether sampling data is present;
and merging and de-duplicating the extracted entities and attributes.
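Two of the labeling steps above, marking negative symptoms with a "-" prefix and merging and de-duplicating the extracted items, can be illustrated with a small sketch. The helper names and the (type, text) tuple layout are assumptions made for illustration; the claim does not prescribe a data structure.

```python
def mark_negative(entity_type, negated):
    """Prefix the entity type with "-" when the patient does not have the symptom."""
    return f"-{entity_type}" if negated else entity_type


def merge_and_dedupe(*batches):
    """Merge several lists of (type, text) items, keeping the first occurrence of each."""
    seen = set()
    merged = []
    for batch in batches:
        for item in batch:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged


# Items produced by two different extraction steps (toy data).
from_symptoms = [("Symptom", "cough"), ("Body part", "left lung")]
from_samples = [("Body part", "left lung"), (mark_negative("Symptom", True), "fever")]
merged = merge_and_dedupe(from_symptoms, from_samples)
```

Keeping insertion order makes the merged list reproducible, which is convenient when the result is later reviewed by annotators.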
4. The method of claim 1, wherein the relative distance feature vector between the entity pair is:
$d_i = \mathrm{Linear}(|s_{i2} - e_{i1}|)$
In the formula, $s_{i2}$ and $e_{i1}$ denote the feature vectors, in the BERT position encoding, of the tail entity and the head entity of the i-th entity pair respectively; the absolute value of the difference between the two vectors represents the relative positional relationship of the two medical entities in the entity pair; and the Linear(·) function denotes a further nonlinear mapping of the position vector of the entity pair through a fully connected layer.
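The relation-side steps (constructing positive and negative entity-pair samples, computing the distance feature, and classifying over M relation categories) might look like the sketch below. Several parts are assumptions: the classification formula of claim 1 is not reproduced in this text, so a softmax over a fully connected layer applied to the concatenation of the context vector and the distance feature is assumed; the `NO_RELATION` id, the function names and the toy weights are hypothetical; and the context vector `p` is taken as a given input.

```python
import math
from itertools import permutations

NO_RELATION = 0  # assumed label id for pairs without a factual medical relationship


def build_pair_samples(entities, gold_relations):
    """Every ordered entity pair becomes a sample; pairs absent from gold are negatives."""
    return [
        (head, tail, gold_relations.get((head, tail), NO_RELATION))
        for head, tail in permutations(entities, 2)
    ]


def linear(x, weight, bias):
    """Fully connected layer y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def distance_feature(s_tail, e_head, weight, bias):
    """d_i = Linear(|s_i2 - e_i1|), using the element-wise absolute difference."""
    diff = [abs(a - b) for a, b in zip(s_tail, e_head)]
    return linear(diff, weight, bias)


def softmax(logits):
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def classify_relation(p, d, weight, bias):
    """Probabilities over M relation categories from the concatenation of p_i and d_i."""
    return softmax(linear(p + d, weight, bias))
```

Ordered pairs are used because the claims describe relationships as ordered (head, tail) pairs, so (a, b) and (b, a) are distinct samples.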
5. The medical text data processing method according to claim 1, wherein the extracted medical entities and medical entity relationships are traversed, medical entities whose text is too long are removed, entity pairs having a medical relationship are visualized and stored in {head entity-medical relationship, tail entity} format, and medical entities that exist independently are visualized and stored in {entity type, entity value} format.
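A minimal sketch of this fusion step is given below. The length limit `max_len`, the dictionary keys and the function name are hypothetical choices for illustration; the claim only states that overlong entities are removed and that related pairs and standalone entities are stored in separate formats.

```python
def fuse_results(entities, relations, max_len=20):
    """Fuse entity and relation outputs.

    entities: list of (entity_type, entity_value)
    relations: list of (head_value, relation, tail_value)
    max_len: assumed cut-off for "overlong" entity text (not given in the source)
    """
    kept = [(etype, value) for etype, value in entities if len(value) <= max_len]
    kept_values = {value for _, value in kept}

    # Entity pairs with a medical relationship, dropping pairs that lost an endpoint.
    triples = [
        {"head": head, "relation": rel, "tail": tail}
        for head, rel, tail in relations
        if head in kept_values and tail in kept_values
    ]

    # Entities that take part in no relationship are stored on their own.
    linked = {t["head"] for t in triples} | {t["tail"] for t in triples}
    standalone = [
        {"type": etype, "value": value} for etype, value in kept if value not in linked
    ]
    return triples, standalone
```

Filtering relations against the kept entities avoids storing a triple whose head or tail was removed as overlong.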
6. A medical text data processing apparatus, comprising:
a data preprocessing module, used for cleaning the input medical text;
a medical entity extraction module, used for inputting the cleaned medical text into the fine-tuned natural language recognition model and extracting the text fragments corresponding to medical entities;
a medical entity relationship extraction module, used for extracting the factual relationships between medical entity pairs using a distance-aware relation classifier;
a two-stage result fusion module, used for fusing the results of the medical entities and the medical entity relationships and displaying them;
wherein the apparatus performs and implements the medical text data processing method according to any one of claims 1 to 5.
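The way the four modules hand data to one another can be pictured with the following sketch. The class name, the callable signatures and the stub functions in the usage example are all hypothetical; the real modules would wrap the fine-tuned model and the relation classifier.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MedicalTextPipeline:
    """Chains the four modules of the apparatus (illustrative wiring only)."""
    preprocess: Callable[[str], str]
    extract_entities: Callable[[str], list]
    extract_relations: Callable[[str, list], list]
    fuse: Callable[[list, list], dict]

    def run(self, text):
        cleaned = self.preprocess(text)
        entities = self.extract_entities(cleaned)
        relations = self.extract_relations(cleaned, entities)
        return self.fuse(entities, relations)


# Usage with trivial stand-in modules (placeholders, not real extractors).
pipeline = MedicalTextPipeline(
    preprocess=lambda text: " ".join(text.split()),
    extract_entities=lambda text: [("Symptom", w) for w in text.split() if w == "cough"],
    extract_relations=lambda text, entities: [],
    fuse=lambda entities, relations: {"entities": entities, "relations": relations},
)
result = pipeline.run("  dry   cough  ")
```

Passing the modules in as callables keeps each stage replaceable, which matches the modular description in the claim.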
CN202310478699.1A 2023-04-27 2023-04-27 Medical text data processing method and device Active CN116737924B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310478699.1A CN116737924B (en) 2023-04-27 2023-04-27 Medical text data processing method and device

Publications (2)

Publication Number Publication Date
CN116737924A CN116737924A (en) 2023-09-12
CN116737924B true CN116737924B (en) 2024-06-25

Family

ID=87912216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310478699.1A Active CN116737924B (en) 2023-04-27 2023-04-27 Medical text data processing method and device

Country Status (1)

Country Link
CN (1) CN116737924B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117240916B (en) * 2023-11-14 2024-02-13 阿里健康科技(中国)有限公司 Method for transmitting and storing structured medical data and related device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818676A (en) * 2021-02-02 2021-05-18 东北大学 Medical entity relationship joint extraction method

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4129048B2 (en) * 2005-06-15 2008-07-30 松下電器産業株式会社 Named entity extraction apparatus, method, and program
US20090249182A1 (en) * 2008-03-31 2009-10-01 Iti Scotland Limited Named entity recognition methods and apparatus
US10133728B2 (en) * 2015-03-20 2018-11-20 Microsoft Technology Licensing, Llc Semantic parsing for complex knowledge extraction
RU2619193C1 (en) * 2016-06-17 2017-05-12 Общество с ограниченной ответственностью "Аби ИнфоПоиск" Multi stage recognition of the represent essentials in texts on the natural language on the basis of morphological and semantic signs
US20180025121A1 (en) * 2016-07-20 2018-01-25 Baidu Usa Llc Systems and methods for finer-grained medical entity extraction
CN107977368B (en) * 2016-10-21 2021-12-10 京东方科技集团股份有限公司 Information extraction method and system
US20190006027A1 (en) * 2017-06-30 2019-01-03 Accenture Global Solutions Limited Automatic identification and extraction of medical conditions and evidences from electronic health records
US20210375488A1 (en) * 2020-05-29 2021-12-02 Medius Health System and methods for automatic medical knowledge curation
CN112989835B (en) * 2021-04-21 2021-10-08 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Extraction method of complex medical entities
AU2021106425A4 (en) * 2021-08-22 2021-11-04 Honghai Feng Method, system and apparatus for extracting entity words of diseases and their corresponding laboratory indicators from Chinese medical texts
CN114036934A (en) * 2021-10-15 2022-02-11 浙江工业大学 Chinese medical entity relation joint extraction method and system
CN114692636B (en) * 2022-03-09 2023-11-03 南京海泰医疗信息系统有限公司 Nested named entity identification method based on relationship classification and sequence labeling
CN114637852B (en) * 2022-04-24 2023-12-08 四川医枢科技有限责任公司 Entity relation extraction method, device, equipment and storage medium of medical text
CN115510242A (en) * 2022-10-04 2022-12-23 河南科技大学 Chinese medicine text entity relation combined extraction method
CN115879473B (en) * 2022-12-26 2023-12-01 淮阴工学院 Chinese medical named entity recognition method based on improved graph attention network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
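Named entity recognition for Chinese electronic medical records based on BERT; 李灵芳; 杨佳琦; 李宝山; 杜永兴; 胡伟健; 内蒙古科技大学学报 (Journal of Inner Mongolia University of Science and Technology); 2020-03-15 (No. 01); full text *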

Also Published As

Publication number Publication date
CN116737924A (en) 2023-09-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant