CN117993391B - Medical named entity recognition and clinical term standardization method and device - Google Patents

Info

Publication number
CN117993391B
Authority
CN
China
Prior art keywords
data
language model
named entity
training
entity identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410406655.2A
Other languages
Chinese (zh)
Other versions
CN117993391A (en)
Inventor
葛承泽
张奇
王实
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Huimeiyun Technology Co ltd
Original Assignee
Beijing Huimeiyun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Huimeiyun Technology Co ltd
Priority to CN202410406655.2A
Publication of CN117993391A
Application granted
Publication of CN117993391B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention relates to a medical named entity recognition and clinical term standardization method and device. The method comprises the following steps: acquiring labeled named entity recognition data and labeled clinical term standardization data, and converting both into a generative data format in the form of question-answer pairs to obtain training data; performing parameter fine-tuning on a large language model through the training data, and deploying the fine-tuned large language model when its output data reaches a set expectation; and acquiring a current medical record text and taking it as the input of the fine-tuned large language model to output the entities in the current medical record text and the standardized terms corresponding to those entities. By merging the medical named entity recognition and clinical term standardization tasks into a single Natural Language Generation (NLG) task, the method achieves a good model prediction effect and reduces the complexity of model training, deployment and maintenance.

Description

Medical named entity recognition and clinical term standardization method and device
Technical Field
The invention relates to the technical field of medical language processing, in particular to a method and a device for identifying medical named entities and standardizing clinical terms.
Background
Medical Named Entity Recognition (MedNER for short) is a task in the field of Natural Language Processing (NLP) that aims to recognize and extract named entities with specific semantics from medical text. These named entities include various medical terms such as diseases, symptoms, medications, treatment methods and medical devices. Medical texts typically contain a large number of specialized terms, so automated named entity recognition on these texts is of great importance for medical information extraction, clinical research and medical knowledge management.
Clinical term standardization refers to normalizing the various terms, expressions and medical data used in the medical and clinical fields so as to achieve consistency and interoperability among different medical systems, institutions and platforms. The technique aims to resolve the confusion, misunderstanding or inconsistency that may arise when medical information is exchanged, and to improve the quality, manageability and usability of medical data. With the rapid development of hospital informatization, many systems now rely on term standardization, such as clinical decision support systems (CDSS), DRG medical insurance management systems and single-disease reporting systems. These systems face numerous non-standardized medical entities and are difficult to apply without term standardization.
In the prior art, at least two steps are required to identify standardized terms from electronic medical records: named entities are recognized first, and term standardization is then performed. The main prior-art method for medical named entity recognition encodes the text with a deep neural network, commonly a BERT model, and then decodes it with a CRF (conditional random field). Taking surgery as an example, the term standardization task is to find, among all ICD standard words, the standard word a' corresponding to the original surgical word a. The main prior-art method for clinical term standardization is a two-stage strategy of recall followed by fine ranking. For each original word, a candidate set of a specified size k (k may be 5, 10, 20, etc.) is recalled from all ICD candidate words using the BM25 algorithm or the Jaccard algorithm. Fine ranking is then performed, mainly by semantic matching based on text embeddings: the similarity between the embedding of the original word and the embedding of each recalled candidate is computed, and the ICD standard word with the highest similarity is taken as the final output result.
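The two-stage prior-art strategy described above (recall, then fine ranking) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, the candidate list is a handful of made-up English surgical terms rather than real ICD entries, recall uses character-set Jaccard similarity, and the "embedding" is a toy character-bigram count vector standing in for a learned text embedding.

```python
from collections import Counter
from math import sqrt


def jaccard(a: str, b: str) -> float:
    """Character-set Jaccard similarity between two strings."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def recall(original: str, standard_terms: list[str], k: int) -> list[str]:
    """Recall stage: keep the k standard terms most similar to the original term."""
    ranked = sorted(standard_terms, key=lambda t: jaccard(original, t), reverse=True)
    return ranked[:k]


def embed(text: str) -> Counter:
    """Toy stand-in for a text embedding: character-bigram counts (assumption)."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def normalize_term(original: str, standard_terms: list[str], k: int = 5) -> str:
    """Fine-ranking stage: return the recalled candidate with the highest similarity."""
    candidates = recall(original, standard_terms, k)
    query = embed(original)
    return max(candidates, key=lambda t: cosine(query, embed(t)))


# Illustrative candidates only; these are not taken from an ICD code list.
terms = ["appendectomy", "cholecystectomy", "gastrectomy", "tonsillectomy"]
print(normalize_term("laparoscopic appendectomy", terms, k=2))  # appendectomy
```

Note that this pipeline always returns exactly one standard word per original word, which is one reason the one-to-many case is awkward for it.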
However, with existing entity recognition and term standardization methods, identifying standard terms from electronic medical records requires several models connected in series, which makes model training, deployment and maintenance highly complex. In addition, these methods often predict poorly on the one-to-many problem of term standardization, and they handle the long-tail problem of entity recognition and term standardization poorly.
Disclosure of Invention
In view of the above technical problems, it is necessary to provide a medical named entity recognition and clinical term standardization method and device that are simple in model training, deployment and maintenance and that have a good prediction effect.
The invention provides a medical named entity identification and clinical term standardization method, which comprises the following steps:
Acquiring named entity identification data with labels and clinical term standardized data with labels, and converting the named entity identification data and the clinical term standardized data into a generated data format in a question-answer pair form to obtain training data;
Performing parameter fine-tuning on the large language model through the training data, and deploying the fine-tuned large language model when the output data of the large language model reaches a set expectation;
And acquiring a current medical record text, and taking the current medical record text as the input of the fine-tuned large language model to output the entity in the current medical record text and the standardized term corresponding to the entity.
In one embodiment, the acquiring the named entity identification data with labels and the clinical term standardized data with labels, and converting the named entity identification data and the clinical term standardized data into a generated data format in a form of question-answer pairs, to obtain training data, includes:
acquiring the named entity identification data and clinical term standardized data, and respectively marking the named entity identification data and the clinical term standardized data to obtain marked named entity identification data and marked clinical term standardized data;
And identifying symptom entities in the named entity identification data and the clinical term standardization data based on the named entity identification data with labels and the clinical term standardization data with labels.
In one embodiment, the obtaining named entity identification data with labels and clinical term standardized data with labels, and converting the named entity identification data and the clinical term standardized data into a generated data format in a form of question-answer pairs, to obtain training data includes:
Converting the named entity identification data and the clinical term standardization data into a generated data format in a question-answer pair form based on symptom entities in the named entity identification data and the clinical term standardization data;
and converting the named entity identification task and the clinical term standardization task to be processed into a text generation task according to a generated data format between the named entity identification data and the clinical term standardization data.
In one embodiment, the acquiring the named entity identification data with labels and the clinical term standardized data with labels, and converting the named entity identification data and the clinical term standardized data into a generated data format in a form of question-answer pairs, to obtain training data, further includes:
acquiring a corresponding relation between the symptom entity in the named entity identification data and the symptom entity in the clinical term standardization data through the text generation task;
Acquiring the training data according to the corresponding relation between the symptom entity in the named entity identification data and the symptom entity in the clinical term standardization data;
The training data comprises symptom entities in named entity identification data, the symptom entities in clinical term standardization data and the corresponding relations.
In one embodiment, the performing parameter fine-tuning on the large language model through the training data, and deploying the fine-tuned large language model when the output data of the large language model reaches a set expectation, includes:
Adding a bypass on one side of a pre-training language model, and fixing model parameters of the pre-training language model during model training to train a dimension reduction matrix and a dimension increase matrix in the bypass, wherein the input and output dimensions of the pre-training language model are unchanged;
and calling the pre-training language model to superimpose the dimension reduction matrix and the dimension increase matrix with the model parameters of the pre-training language model so as to obtain superimposed model parameters.
In one embodiment, the performing parameter fine-tuning on the large language model through the training data, and deploying the fine-tuned large language model when the output data of the large language model reaches a set expectation, further includes:
Initializing the dimension reduction matrix through a random Gaussian distribution, and initializing the dimension increase matrix with a zero matrix, so that the bypass matrix is the zero matrix at the start of training of the pre-training language model;
And taking the ChatGLM-6B open source model as a model base, fine-tuning on the training data using the LoRA parameter-efficient fine-tuning method, and combining the superimposed model parameters to train the large language model to obtain the fine-tuned large language model.
In one embodiment, the obtaining the current medical record text and taking the current medical record text as the input of the fine-tuned large language model to output the entity in the current medical record text and the standardized term corresponding to the entity includes:
converting the current medical record text into the question form of the generated data format of the question-answer pair form, and calling the fine-tuned large language model to process it;
and acquiring, based on the question form of the current medical record text, the entities in the current medical record text and the corresponding standardized terms output by the fine-tuned large language model.
The invention also provides a medical named entity recognition and clinical term standardization device, which comprises:
the data processing module is used for acquiring named entity identification data with labels and clinical term standardized data with labels, and converting the named entity identification data and the clinical term standardized data into a generated data format in a question-answer pair form to obtain training data;
The model fine-tuning module is used for carrying out parameter fine-tuning on the large language model through the training data, and deploying the fine-tuned large language model when the output data of the large language model reaches a set expectation;
The standardized processing module is used for acquiring the current medical record text and taking the current medical record text as the input of the fine-tuned large language model so as to output the entity in the current medical record text and the standardized term corresponding to the entity.
The invention also provides an electronic device comprising a memory and a processor, the memory storing a computer program, the processor implementing the medical named entity identification and clinical term normalization method as described in any one of the above when executing the computer program.
The invention also provides a computer storage medium storing a computer program which when executed by a processor implements a medical named entity identification and clinical term normalization method as described in any one of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements a medical named entity identification and clinical term normalization method as described in any one of the above.
The medical named entity identification and clinical term standardization method and device obtain the training data by acquiring the named entity identification data with labels and the clinical term standardization data with labels, and converting the named entity identification data and the clinical term standardization data into a generated data format in a question-answer pair form. Parameter fine-tuning is then performed on the large language model through the training data, and the fine-tuned large language model is deployed when the output data of the large language model reaches the set expectation. Finally, the current medical record text is acquired and taken as the input of the fine-tuned large language model to output the entity in the current medical record text and the standardized term corresponding to the entity. The method merges the medical named entity recognition and clinical term standardization tasks into one task, converting the two Natural Language Understanding (NLU) tasks into one Natural Language Generation (NLG) task. For the one-to-many problem of the clinical term standardization task, the recall and fine-ranking model is replaced by a generation task, which better solves the problem of predicting the number of standard terms, so that the model prediction effect is better and the complexity of model training, deployment and maintenance is reduced.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a medical named entity identification and clinical term standardization method provided by the invention;
FIG. 2 is a schematic diagram of a model deployment flow of a method for identifying a medical named entity and standardizing clinical terms in a specific embodiment provided by the invention;
FIG. 3 is a schematic diagram of a model efficient parameter tuning process of a method for identifying a medical named entity and standardizing clinical terms in a specific embodiment provided by the invention;
FIG. 4 is a second schematic flow chart of the medical named entity identification and clinical term standardization method provided by the invention;
FIG. 5 is a third schematic flow chart of the medical named entity identification and clinical term standardization method provided by the invention;
FIG. 6 is a fourth schematic flow chart of the medical named entity identification and clinical term standardization method provided by the invention;
FIG. 7 is a fifth schematic flow chart of the medical named entity identification and clinical term standardization method provided by the invention;
FIG. 8 is a sixth schematic flow chart of the medical named entity identification and clinical term standardization method provided by the invention;
FIG. 9 is a seventh schematic flow chart of the medical named entity identification and clinical term standardization method provided by the invention;
FIG. 10 is a schematic diagram of a medical named entity recognition and clinical term normalization apparatus according to the present invention;
FIG. 11 is an internal structural diagram of a computer device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The medical named entity recognition and clinical term normalization method and apparatus of the present invention are described below in connection with fig. 1-11.
As shown in fig. 1, in one embodiment, a medical named entity identification and clinical term normalization method includes the steps of:
Step S110, named entity identification data with labels and clinical term standardized data with labels are obtained, and the named entity identification data and the clinical term standardized data are converted into a generated data format in a question-answer pair form, so that training data are obtained.
Specifically, the server acquires named entity identification data with labels and clinical term standardized data with labels, and converts the named entity identification data and the clinical term standardized data into a generated data format in a question-answer pair form to obtain training data.
In a specific embodiment, as shown in fig. 2, when constructing training data, the medical named entity recognition and clinical term standardization method provided by the invention builds a generated data format in a question-answer pair form from the labeled named entity recognition data and the labeled clinical term standardization data. Training data construction is the key to merging the named entity recognition task and the clinical term standardization task in the method. For example, suppose the original text in the named entity annotation data and the clinical term standardization annotation data is "The main problem is that the stool has mucus, and a routine stool test is needed". It contains the symptom entity "stool has mucus" and the medical examination entity "routine stool test", and the symptom entity has the standard term "stool mucus".
In this embodiment, the training data in the form of a question-answer pair is constructed, for example, as follows.
Input: "Question: Perform medical named entity recognition on the following sentence and give the normalization labels of the symptom entities:
The main problem is that the stool has mucus, and a routine stool test is needed
Entity options: medicine, drug classification, medical examination, treatment, symptoms"
Output: "Answer:
Medicine: None
Drug classification: None
Medical examination: routine stool test
Treatment: None
Symptoms: stool has mucus
Normalization label of symptoms: stool mucus"
Through the above construction of training data, the named entity recognition task and the term normalization task can be converted into a text generation task.
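As a rough illustration of this construction, the following sketch turns one annotated sentence into a question-answer pair. It is only a sketch under stated assumptions: the function names, the dictionary layout of the annotations, the English prompt wording and the "None" placeholder are all hypothetical choices made here, and the example sentence is an English rendering of the one discussed above.

```python
# Entity types offered to the model, in the order they appear in the answer.
ENTITY_TYPES = ["Medicine", "Drug classification", "Medical examination", "Treatment", "Symptoms"]


def build_question(text: str) -> str:
    """Build the question half of the pair from a raw medical record sentence."""
    options = ", ".join(ENTITY_TYPES)
    return (
        "Question: Perform medical named entity recognition on the following sentence "
        "and give the normalization labels of the symptom entities:\n"
        f"{text}\n"
        f"Entity options: {options}"
    )


def build_answer(entities: dict[str, list[str]], symptom_labels: list[str]) -> str:
    """Build the answer half: one line per entity type, then the standard terms."""
    lines = ["Answer:"]
    for entity_type in ENTITY_TYPES:
        mentions = entities.get(entity_type, [])
        lines.append(f"{entity_type}: {', '.join(mentions) if mentions else 'None'}")
    labels = ", ".join(symptom_labels) if symptom_labels else "None"
    lines.append(f"Normalization label of symptoms: {labels}")
    return "\n".join(lines)


def build_qa_pair(text: str, entities: dict[str, list[str]], symptom_labels: list[str]) -> dict[str, str]:
    """Combine both halves into one training record."""
    return {"question": build_question(text), "answer": build_answer(entities, symptom_labels)}


pair = build_qa_pair(
    "The main problem is that the stool has mucus, and a routine stool test is needed",
    {"Symptoms": ["stool has mucus"], "Medical examination": ["routine stool test"]},
    ["stool mucus"],
)
print(pair["question"])
print(pair["answer"])
```

Because the normalization labels are emitted as a list, a symptom that maps to several standard terms can be represented in the same answer line.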
Step S120, performing parameter fine-tuning on the large language model through the training data, and deploying the fine-tuned large language model when the output data of the large language model reaches the set expectation.
Specifically, the server performs parameter fine-tuning on the large language model through the training data, and deploys the fine-tuned large language model when the output data of the large language model reaches a set expectation.
Referring to fig. 2 and 3, in a specific embodiment, the medical named entity recognition and clinical term standardization method provided by the invention adopts the ChatGLM-6B open source model as the model base and fine-tunes it on the training data using LoRA, a parameter-efficient fine-tuning method. A bypass is added beside the original PLM (Pre-trained Language Model) that performs a dimension-reduction operation followed by a dimension-increase operation to simulate the so-called intrinsic rank. During model training, the parameters of the PLM are fixed, and only the dimension-reduction matrix A and the dimension-increase matrix B are trained. The input and output dimensions of the model are unchanged, and at output the product BA is superimposed on the parameters of the PLM. A is initialized with a random Gaussian distribution and B is initialized with a zero matrix, which ensures that the bypass matrix is still the zero matrix at the start of training.
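The bypass can be sketched for a single linear layer as follows. This is a toy illustration in plain Python, not the ChatGLM-6B implementation: the layer size d, the rank r, the Gaussian standard deviations and the helper names are all assumptions chosen for readability.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W0, A, B, x):
    """Frozen layer output W0 x plus the low-rank bypass B (A x)."""
    base = matvec(W0, x)
    bypass = matvec(B, matvec(A, x))  # reduce to r dimensions, then expand back
    return [b + p for b, p in zip(base, bypass)]


random.seed(0)
d, r = 4, 2  # toy layer width and bypass rank (assumptions)

W0 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]   # frozen pre-trained weight
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # dimension-reduction matrix, Gaussian init
B = [[0.0] * r for _ in range(d)]                                  # dimension-increase matrix, zero init

x = [1.0, 2.0, 3.0, 4.0]
# B is zero, so BA is the zero matrix and the layer initially behaves exactly like the PLM.
assert lora_forward(W0, A, B, x) == matvec(W0, x)
```

Only A and B would receive gradient updates during fine-tuning; W0 stays fixed, and the input and output dimensions of the layer are the same with or without the bypass.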
Step S130, obtaining the current medical record text, and taking the current medical record text as the input of the fine-tuned large language model to output the entity in the current medical record text and the standardized term corresponding to the entity.
Specifically, the server acquires the medical record text currently to be standardized, namely the current medical record text, and takes it as the input data of the fine-tuned large language model to output the entities in the current medical record text and the standardized terms corresponding to the entities.
In combination with fig. 2, in a specific embodiment, after the fine-tuned large language model is obtained, the medical named entity recognition and clinical term standardization method provided by the invention only needs to construct the "question" form for a new medical record text in the same way as for the training data and input it into the fine-tuned ChatGLM-6B model. The output of the model then contains the predicted entities of each type and the standardized terms corresponding to the symptom entities.
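Once the model has generated its answer text, the entities and standard terms still have to be read back out of it. The sketch below shows one way this post-processing might look; the call to the model itself is omitted, and the function name, the "key: value" line layout, the comma separator and the "None" placeholder are assumptions that mirror the answer format used for training rather than anything the source specifies.

```python
def parse_answer(answer: str) -> dict[str, list[str]]:
    """Parse generated 'key: value' lines into a mapping from field name to values."""
    result: dict[str, list[str]] = {}
    for line in answer.splitlines():
        if ":" not in line:
            continue  # ignore lines that do not follow the expected layout
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if not value:
            continue  # skip the bare "Answer:" header line
        result[key] = [] if value == "None" else [v.strip() for v in value.split(",")]
    return result


# Hypothetical model output for the stool-mucus example sentence.
generated = (
    "Answer:\n"
    "Medicine: None\n"
    "Drug classification: None\n"
    "Medical examination: routine stool test\n"
    "Treatment: None\n"
    "Symptoms: stool has mucus\n"
    "Normalization label of symptoms: stool mucus"
)
parsed = parse_answer(generated)
print(parsed["Symptoms"], parsed["Normalization label of symptoms"])
```

A production parser would likely need to tolerate malformed generations; this sketch simply drops lines it cannot interpret.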
In this embodiment, compared with the existing methods that need two stages to obtain standardized clinical terms, the method uses a large language model to complete the medical named entity recognition and clinical term standardization tasks at the same time, so that the named entities and the standard terms are obtained together by performing the text generation task only once. Experimental results show that, with the F1 score as the evaluation index, the effect of the method is not reduced compared with BERT+CRF (named entity recognition) and BM25+BERT (term standardization): the named entity recognition result and the term standardization result are improved by about 1.2% and 0.9% respectively over the traditional models.
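For reference, the F1 score is the harmonic mean of precision and recall. The source does not state how matches were counted in its experiments, so the sketch below assumes exact matching over sets of (entity type, mention) pairs; the function name and the example items are hypothetical.

```python
def f1_score(predicted: set, gold: set) -> float:
    """F1 over exact-match items: harmonic mean of precision and recall."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = {("Symptoms", "stool has mucus"), ("Medical examination", "routine stool test")}
predicted = {("Symptoms", "stool has mucus"), ("Treatment", "routine stool test")}
print(f1_score(predicted, gold))  # 0.5
```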
According to the medical named entity identification and clinical term standardization method, the named entity identification data with labels and the clinical term standardization data with labels are obtained, and the named entity identification data and the clinical term standardization data are converted into a generated data format in a question-answer pair form, so that training data are obtained. Parameter fine-tuning is then performed on the large language model through the training data, and the fine-tuned large language model is deployed when the output data of the large language model reaches the set expectation. Finally, the current medical record text is acquired and taken as the input of the fine-tuned large language model to output the entity in the current medical record text and the standardized term corresponding to the entity. The method merges the medical named entity recognition and clinical term standardization tasks into one task, converting the two Natural Language Understanding (NLU) tasks into one Natural Language Generation (NLG) task. For the one-to-many problem of the clinical term standardization task, the recall and fine-ranking model is replaced by a generation task, which better solves the problem of predicting the number of standard terms, so that the model prediction effect is better and the complexity of model training, deployment and maintenance is reduced.
As shown in fig. 4, in one embodiment, before obtaining the named entity identification data with labels and the clinical term standardized data with labels and converting them into a generated data format in the form of question-answer pairs to obtain training data, the medical named entity identification and clinical term standardization method provided by the present invention further includes the following steps:
Step S410, named entity identification data and clinical term standardized data are obtained, and the named entity identification data and the clinical term standardized data are respectively marked, so that named entity identification data with marks and clinical term standardized data with marks are obtained.
Specifically, the server acquires named entity identification data and clinical term standardized data, marks the named entity identification data and the clinical term standardized data respectively, and obtains marked named entity identification data and marked clinical term standardized data.
Step S420, identifying symptom entities in the named entity identification data and the clinical term standardization data based on the named entity identification data with labels and the clinical term standardization data with labels.
Specifically, the server identifies the symptom entity in the named entity identification data and the clinical term standardization data based on the named entity identification data with labels and the clinical term standardization data with labels obtained in step S410.
As shown in fig. 5, in one embodiment, the method for identifying a medical named entity and standardizing clinical terms provided by the present invention obtains named entity identification data with labels and clinical term standardized data with labels, and converts the named entity identification data and the clinical term standardized data into a generated data format in the form of question-answer pairs, so as to obtain training data, and specifically includes the following steps:
Step S112, based on the named entity identification data and the symptom entity in the clinical term standardized data, the named entity identification data and the clinical term standardized data are converted into a generated data format in a question-answer pair form.
Specifically, in the process of acquiring training data, the server firstly converts the named entity identification data and the clinical term standardized data into a generated data format in a question-answer pair form based on symptom entities in the named entity identification data and the clinical term standardized data.
Step S114, converting the named entity recognition task and the clinical term standardization task to be processed into text generation tasks according to the generated data format between the named entity recognition data and the clinical term standardization data.
Specifically, the server converts the named entity recognition task and the clinical term standardization task to be processed into a text generation task according to the generated data format between the named entity recognition data and the clinical term standardization data obtained by conversion in step S112.
As shown in fig. 6, in one embodiment, the method for identifying a medical named entity and standardizing clinical terms provided by the present invention obtains named entity identification data with labels and clinical term standardized data with labels, and converts the named entity identification data and the clinical term standardized data into a generated data format in the form of question-answer pairs, so as to obtain training data, and specifically further includes the following steps:
In step S116, the correspondence between the symptom entity in the named entity recognition data and the symptom entity in the clinical term standardization data is obtained through the text generation task.
Specifically, the server acquires the correspondence between the symptom entity in the named entity recognition data and the symptom entity in the clinical term standardization data through the text generation task obtained in step S114.
Step S118, obtaining training data according to the correspondence between the symptom entity in the named entity identification data and the symptom entity in the clinical term standardization data.
Specifically, the server obtains the final training data according to the correspondence between the symptom entity in the named entity identification data and the symptom entity in the clinical term standardization data, wherein the training data includes at least the symptom entity in the named entity identification data, the symptom entity in the clinical term standardization data, and the correspondence between the two.
As shown in fig. 7, in one embodiment, the method for identifying a medical named entity and standardizing clinical terms provided by the present invention performs parameter fine-tuning on a large language model through training data, and deploys the fine-tuned large language model when output data of the large language model reaches a set expectation, and specifically includes the following steps:
Step S122, adding a bypass on one side of the pre-training language model, and fixing model parameters of the pre-training language model during model training to train a dimension reduction matrix and a dimension increase matrix in the bypass, wherein the input and output dimensions of the pre-training language model are unchanged.
Specifically, the server adds a bypass on one side of the pre-training language model (PLM), and fixes model parameters of the pre-training language model during training of the large language model so as to train a dimension reduction matrix (A) and a dimension increase matrix (B) in the bypass, and the input and output dimensions of the pre-training language model are kept unchanged.
Step S124, calling the pre-training language model to superimpose the dimension reduction matrix and the dimension increase matrix with the model parameters of the pre-training language model to obtain superimposed model parameters.
Specifically, the server calls the pre-training language model to superimpose the dimension reduction matrix, the dimension increase matrix and model parameters of the pre-training language model to obtain superimposed model parameters.
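Steps S122 and S124 can be illustrated on a single weight matrix. Writing the frozen weight as W, the dimension reduction matrix as A and the dimension increase matrix as B, the bypass adds B(Ax) to the main output Wx, and the superimposed parameters are W + BA. The NumPy sketch below uses arbitrary small dimensions chosen for illustration and omits the scaling factor that some LoRA implementations apply to the bypass.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 6, 2            # r is the low-rank bypass dimension (assumed)

W = rng.normal(size=(d_out, d_in))  # frozen pre-trained weight
A = rng.normal(size=(r, d_in))      # dimension reduction matrix: d_in -> r
B = rng.normal(size=(d_out, r))     # dimension increase matrix:  r -> d_out

x = rng.normal(size=(d_in,))

# Main path plus bypass; the output dimension is still d_out.
h_bypass = W @ x + B @ (A @ x)

# Superimposed parameters: fold the bypass into the weight once.
W_merged = W + B @ A
h_merged = W_merged @ x
```

Because `W_merged` has the same shape as `W`, the merged model keeps the original input and output dimensions and needs no extra computation at inference time.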
As shown in fig. 8, in one embodiment, the method for identifying a medical named entity and standardizing clinical terms provided by the present invention performs parameter fine adjustment on a large language model through training data, and deploys the fine-adjusted large language model when output data of the large language model reaches a set expectation, and specifically further includes the following steps:
Step S126, initializing the dimension reduction matrix with a random Gaussian distribution, and initializing the dimension increase matrix as a zero matrix, so that the bypass matrix is a zero matrix at the start of training of the pre-training language model.
Specifically, the server initializes the dimension reduction matrix with a random Gaussian distribution and initializes the dimension increase matrix as a zero matrix, so that the bypass matrix is a zero matrix at the start of training of the pre-training language model.
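The effect of this initialization can be checked directly: with the dimension increase matrix set to zero, the product of the two bypass matrices is zero, so the model output at the start of training equals that of the unmodified pre-training language model. The sketch below assumes small illustrative dimensions and a standard deviation of 0.02 for the Gaussian, neither of which is given in the text.

```python
import numpy as np

rng = np.random.default_rng(42)
d_in, d_out, r = 8, 6, 2

A = rng.normal(loc=0.0, scale=0.02, size=(r, d_in))  # random Gaussian init
B = np.zeros((d_out, r))                             # zero-matrix init

bypass = B @ A                       # initial bypass matrix

W = rng.normal(size=(d_out, d_in))   # stand-in for a frozen pre-trained weight
x = rng.normal(size=(d_in,))

h_pretrained = W @ x
h_with_bypass = W @ x + bypass @ x
```

Initializing only one of the two matrices to zero keeps the starting point unchanged while still letting gradients flow into the bypass from the first update.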
Step S128, taking the ChatGLM-6B open source model as the model base, applying LoRA parameter-efficient fine-tuning with the training data, and training the large language model in combination with the superimposed model parameters to obtain the fine-tuned large language model.
Specifically, the server takes the ChatGLM-6B open source model as the model base, applies LoRA parameter-efficient fine-tuning with the training data, and trains the large language model in combination with the superimposed model parameters to obtain the final fine-tuned large language model.
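The training principle behind step S128 can be sketched on a toy problem. The code below is not ChatGLM-6B; a single linear layer in NumPy stands in for the model base, and the synthetic data, learning rate and step count are assumptions made for illustration. It shows only that the pre-trained weight stays fixed while gradient descent updates the two bypass matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, n = 8, 6, 2, 64

W = rng.normal(size=(d_out, d_in))          # frozen "pre-trained" weight
W_frozen_copy = W.copy()

A = rng.normal(scale=0.5, size=(r, d_in))   # Gaussian init (reduction matrix)
B = np.zeros((d_out, r))                    # zero init (increase matrix)

# Synthetic task: targets come from W plus a low-rank shift.
X = rng.normal(size=(d_in, n))
delta_true = rng.normal(scale=0.5, size=(d_out, r)) @ rng.normal(scale=0.5, size=(r, d_in))
Y = (W + delta_true) @ X


def loss_and_grads(A, B):
    """Mean squared-error loss and its gradients with respect to A and B only."""
    E = (W + B @ A) @ X - Y                 # residual, shape (d_out, n)
    loss = 0.5 * np.sum(E ** 2) / n
    grad_B = E @ (A @ X).T / n
    grad_A = B.T @ E @ X.T / n
    return loss, grad_A, grad_B


initial_loss, _, _ = loss_and_grads(A, B)
lr = 0.05
for _ in range(300):
    _, grad_A, grad_B = loss_and_grads(A, B)
    A -= lr * grad_A                        # only the bypass is updated
    B -= lr * grad_B
final_loss, _, _ = loss_and_grads(A, B)
```

Only `r * (d_in + d_out)` values are trained instead of `d_in * d_out`, which is the source of the parameter efficiency when the same idea is applied to the much larger matrices of a real model.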
As shown in fig. 9, in one embodiment, the method for identifying a medical named entity and standardizing clinical terms provided by the present invention obtains a current medical record text, and uses the current medical record text as an input of the fine-tuned large language model to output the entities in the current medical record text and the standardized terms corresponding to the entities, and specifically includes the following steps:
Step S132, calling the fine-tuned large language model to process the current medical record text so as to convert the current medical record text into the question form of the question-answer pair generation data format.
Specifically, in the process of obtaining the standardized terms, the server first calls the fine-tuned large language model to process the current medical record text so as to convert the current medical record text into the question form of the question-answer pair generation data format.
Step S134, based on the question form of the current medical record text, obtaining the standardized terms corresponding to the entities in the current medical record text that are output by the fine-tuned large language model.
Specifically, the server obtains the standardized terms corresponding to the entities in the current medical record text that are output by the fine-tuned large language model, based on the question form of the current medical record text obtained in step S132.
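A hedged sketch of steps S132 and S134 follows. A fixed template stands in for the conversion to question form, the model is passed in as a plain callable, and the `mention -> term` answer format is an assumption; `fake_generate` is a hypothetical stub that returns a canned answer so that the example runs without a real model.

```python
from typing import Callable, Dict


def to_question(record_text: str) -> str:
    """Wrap a medical record in the question form (template is an assumption)."""
    return (
        "Extract the symptom entities from the following medical record "
        "and give the standard term for each: " + record_text
    )


def parse_answer(answer: str) -> Dict[str, str]:
    """Parse 'mention -> term; mention -> term' into a dictionary."""
    result = {}
    for item in answer.split(";"):
        if "->" not in item:
            continue  # skip fragments that do not follow the assumed format
        mention, term = item.split("->", 1)
        result[mention.strip()] = term.strip()
    return result


def standardize(record_text: str, generate: Callable[[str], str]) -> Dict[str, str]:
    """Run one record through the model and return entity -> standardized term."""
    return parse_answer(generate(to_question(record_text)))


def fake_generate(question: str) -> str:
    # Hypothetical stub standing in for the fine-tuned large language model.
    return "runny nose -> rhinorrhea; sore throat -> pharyngalgia"


result = standardize("Patient reports a runny nose and a sore throat.", fake_generate)
```

Keeping the inference question identical in form to the training questions means the model sees the same kind of input it was fine-tuned on.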
The medical named entity recognition and clinical term standardization apparatus provided by the present invention will be described below, and the medical named entity recognition and clinical term standardization apparatus described below and the medical named entity recognition and clinical term standardization method described above may be referred to correspondingly.
As shown in FIG. 10, in one embodiment, a medical named entity recognition and clinical term normalization apparatus includes a data processing module 1010, a model fine tuning module 1020, and a normalization processing module 1030.
The data processing module 1010 is configured to obtain named entity identification data with labels and clinical term standardized data with labels, and convert the named entity identification data and the clinical term standardized data into a generated data format in the form of question-answer pairs, so as to obtain training data.
The model fine tuning module 1020 is configured to perform parameter fine tuning on the large language model through the training data, and deploy the fine-tuned large language model when the output data of the large language model reaches a set expectation.
The normalization processing module 1030 is configured to obtain a current medical record text, and take the current medical record text as an input of the fine-tuned large language model, so as to output the entities in the current medical record text and the standardized terms corresponding to the entities.
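The text does not quantify the "set expectation" that the model fine tuning module 1020 checks before deployment. One possible reading, sketched below, is an exact-match rate on held-out question-answer pairs compared against a threshold; the metric, the threshold values, and the dictionary-backed stand-in model are all assumptions.

```python
from typing import Callable, List, Tuple

QAPair = Tuple[str, str]
Model = Callable[[str], str]


def exact_match_rate(model: Model, held_out: List[QAPair]) -> float:
    """Fraction of held-out questions whose output equals the reference answer."""
    if not held_out:
        raise ValueError("held_out must contain at least one pair")
    hits = sum(model(q).strip() == a.strip() for q, a in held_out)
    return hits / len(held_out)


def should_deploy(model: Model, held_out: List[QAPair], threshold: float = 0.9) -> bool:
    """Deploy only when the output reaches the set expectation (assumed metric)."""
    return exact_match_rate(model, held_out) >= threshold


# Hypothetical held-out data and a stand-in model that answers 3 of 4 correctly.
held_out = [("q1", "a1"), ("q2", "a2"), ("q3", "a3"), ("q4", "a4")]
canned = {"q1": "a1", "q2": "a2", "q3": "a3", "q4": "wrong"}


def model(question: str) -> str:
    return canned[question]


rate = exact_match_rate(model, held_out)
```

With these sample values the rate is 0.75, so the model would be held back at a threshold of 0.9 and deployed at a threshold of 0.7.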
In this embodiment, the medical named entity recognition and clinical term standardization apparatus provided by the present invention further includes a data labeling module, configured to:
and acquiring named entity identification data and clinical term standardized data, and respectively labeling the named entity identification data and the clinical term standardized data to obtain labeled named entity identification data and labeled clinical term standardized data.
Based on the named entity identification data with the labels and the clinical term normalization data with the labels, symptom entities in the named entity identification data and the clinical term normalization data are identified.
In this embodiment, the medical named entity recognition and clinical term standardization apparatus provided by the present invention, the data processing module is specifically configured to:
The named entity identification data and the clinical term standardization data are converted into a generated data format in the form of question-answer pairs based on the symptom entities in the named entity identification data and the clinical term standardization data.
And converting the named entity identification task and the clinical term standardization task to be processed into a text generation task according to a generated data format between the named entity identification data and the clinical term standardization data.
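To illustrate how two different tasks can share one generation data format, the sketch below casts a named entity identification sample and a clinical term standardization sample into the same question-answer shape. The question wording, the dictionary keys, and the sample terms are assumptions made for illustration.

```python
from typing import Dict, List


def ner_example(text: str, entities: List[str]) -> Dict[str, str]:
    """Named entity identification cast as text generation."""
    return {
        "question": "Which symptom entities appear in this medical record? " + text,
        "answer": ", ".join(entities),
    }


def normalization_example(mention: str, standard_term: str) -> Dict[str, str]:
    """Clinical term standardization cast as text generation."""
    return {
        "question": f'What is the standard clinical term for "{mention}"?',
        "answer": standard_term,
    }


# Hypothetical samples, for illustration only.
examples = [
    ner_example("Patient reports a runny nose.", ["runny nose"]),
    normalization_example("runny nose", "rhinorrhea"),
]
```

Since both samples have the same keys and are plain text on both sides, a single generative model can be trained on a mixture of them without task-specific output layers.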
In this embodiment, the medical named entity recognition and clinical term standardization apparatus provided by the present invention, the data processing module is specifically further configured to:
And acquiring the correspondence between the symptom entity in the named entity identification data and the symptom entity in the clinical term standardization data through a text generation task.
And acquiring training data according to the corresponding relation between the symptom entity in the named entity identification data and the symptom entity in the clinical term standardization data.
The training data comprises symptom entities in the named entity identification data, the symptom entities in the clinical term standardization data and corresponding relations.
In this embodiment, the medical named entity recognition and clinical term standardization apparatus provided by the present invention, the model fine tuning module is specifically configured to:
And adding a bypass on one side of the pre-training language model, and fixing model parameters of the pre-training language model during model training to train a dimension reduction matrix and a dimension increase matrix in the bypass, wherein the input and output dimensions of the pre-training language model are unchanged.
And calling the pre-training language model to superimpose the dimension reduction matrix, the dimension increase matrix and the model parameters of the pre-training language model so as to obtain superimposed model parameters.
In this embodiment, the medical named entity recognition and clinical term standardization apparatus provided by the present invention, the model fine tuning module is specifically further configured to:
Initializing the dimension reduction matrix with a random Gaussian distribution, and initializing the dimension increase matrix as a zero matrix, so that the bypass matrix is a zero matrix at the start of training of the pre-training language model.
Taking the ChatGLM-6B open source model as the model base, applying LoRA parameter-efficient fine-tuning with the training data, and training the large language model in combination with the superimposed model parameters to obtain the fine-tuned large language model.
In this embodiment, in the medical named entity recognition and clinical term standardization apparatus provided by the present invention, the normalization processing module is specifically configured to:
Call the fine-tuned large language model to process the current medical record text so as to convert the current medical record text into the question form of the question-answer pair generation data format.
Obtain, based on the question form of the current medical record text, the standardized terms corresponding to the entities in the current medical record text that are output by the fine-tuned large language model.
Fig. 11 illustrates the physical structure of an electronic device, which may be an intelligent terminal. The electronic device includes a processor, a memory, and a network interface connected by a system bus. The processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the electronic device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a method for medical named entity identification and clinical term standardization, the method comprising:
Acquiring named entity identification data with labels and clinical term standardized data with labels, and converting the named entity identification data and the clinical term standardized data into a generated data format in a question-answer pair form to obtain training data;
performing parameter fine adjustment on the large language model through training data, and deploying the fine-adjusted large language model when the output data of the large language model reaches a set expectation;
acquiring the current medical record text, and taking the current medical record text as the input of the fine-tuned large language model to output the entities in the current medical record text and the standardized terms corresponding to the entities.
It will be appreciated by those skilled in the art that the structure shown in fig. 11 is merely a block diagram of a portion of the structure associated with the present inventive arrangements and is not limiting of the electronic device to which the present inventive arrangements are applied, and that a particular electronic device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In another aspect, the present invention also provides a computer storage medium storing a computer program which when executed by a processor implements a method for identifying a medical named entity and normalizing clinical terms, the method comprising:
Acquiring named entity identification data with labels and clinical term standardized data with labels, and converting the named entity identification data and the clinical term standardized data into a generated data format in a question-answer pair form to obtain training data;
performing parameter fine adjustment on the large language model through training data, and deploying the fine-adjusted large language model when the output data of the large language model reaches a set expectation;
acquiring the current medical record text, and taking the current medical record text as the input of the fine-tuned large language model to output the entities in the current medical record text and the standardized terms corresponding to the entities.
In yet another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. A processor of an electronic device reads the computer instructions from a computer readable storage medium, the processor executing the computer instructions to implement a medical named entity recognition and clinical term normalization method, the method comprising:
Acquiring named entity identification data with labels and clinical term standardized data with labels, and converting the named entity identification data and the clinical term standardized data into a generated data format in a question-answer pair form to obtain training data;
performing parameter fine adjustment on the large language model through training data, and deploying the fine-adjusted large language model when the output data of the large language model reaches a set expectation;
acquiring the current medical record text, and taking the current medical record text as the input of the fine-tuned large language model to output the entities in the current medical record text and the standardized terms corresponding to the entities.
Those skilled in the art will appreciate that all or part of the above described methods may be implemented by a computer program stored on a non-transitory computer readable storage medium, and that the program, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory.
By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any combination that involves no contradiction should be considered to fall within the scope of this description.
The foregoing examples illustrate only a few embodiments of the invention and describe them in detail, but they should not be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the invention, all of which fall within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims (8)

1. A method of medical named entity identification and clinical term normalization, the method comprising:
Acquiring named entity identification data with labels and clinical term standardized data with labels, and converting the named entity identification data and the clinical term standardized data into a generated data format in a question-answer pair form to obtain training data;
Performing parameter fine adjustment on the large language model through the training data, and deploying the fine-adjusted large language model when the output data of the large language model reaches a set expectation;
Acquiring a current medical record text, and taking the current medical record text as the input of the fine-tuned large language model to output entities in the current medical record text and standardized terms corresponding to the entities;
The step of performing parameter fine adjustment on the large language model through the training data, and deploying the fine-adjusted large language model when the output data of the large language model reaches a set expectation comprises the following steps:
Adding a bypass on one side of a pre-training language model, and fixing model parameters of the pre-training language model during model training to train a dimension reduction matrix and a dimension increase matrix in the bypass, wherein the input and output dimensions of the pre-training language model are unchanged;
Calling the pre-training language model to superimpose the dimension reduction matrix and the dimension increase matrix with model parameters of the pre-training language model so as to obtain superimposed model parameters;
The step of performing parameter fine adjustment on the large language model through the training data, and deploying the fine-adjusted large language model when the output data of the large language model reaches a set expectation, further comprises:
Initializing the dimension reduction matrix with a random Gaussian distribution, and initializing the dimension increase matrix as a zero matrix, so that the bypass matrix is a zero matrix at the start of training of the pre-training language model;
Taking the ChatGLM-6B open source model as the model base, applying LoRA parameter-efficient fine-tuning with the training data, and training the large language model in combination with the superimposed model parameters to obtain the fine-tuned large language model.
2. The medical named entity identification and clinical term standardization method according to claim 1, wherein the step of acquiring named entity identification data with labels and clinical term standardized data with labels, and converting the named entity identification data and the clinical term standardized data into a generated data format in the form of question-answer pairs to obtain training data, comprises:
acquiring the named entity identification data and clinical term standardized data, and respectively marking the named entity identification data and the clinical term standardized data to obtain marked named entity identification data and marked clinical term standardized data;
And identifying symptom entities in the named entity identification data and the clinical term standardization data based on the named entity identification data with labels and the clinical term standardization data with labels.
3. The method for identifying and standardizing clinical terms for medical named entities according to claim 2, wherein said obtaining named entity identification data with labels and clinical term standardized data with labels and converting said named entity identification data and clinical term standardized data into a generated data format in the form of question-answer pairs, obtaining training data, comprises:
Converting the named entity identification data and the clinical term standardization data into a generated data format in a question-answer pair form based on symptom entities in the named entity identification data and the clinical term standardization data;
and converting the named entity identification task and the clinical term standardization task to be processed into a text generation task according to a generated data format between the named entity identification data and the clinical term standardization data.
4. The method for identifying and standardizing clinical terms for medical named entities according to claim 3, wherein said acquiring named entity identification data with labels and clinical term standardized data with labels and converting said named entity identification data and clinical term standardized data into a generated data format in the form of question-answer pairs, obtaining training data, further comprises:
acquiring a corresponding relation between the symptom entity in the named entity identification data and the symptom entity in the clinical term standardization data through the text generation task;
Acquiring the training data according to the corresponding relation between the symptom entity in the named entity identification data and the symptom entity in the clinical term standardization data;
The training data comprises symptom entities in named entity identification data, the symptom entities in clinical term standardization data and the corresponding relations.
5. The method according to any one of claims 1 to 4, wherein the acquiring a current medical record text and taking the current medical record text as the input of the fine-tuned large language model to output the entities in the current medical record text and the standardized terms corresponding to the entities comprises:
calling the fine-tuned large language model to process the current medical record text so as to convert the current medical record text into the question form of the question-answer pair generation data format;
and acquiring, based on the question form of the current medical record text, the standardized terms corresponding to the entities in the current medical record text that are output by the fine-tuned large language model.
6. A medical named entity recognition and clinical term normalization device, the device comprising:
the data processing module is used for acquiring named entity identification data with labels and clinical term standardized data with labels, and converting the named entity identification data and the clinical term standardized data into a generated data format in a question-answer pair form to obtain training data;
The model fine-tuning module is used for carrying out parameter fine-tuning on the large language model through the training data, and deploying the fine-tuned large language model when the output data of the large language model reaches a set expectation;
the standardization processing module is used for acquiring a current medical record text, and taking the current medical record text as the input of the fine-tuned large language model so as to output entities in the current medical record text and standardized terms corresponding to the entities;
The step of performing parameter fine adjustment on the large language model through the training data, and deploying the fine-adjusted large language model when the output data of the large language model reaches a set expectation comprises the following steps:
Adding a bypass on one side of a pre-training language model, and fixing model parameters of the pre-training language model during model training to train a dimension reduction matrix and a dimension increase matrix in the bypass, wherein the input and output dimensions of the pre-training language model are unchanged;
Calling the pre-training language model to superimpose the dimension reduction matrix and the dimension increase matrix with model parameters of the pre-training language model so as to obtain superimposed model parameters;
The step of performing parameter fine adjustment on the large language model through the training data, and deploying the fine-adjusted large language model when the output data of the large language model reaches a set expectation, further comprises:
Initializing the dimension reduction matrix with a random Gaussian distribution, and initializing the dimension increase matrix as a zero matrix, so that the bypass matrix is a zero matrix at the start of training of the pre-training language model;
Taking the ChatGLM-6B open source model as the model base, applying LoRA parameter-efficient fine-tuning with the training data, and training the large language model in combination with the superimposed model parameters to obtain the fine-tuned large language model.
7. An electronic device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 5 when the computer program is executed.
8. A computer storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method of any one of claims 1 to 5.
CN202410406655.2A 2024-04-07 2024-04-07 Medical named entity recognition and clinical term standardization method and device Active CN117993391B (en)


Publications (2)

Publication Number Publication Date
CN117993391A CN117993391A (en) 2024-05-07
CN117993391B true CN117993391B (en) 2024-06-25






Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant