CN117350291A - Electronic medical record named entity identification method, device, equipment and storage medium - Google Patents

Publication number
CN117350291A
Authority
CN
China
Prior art keywords
feature information
semantic feature
token
real
text data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311309766.3A
Other languages
Chinese (zh)
Inventor
张兆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN202311309766.3A
Publication of CN117350291A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The present invention relates to the field of digital medical treatment, and in particular to a method, an apparatus, a device, and a storage medium for identifying named entities in an electronic medical record. The method comprises: acquiring real-time text data of an electronic medical record; processing the real-time text data by contrastive learning in a pre-trained language characterization model to obtain token semantic feature information; performing label semantization on the real-time text data based on preset indication labels to obtain a plurality of pieces of label semantic feature information; performing similarity-based relevance calculation between each piece of token semantic feature information and the plurality of pieces of label semantic feature information to obtain the label semantic feature information corresponding to each piece of token semantic feature information; and performing named entity recognition on the real-time text data to extract named entities corresponding to preset entity types. The invention fully learns the token semantic feature information of the text, improves the generalization of the model after transfer, and integrates the semantic knowledge of the labels through the relevance calculation, thereby improving the efficiency and accuracy of named entity recognition.

Description

Electronic medical record named entity identification method, device, equipment and storage medium
Technical Field
The invention relates to the field of artificial intelligence and digital medical treatment, in particular to a method, a device, equipment and a storage medium for identifying a named entity of an electronic medical record.
Background
With the rapid development and application of hospital information systems, medical institutions have accumulated large-scale electronic medical record data. These data are important records generated during patients' hospital visits and treatment, and include medical record text, medical charts, medical images, and other types of data, which medical staff can conveniently and quickly use for medical data analysis. Named entity recognition on electronic medical records is an upstream task of medical information processing. Named entity recognition refers to identifying entities in text that have a particular meaning and classifying them into predefined categories, such as diseases, treatments, symptoms, and medicines.
Given the complexity of medical scenarios and the limited annotated corpora, the prior art addresses the problem with transfer learning: a model is pre-trained on a source data domain (source domain) and then transferred to a target data domain (target domain) for fine-tuning. In practice, however, medical terminology and modes of expression vary widely across specialties and hospitals, and data privacy concerns prevent different specialties or hospitals from sharing data. As a result, the generalization or transfer effect of current practice is very limited in medical scenarios, and generalization to an unseen target domain that differs greatly from the source data domain is particularly affected. Therefore, how to effectively improve the generalization of the model after transfer and achieve accurate named entity recognition on electronic medical records has become a technical problem to be solved urgently by those skilled in the art.
Disclosure of Invention
In view of the above, it is necessary to provide a method, a device, equipment and a storage medium for identifying named entities of an electronic medical record, so as to solve the problem that the prior art cannot effectively improve the generalization of a model after transfer and therefore cannot identify named entities of electronic medical records accurately.
A first aspect of an embodiment of the present application provides a method for identifying a named entity of an electronic medical record, where the method includes:
acquiring real-time text data of an electronic medical record, and processing the real-time text data by contrastive learning in a pre-trained language characterization model to obtain token semantic feature information corresponding to the real-time text data;
performing label semantization on the real-time text data based on preset indication labels to obtain a plurality of pieces of label semantic feature information corresponding to the real-time text data;
performing similarity-based relevance calculation between each piece of token semantic feature information and the plurality of pieces of label semantic feature information to obtain the label semantic feature information corresponding to the token semantic feature information;
and performing named entity recognition on the real-time text data according to the label semantic feature information corresponding to the token semantic feature information, so as to extract named entities corresponding to preset entity types.
A second aspect of the embodiments of the present application provides an electronic medical record named entity recognition device, including:
the acquisition module is used for acquiring real-time text data of an electronic medical record, and processing the real-time text data by contrastive learning in a pre-trained language characterization model to obtain token semantic feature information corresponding to the real-time text data;
the processing module is used for performing label semantization on the real-time text data based on preset indication labels to obtain a plurality of pieces of label semantic feature information corresponding to the real-time text data;
the calculating module is used for performing similarity-based relevance calculation between each piece of token semantic feature information and the plurality of pieces of label semantic feature information to obtain the label semantic feature information corresponding to the token semantic feature information;
and the extraction module is used for performing named entity recognition on the real-time text data according to the label semantic feature information corresponding to the token semantic feature information, so as to extract named entities corresponding to preset entity types.
In a third aspect, an embodiment of the present invention provides a computer device, where the computer device includes a processor, a memory, and a computer program stored in the memory and capable of running on the processor, and when the processor executes the computer program, the electronic medical record named entity identification method according to the first aspect is implemented.
In a fourth aspect, an embodiment of the present invention provides a computer readable storage medium, where a computer program is stored, where the computer program is executed by a processor to implement the electronic medical record named entity identifying method according to the first aspect.
In summary, the invention provides a method, a device, equipment and a storage medium for identifying named entities of an electronic medical record. Real-time text data of the electronic medical record are acquired and processed by contrastive learning in a pre-trained language characterization model to obtain token semantic feature information corresponding to the real-time text data. Label semantization is performed on the real-time text data based on preset indication labels to obtain a plurality of pieces of label semantic feature information corresponding to the real-time text data. Similarity-based relevance calculation is performed between each piece of token semantic feature information and the plurality of pieces of label semantic feature information to obtain the label semantic feature information corresponding to the token semantic feature information, and named entity recognition is then performed on the real-time text data to extract named entities corresponding to preset entity types. The invention fully learns the token semantic feature information of the text and improves the generalization of the model after transfer. It also integrates the semantic knowledge of the labels: the relevance calculation between each piece of token semantic feature information and the plurality of pieces of label semantic feature information enhances similarity-based recognition of named entities and greatly reduces the need for manual annotation. Obtaining the label semantic feature information corresponding to the token semantic feature information helps the model complete the named entity recognition task better, makes full use of the semantic information of the labels, and improves the efficiency and accuracy of named entity recognition.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an application environment of a method for identifying named entities of an electronic medical record according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for identifying named entities of an electronic medical record according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an electronic medical record named entity recognition device according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in the present description and the appended claims, the term "if" may be interpreted as "when" or "once" or "in response to a determination" or "in response to detection" depending on the context. Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted, depending on the context, as meaning "upon determination" or "in response to determination" or "upon detection of a [described condition or event]" or "in response to detection of a [described condition or event]".
Furthermore, the terms "first," "second," "third," and the like in the description of the present specification and in the appended claims, are used for distinguishing between descriptions and not necessarily for indicating or implying a relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the invention. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
It should be understood that the sequence numbers of the steps in the following embodiments do not mean the order of execution, and the execution order of the processes should be determined by the functions and the internal logic, and should not be construed as limiting the implementation process of the embodiments of the present invention.
The embodiment of the application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.
In order to illustrate the technical scheme of the invention, the following description is made by specific examples.
The method for identifying the named entities of the electronic medical record provided by the embodiment of the invention can be applied to an application environment as shown in fig. 1, wherein a client communicates with a server. The clients include, but are not limited to, palm top computers, desktop computers, notebook computers, ultra-mobile personal computer (UMPC), netbooks, personal digital assistants (personal digital assistant, PDA), and the like. The server can be realized by an independent server or a server cluster formed by a plurality of servers, and medical data such as personal health files, prescriptions, examination reports and the like can be uploaded and downloaded through the server.
It should be noted that the method for identifying named entities of an electronic medical record provided by the embodiment of the application is applied to the field of digital medical treatment, and a medical platform is used to output the electronic medical record data corresponding to various medical texts. Here, an electronic medical record specifically refers to a digitized medical record that is stored, managed, transmitted, and reproduced using electronic devices (computers, health cards, etc.) and replaces all the information of a handwritten paper record. The electronic medical record comprises documents of various types, such as project names, disease records, postoperative course records, examination results, medical orders, operation records, and admission records, and different document types contain different chapter types (for example, an admission record comprises chapters such as chief complaint, history of present illness, and family history). Reports can be acquired through the medical platform, which converts a text examination report into target text data and outputs the target text data to the user.
In one possible implementation, the medical text may be an electronic healthcare record (Electronic Healthcare Record), that is, an electronic personal health record: a series of electronic records with archival value, including medical records, electrocardiograms, medical images, and the like.
In one possible implementation, the method can be applied to intelligent diagnosis and treatment and to remote consultation, and can also be used for intelligent customer service in an internet hospital, where synthesized target speech is used for intelligent diagnosis and treatment and remote consultation.
Information inquiry is a channel for users to quickly acquire required information in many scenes. For example, in the medical field, medical record information required by a user can be queried from a large amount of electronic medical records based on an artificial intelligence model, and medical record reference can be provided for the user by outputting medical texts through voice.
It should be noted that the above medical application scenarios are only illustrative, and the specific examples are not limited thereto.
Referring to FIG. 2, a flowchart of a method for identifying named entities of an electronic medical record according to an embodiment of the present invention is shown. The method may be applied to the server in FIG. 1, which is connected to a corresponding client. As shown in FIG. 2, the method may include the following steps.
S201: acquiring real-time text data of the electronic medical record, and processing the real-time text data by contrastive learning in a pre-trained language characterization model to obtain token semantic feature information corresponding to the real-time text data.
In step S201, the real-time text data of the electronic medical record may be obtained from an electronic medical record database server, or by electronically scanning a paper medical record. The real-time text data are then processed by contrastive learning in the pre-trained language characterization model to obtain the token semantic feature information corresponding to the real-time text data.
Optionally, the processing the real-time text data by using a contrast learning mode in the pre-trained language characterization model to obtain token semantic feature information corresponding to the real-time text data includes:
pre-building a pre-trained language characterization model, wherein the pre-trained language characterization model comprises a Gaussian embedding layer;
inputting the real-time text data into the Gaussian embedding layer in the pre-trained language characterization model for contrastive processing to obtain the distribution distance between the tokens in the real-time text data;
and determining token semantic feature information corresponding to the real-time text data according to the distribution distance between the tokens in the real-time text data.
In this embodiment, current transfer learning methods in the industry are not efficient enough: they are limited by the data labels and data distribution of the source data domain, learn semantic features and intermediate representations insufficiently, and generalize poorly across scenarios. A Gaussian embedding layer is therefore built on top of BERT to form the pre-trained language characterization model, so that the model adapts better to tasks in the field. Training on a large-scale unlabeled corpus yields a text representation containing rich semantic information, that is, a semantic representation of the text; this representation is fine-tuned on a specific NLP task and finally applied to that task. Embedding, also called mapping, maps the words that make up a sentence to token vectors. The real-time text data are input into the Gaussian embedding layer in the pre-trained language characterization model, and contrastive learning is used to determine the distribution distance between the Gaussian embeddings of the tokens in the real-time text data: the model tries to reduce the distance between token embeddings of similar entities and to increase the distance between token embeddings of different entities. The token semantic feature information corresponding to the real-time text data is then determined according to the distribution distance between the tokens in the real-time text data.
For example, suppose the input is "Barack Obama was born in 1961", where "Barack Obama" is a person-name entity. The tokens within the entity ("Barack", "Obama") should be relatively close in embedding distance, while "Barack" and "Obama" should be farther from the tokens outside the entity ("was", "born", "in", "1961"). In practice, when two tokens belong to the same entity, a smaller loss value is assigned during training, and otherwise a larger loss value is assigned, so that the model learns the relationship between tokens inside and outside the entity. For instance, the calculated distance between "Barack" and "Obama" is small, and the calculated distance between "Obama" and "was" is large.
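The Barack Obama example can be sketched numerically. The following is a minimal illustration only: the source does not specify how the "distribution distance" or the loss is computed, so the symmetrised KL divergence between diagonal Gaussians and the InfoNCE-style loss used here are assumptions, as are the function names and the toy two-dimensional embeddings.

```python
import math

def kl_diag_gaussian(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for two diagonal Gaussians given as lists of means and variances."""
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        total += math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
    return 0.5 * total

def distribution_distance(p, q):
    """Symmetrised KL divergence (an assumed choice); p and q are (means, variances) pairs."""
    return 0.5 * (kl_diag_gaussian(*p, *q) + kl_diag_gaussian(*q, *p))

def contrastive_loss(anchor, positives, negatives):
    """Small when the anchor is close to same-entity tokens and far from the others."""
    pos = sum(math.exp(-distribution_distance(anchor, p)) for p in positives)
    neg = sum(math.exp(-distribution_distance(anchor, n)) for n in negatives)
    return -math.log(pos / (pos + neg))

# Hypothetical Gaussian embeddings (means, variances) for three tokens.
barack = ([1.0, 0.0], [1.0, 1.0])
obama = ([0.9, 0.1], [1.0, 1.0])
was = ([-1.0, 2.0], [1.0, 1.0])

d_inside = distribution_distance(barack, obama)   # tokens of the same entity
d_outside = distribution_distance(obama, was)     # entity token vs. outside token
loss = contrastive_loss(barack, positives=[obama], negatives=[was])
```

With these toy values the distance inside the entity is much smaller than the distance to an outside token, and the loss is lower when "Obama" rather than "was" is treated as the positive for "Barack".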
It should be noted that the pre-trained language model may be GPT-3, ChatGLM, BERT, or the like, which is not limited in this application.
For example, the treatment site and the treatment target differ between patients, so the way named entities of the electronic medical record are identified also differs at each stage. Therefore, in the technical scheme of the invention, the identification mode for the named entities of the target electronic medical record can be determined according to the current medical instrument and the current progress of the rehabilitation data. In one possible implementation, the data are medical data, such as personal health records, prescriptions, and examination reports.
In the method and the device, the real-time text data of the electronic medical record are acquired and processed by contrastive learning in the pre-trained language characterization model, so that the token semantic feature information of the text is fully learned, the model can better capture the dependency relationship between labels, the intermediate representations and the semantic features are fully learned, and the generalization of the model after transfer is improved.
S202: performing label semantization on the real-time text data based on preset indication labels to obtain a plurality of pieces of label semantic feature information corresponding to the real-time text data.
In step S202, the motivation is that conventional entity extraction does not make use of the semantic features of the labels. For example, when the two entity types "PER" and "LOC" are extracted, a conventional model only knows that it predicts class 0 (PER) or class 1 (LOC); it does not know what semantics 0 and 1 actually represent. Therefore, label knowledge is integrated: based on the preset indication labels, label semantization is performed on the real-time text data to obtain a plurality of pieces of label semantic feature information corresponding to the real-time text data. For example, "PER" stands for person, so the label B-PER is semantized as "begin person", the label I-PER as "inside person", and so on. Semantization thus means converting a label such as PER into phrases such as "begin person" and "inside person", which are then encoded by the BERT model to generate the label semantic feature information.
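The conversion from label codes to natural-language phrases can be sketched as follows. This is a minimal illustration, not the patented implementation: the mappings for "B", "I", and "PER" follow the example in the text, while the phrase for "LOC", the handling of the "O" label, and the function name are assumptions. Encoding the resulting phrases with BERT is omitted.

```python
# Words for the position prefixes, as in the "begin person" / "inside person" example.
PREFIX_WORDS = {"B": "begin", "I": "inside"}

# Words for the entity types; "location" for LOC is an assumed expansion.
TYPE_WORDS = {"PER": "person", "LOC": "location"}

def semantize_label(tag):
    """Turn a BIO tag such as 'B-PER' into a phrase such as 'begin person'."""
    if tag == "O":
        return "other"  # assumed phrase for tokens outside any entity
    prefix, entity_type = tag.split("-", 1)
    return f"{PREFIX_WORDS[prefix]} {TYPE_WORDS[entity_type]}"

label_phrases = {tag: semantize_label(tag) for tag in ["B-PER", "I-PER", "B-LOC", "I-LOC", "O"]}
```

Each phrase in `label_phrases` would then be passed through the label encoder to obtain one piece of label semantic feature information per label.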
In the embodiment of the invention, there is no need to design separate small models for each field in different medical scenarios. The advantages of semantic features are fully utilized and the semantic feature information of the labels is fused, which enhances the generalization of the model in the target field and further improves the efficiency of named entity recognition.
S203: performing similarity-based relevance calculation between each piece of token semantic feature information and the plurality of pieces of label semantic feature information to obtain the label semantic feature information corresponding to the token semantic feature information.
In step S203, once the token semantic feature information and the plurality of pieces of label semantic feature information corresponding to the real-time text data have been obtained, the degree of correlation between them is determined: relevance calculation is performed between each piece of token semantic feature information and the plurality of pieces of label semantic feature information based on the similarity, so as to obtain the label semantic feature information corresponding to the token semantic feature information.
Optionally, performing relevance calculation on each token semantic feature information and a plurality of tag semantic feature information based on the similarity includes:
judging whether the similarity between each token semantic feature information and a plurality of label semantic feature information is larger than a preset similarity threshold value or not;
and if the similarity between each token semantic feature information and the plurality of tag semantic feature information is larger than a preset similarity threshold, performing relevance calculation on each token semantic feature information and the plurality of tag semantic feature information.
In this embodiment, the similarity between each piece of token semantic feature information and the plurality of pieces of label semantic feature information is calculated and compared with a preset similarity threshold. If the similarity is greater than the preset similarity threshold, relevance calculation is performed on the token semantic feature information and the plurality of pieces of label semantic feature information. If the similarity is not greater than the preset similarity threshold, the relevance calculation is not performed, and the steps of the electronic medical record named entity recognition method need to be executed again.
It should be noted that, the specific value of the preset similarity threshold may be set according to the actual requirement of the user, and the embodiment of the present application is not limited in any way.
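The threshold check can be illustrated with a short sketch. The source does not name a similarity measure, so cosine similarity is an assumption here, as are the function names, the toy vectors, and the threshold value of 0.5.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def labels_above_threshold(token_vec, label_vecs, threshold):
    """Return the labels whose similarity to the token exceeds the preset threshold."""
    passed = {}
    for name, vec in label_vecs.items():
        sim = cosine_similarity(token_vec, vec)
        if sim > threshold:
            passed[name] = sim
    return passed

# Hypothetical feature vectors for one token and two semantized labels.
token_vec = [1.0, 0.0]
label_vecs = {"begin person": [0.9, 0.1], "other": [0.0, 1.0]}

candidates = labels_above_threshold(token_vec, label_vecs, threshold=0.5)
```

In this sketch an empty result corresponds to the case in which the relevance calculation is skipped and the method is executed again.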
Optionally, performing relevance calculation on each token semantic feature information and a plurality of tag semantic feature information, including:
for each piece of token semantic feature information, fusing that piece of token semantic feature information with the plurality of pieces of label semantic feature information to obtain the different pieces of label semantic feature information corresponding to the token semantic feature information;
sequentially classifying and scoring the different pieces of label semantic feature information corresponding to the token semantic feature information to obtain a target score set;
sorting the score values in the target score set from largest to smallest, and selecting the label semantic feature information with the maximum score as the closest label semantic feature information corresponding to the token semantic feature information;
repeating the relevance calculation step for the other pieces of token semantic feature information until the label semantic feature information corresponding to all pieces of token semantic feature information has been determined.
In this embodiment, when performing relevance calculation on each token semantic feature information and multiple tag semantic feature information, first, fusing one token semantic feature information with multiple tag semantic feature information to obtain different tag semantic feature information corresponding to the token semantic feature information, then sequentially classifying and scoring the different tag semantic feature information corresponding to the token semantic feature information to obtain a target score set, sequentially sorting each score value in the target score set from large to small, selecting the maximum value as the tag semantic feature information corresponding to the token semantic feature information most similar through priority, repeating the relevance calculation step for the rest other token semantic feature information, and sequentially analogizing until the tag semantic feature information corresponding to all the token semantic feature information is completed. For example, 2 encoders (encodings) are constructed through a BERT model, input token semantic feature information and label semantic feature information are encoded respectively, then the semantic feature information of each token of an input text is associated with semantic feature information of a plurality of labels (labels) for calculation, labels (labels) closest to the token are obtained, and the labels corresponding to all the token are calculated by analogy, so that extraction of named entities can be completed. The extraction process is very general, and for different labels in different scenes, redesign or retraining is not needed (2 encoders can already encode the input token semantic feature information well and the label semantic feature information).
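The fuse, score, sort, and select procedure can be sketched as below. This is an illustration under stated assumptions: the source does not define the fusion or scoring function, so a dot product between the token vector and each label vector stands in for both, and the vectors are hypothetical outputs of the two encoders rather than real BERT encodings.

```python
def score_labels(token_vec, label_vecs):
    """Score every label against one token (dot product as an assumed fusion and scoring step)."""
    return {
        name: sum(t * l for t, l in zip(token_vec, vec))
        for name, vec in label_vecs.items()
    }

def closest_label(token_vec, label_vecs):
    """Sort the scores from largest to smallest and return the top label with the ranking."""
    scores = score_labels(token_vec, label_vecs)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[0][0], ranked

def tag_tokens(token_vecs, label_vecs):
    """Repeat the relevance calculation for every token in the input text."""
    return [closest_label(vec, label_vecs)[0] for vec in token_vecs]

# Hypothetical encoder outputs for three semantized labels and three tokens.
label_vecs = {
    "begin person": [1.0, 0.0],
    "inside person": [0.7, 0.7],
    "other": [0.0, 1.0],
}
token_vecs = [[0.9, 0.1], [0.6, 0.8], [0.0, 1.0]]

assigned = tag_tokens(token_vecs, label_vecs)
```

Because the label set enters only through `label_vecs`, a different scenario with different labels changes the dictionary and not the procedure, which mirrors the generality claimed in the text.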
In this embodiment, relevance calculation is performed on each piece of token semantic feature information and the plurality of label semantic feature information based on similarity, so as to obtain the label semantic feature information corresponding to the token semantic feature information. This helps the model complete the named entity recognition task, makes full use of the semantic information of the labels, and improves the efficiency and accuracy of named entity recognition.
S204: performing named entity recognition on the real-time text data according to the label semantic feature information corresponding to the token semantic feature information, so as to extract the named entity corresponding to the preset entity type.
In step S204, after the label semantic feature information corresponding to the token semantic feature information has been obtained, named entity recognition is performed on the real-time text data according to that label semantic feature information, so as to extract the named entity corresponding to the preset entity type.
Optionally, performing named entity recognition on the real-time text data to extract the named entity corresponding to the preset entity type includes:
performing named entity recognition on the real-time text data according to the label semantic feature information corresponding to all token semantic feature information to obtain the entity attribute identifier corresponding to each token in the real-time text data;
judging whether the entity attribute identifier corresponding to each token in the real-time text data matches the preset entity type;
and if the entity attribute identifier corresponding to each token in the real-time text data matches the preset entity type, extracting the named entity corresponding to the preset entity type.
In one embodiment, after the label semantic feature information corresponding to all token semantic feature information has been determined, a preset named entity recognition model performs named entity recognition on the real-time text data to obtain the entity attribute identifier corresponding to each token in the real-time text data, where an entity attribute identifier indicates whether the corresponding token belongs to a named entity. After the tokens in the real-time text data are determined to belong to named entities, it is judged whether the entity attribute identifier corresponding to each token matches the preset entity type. If it matches, the named entity corresponding to the preset entity type is extracted. If it does not match, the named entity corresponding to the preset entity type cannot be extracted, and the steps of the electronic medical record named entity identification method need to be executed again.
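The matching and extraction step can be illustrated with a short sketch. It assumes that each token already carries a label string as its entity attribute identifier, that adjacent tokens with the same label form one entity, and that "O" marks tokens outside any entity. None of these conventions is stated in the text, and the function name `extract_entities` is hypothetical.

```python
def extract_entities(tokens, labels, preset_types, joiner=" "):
    """Merge adjacent tokens that share a label; keep spans whose label is a preset type."""
    entities = []
    current_tokens, current_label = [], None
    for token, label in zip(tokens, labels):
        if label == current_label:
            # Same label as the previous token: extend the current span.
            current_tokens.append(token)
            continue
        # Label changed: close the previous span if it matches a preset type.
        if current_label in preset_types:
            entities.append((joiner.join(current_tokens), current_label))
        current_tokens, current_label = [token], label
    # Close the final span.
    if current_label in preset_types:
        entities.append((joiner.join(current_tokens), current_label))
    return entities

tokens = ["liver", "cyst", "found", "in", "right", "lobe"]
labels = ["disease", "disease", "O", "O", "body_part", "body_part"]
found = extract_entities(tokens, labels, preset_types={"disease", "body_part"})
print(found)
```

Passing a narrower `preset_types` set, for example only `{"disease"}`, returns just the spans of that type, which corresponds to extracting entities for one preset entity type.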
Optionally, before extracting the named entity corresponding to the preset entity type, the method includes:
pre-establishing a named entity index based on a plurality of tag semantic feature information;
and inputting the token semantic feature information into the named entity index, and directly extracting the named entity corresponding to the preset entity type.
In this embodiment, before the named entity corresponding to the preset entity type is extracted, all label semantic feature information in the target domain scenario can be computed in advance and saved, and a named entity index can then be built over it (Redis can be used for storage and for the index design). When the named entity corresponding to the preset entity type is extracted, only the input token semantic feature information needs to be computed each time, so the named entities of the electronic medical record can be extracted quickly. For example, when the relevance between token semantic feature information and label semantic feature information is computed, the types of labels to be extracted are fixed for a given scenario. When extracting from a B-mode ultrasound report, for instance, only the report time, the referring doctor, the hospital, the disease and the body part are extracted. These labels are encoded by a BERT model in advance to obtain the label semantic feature information, so the computation does not need to be repeated each time a named entity is extracted.
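A minimal sketch of the precomputation idea follows. It uses an in-memory dictionary as a stand-in for the Redis store mentioned above, and `toy_encode` is a hypothetical placeholder for the BERT label encoder. The class name `LabelIndex` and its methods are assumptions made for illustration only.

```python
class LabelIndex:
    """In-memory stand-in for a label feature index (the text suggests Redis)."""

    def __init__(self, encode):
        self._encode = encode   # hypothetical label encoder, e.g. a BERT wrapper
        self._store = {}        # label name -> cached feature vector

    def build(self, label_names):
        """Encode each label once; labels already in the index are skipped."""
        for name in label_names:
            if name not in self._store:
                self._store[name] = self._encode(name)

    def get(self, name):
        """Return the cached feature vector for a label."""
        return self._store[name]

calls = []

def toy_encode(text):
    """Placeholder encoder: records each call and returns a two-number feature."""
    calls.append(text)
    return [float(len(text)), float(sum(ch in "aeiou" for ch in text))]

index = LabelIndex(toy_encode)
index.build(["report time", "hospital", "disease", "body part"])
index.build(["hospital", "disease"])  # already cached, so nothing is re-encoded
print(len(calls), index.get("hospital"))
```

With the label features cached, each extraction request only needs to encode the input text, which is the saving the paragraph above describes.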
In this embodiment, in complex medical scenarios that are difficult to annotate, named entity recognition is performed on the real-time text data through the label semantic feature information corresponding to the token semantic feature information, so as to extract the named entity corresponding to the preset entity type. This makes full use of the token semantic feature information, fuses in the label semantic feature information, and strengthens the generalization of the model. In addition, an index is designed to store the semantic feature information of the labels, which greatly increases the speed of model inference.
In summary, the invention provides an electronic medical record named entity identification method, device, equipment and storage medium. Real-time text data of an electronic medical record is acquired and processed using a contrastive learning mode in a pre-trained language characterization model to obtain the token semantic feature information corresponding to the real-time text data. Label semantic processing is performed on the real-time text data based on a preset indication label to obtain a plurality of label semantic feature information corresponding to the real-time text data. Relevance calculation is performed on each piece of token semantic feature information and the plurality of label semantic feature information based on similarity to obtain the label semantic feature information corresponding to the token semantic feature information, and named entity recognition is then performed on the real-time text data to extract the named entity corresponding to the preset entity type. The invention fully learns the token semantic feature information of the text, which improves generalization after the model is transferred, and it incorporates the semantic knowledge of the labels. Performing relevance calculation between each piece of token semantic feature information and the plurality of label semantic feature information strengthens similarity-based recognition of named entities and greatly reduces the need for manual annotation. Obtaining the label semantic feature information corresponding to the token semantic feature information helps the model complete the named entity recognition task, makes full use of the semantic information of the labels, and improves the efficiency and accuracy of named entity recognition.
Referring to fig. 3, fig. 3 is a schematic structural diagram of an electronic medical record named entity recognition device according to an embodiment of the invention. The device in this embodiment includes units for executing the steps in the embodiment corresponding to fig. 2; for details, refer to fig. 2 and the related description in the embodiment corresponding to fig. 2. For convenience of explanation, only the portions related to this embodiment are shown. As shown in fig. 3, the electronic medical record named entity recognition device 30 includes an acquisition module 31, a processing module 32, a calculation module 33 and an extraction module 34.
The acquisition module 31 is configured to acquire real-time text data of an electronic medical record, and to process the real-time text data using a contrastive learning mode in a pre-trained language characterization model to obtain token semantic feature information corresponding to the real-time text data;
the processing module 32 is configured to perform label semantic processing on the real-time text data based on a preset indication label to obtain a plurality of label semantic feature information corresponding to the real-time text data;
the calculation module 33 is configured to perform relevance calculation on each piece of token semantic feature information and the plurality of label semantic feature information based on similarity to obtain label semantic feature information corresponding to the token semantic feature information;
and the extraction module 34 is configured to perform named entity recognition on the real-time text data according to the label semantic feature information corresponding to the token semantic feature information, so as to extract the named entity corresponding to the preset entity type.
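The way the four modules hand data to one another can be sketched as a small pipeline. This is only an illustration of the data flow: the class name `RecognitionDevice`, the callable signatures and all four `toy_*` functions are assumptions, and the toy functions stand in for the real encoders and recognition model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]

@dataclass
class RecognitionDevice:
    acquire: Callable[[str], Tuple[List[str], List[Vector]]]           # module 31
    process: Callable[[Sequence[str]], Dict[str, Vector]]              # module 32
    calculate: Callable[[List[Vector], Dict[str, Vector]], List[str]]  # module 33
    extract: Callable[[List[str], List[str]], List[Tuple[str, str]]]   # module 34

    def run(self, text: str, indication_labels: Sequence[str]):
        tokens, token_feats = self.acquire(text)                 # token features
        label_feats = self.process(indication_labels)            # label features
        token_labels = self.calculate(token_feats, label_feats)  # closest label per token
        return self.extract(tokens, token_labels)                # named entities

def toy_acquire(text):
    """Whitespace tokenizer with a hand-made two-dimensional feature per token."""
    tokens = text.split()
    feats = [[1.0, 0.0] if t.endswith("itis") else [0.0, 1.0] for t in tokens]
    return tokens, feats

def toy_process(labels):
    """One-hot feature per label, in the order the labels are given."""
    return {name: [1.0 if i == j else 0.0 for j in range(len(labels))]
            for i, name in enumerate(labels)}

def toy_calculate(token_feats, label_feats):
    """Pick, for each token, the label with the largest dot product."""
    def score(u, v):
        return sum(a * b for a, b in zip(u, v))
    return [max(label_feats, key=lambda n: score(f, label_feats[n])) for f in token_feats]

def toy_extract(tokens, labels):
    """Keep every token whose label is not the assumed outside marker "O"."""
    return [(t, l) for t, l in zip(tokens, labels) if l != "O"]

device = RecognitionDevice(toy_acquire, toy_process, toy_calculate, toy_extract)
result = device.run("patient has hepatitis", ["disease", "O"])
print(result)
```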
Optionally, the acquisition module 31 is specifically configured to:
pre-building a pre-trained language characterization model, wherein the pre-trained language characterization model comprises a Gaussian embedding layer;
inputting the real-time text data into the Gaussian embedding layer of the pre-trained language characterization model for contrastive processing to obtain the distribution distance between the tokens in the real-time text data;
and determining the token semantic feature information corresponding to the real-time text data according to the distribution distance between the tokens in the real-time text data.
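The text does not say which distribution distance the Gaussian embedding layer uses. One common choice for comparing two Gaussian embeddings is a symmetrized Kullback-Leibler divergence between diagonal Gaussians, sketched below purely as an assumption. Each token is represented by a hypothetical (mean, variance) pair.

```python
import math

def kl_diag_gaussian(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for two Gaussians with diagonal covariance."""
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        total += math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
    return 0.5 * total

def distribution_distance(p, q):
    """Symmetrized KL divergence; p and q are (mean, variance) pairs."""
    return 0.5 * (kl_diag_gaussian(*p, *q) + kl_diag_gaussian(*q, *p))

# Two toy token distributions with unit variance and means one unit apart.
token_a = ([0.0, 0.0], [1.0, 1.0])
token_b = ([1.0, 0.0], [1.0, 1.0])
d_ab = distribution_distance(token_a, token_b)
print(d_ab)
```

In a contrastive setup, a distance of this kind would be made small for tokens of the same entity type and large otherwise; that training detail is also an assumption and is not spelled out in the text.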
Optionally, the calculation module 33 is specifically configured to:
judging whether the similarity between each token semantic feature information and a plurality of label semantic feature information is larger than a preset similarity threshold value or not;
and if the similarity between each token semantic feature information and the plurality of tag semantic feature information is larger than a preset similarity threshold, performing relevance calculation on each token semantic feature information and the plurality of tag semantic feature information.
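The threshold check can be read as a filter applied before the relevance calculation. The sketch below takes that reading as an assumption: only labels whose similarity to the token exceeds the preset threshold go on to be scored. Cosine similarity, the threshold value 0.5 and the function names are all illustrative choices, not taken from the text.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def candidate_labels(token_vec, label_vecs, threshold):
    """Return the labels whose similarity to the token is above the threshold."""
    kept = {}
    for name, vec in label_vecs.items():
        similarity = cosine(token_vec, vec)
        if similarity > threshold:
            kept[name] = similarity
    return kept

label_vecs = {"disease": [1.0, 0.0], "body_part": [0.0, 1.0], "hospital": [0.7, 0.7]}
kept = candidate_labels([0.9, 0.1], label_vecs, threshold=0.5)
print(sorted(kept))
```

If no label passes the threshold, the returned dictionary is empty and the relevance calculation would simply be skipped for that token under this reading.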
Optionally, the above-mentioned calculation module 33 is further configured to:
for each piece of token semantic feature information, fusing that token semantic feature information with the plurality of label semantic feature information in advance to obtain different label semantic feature information corresponding to the token semantic feature information;
classifying and scoring, in turn, the different label semantic feature information corresponding to the token semantic feature information to obtain a target score set;
sorting the score values in the target score set from largest to smallest, and selecting the label semantic feature information with the maximum score as the one closest to the token semantic feature information;
and repeating the relevance calculation step for the remaining token semantic feature information until the label semantic feature information corresponding to all token semantic feature information has been determined.
Optionally, the extraction module 34 is specifically configured to:
performing named entity recognition on the real-time text data according to the label semantic feature information corresponding to all token semantic feature information to obtain the entity attribute identifier corresponding to each token in the real-time text data;
judging whether the entity attribute identifier corresponding to each token in the real-time text data matches the preset entity type;
and if the entity attribute identifier corresponding to each token in the real-time text data matches the preset entity type, extracting the named entity corresponding to the preset entity type.
Optionally, the extraction module 34 is further configured to:
pre-establishing a named entity index based on a plurality of tag semantic feature information;
and inputting the token semantic feature information into the named entity index, and directly extracting the named entity corresponding to the preset entity type.
It should be noted that, because the information interaction between the above units and their execution processes are based on the same concept as the method embodiments of the present invention, their specific functions and technical effects may be found in the method embodiment section and are not described again here.
Fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present invention. As shown in fig. 4, the computer device of this embodiment includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the computer device communicates with an external terminal through a network connection. When the computer program is executed by the processor, the steps in any of the above embodiments of the electronic medical record named entity identification method are implemented.
The computer device may include, but is not limited to, a processor and a memory. Those skilled in the art will appreciate that fig. 4 is merely an example of a computer device and does not limit it; a computer device may include more or fewer components than shown, may combine certain components, or may use different components, and may for example also include a network interface, a display screen, an input device, and the like.
In an embodiment, a computer readable storage medium is provided. When instructions in the computer readable storage medium are executed by a processor in a computer device, the computer device is enabled to perform the steps of any embodiment of the electronic medical record named entity identification method disclosed in the present invention, which are not repeated here. The computer readable storage medium may be non-volatile or volatile.
The processor may be a central processing unit (CPU), or another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor or any conventional processor.
The memory includes a readable storage medium, an internal memory, and the like. The internal memory may be the memory of the computer device and provides an environment for running the operating system and the computer readable instructions stored in the readable storage medium. The readable storage medium may be a hard disk of the computer device; in other embodiments it may be an external storage device of the computer device, for example a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the computer device. Further, the memory may include both internal storage units and external storage devices of the computer device. The memory is used to store an operating system, application programs, a boot loader (BootLoader), data, and other programs, such as the program code of a computer program. The memory may also be used to temporarily store data that has been output or is to be output.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer readable storage medium; when executed, the program may carry out the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present invention. The specific working process of the units and modules in the above device may refer to the corresponding process in the foregoing method embodiment, which is not described herein again. The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention.

Claims (10)

1. An electronic medical record named entity identification method, characterized by comprising the following steps:
acquiring real-time text data of an electronic medical record, and processing the real-time text data by utilizing a contrast learning mode in a pre-trained language characterization model to obtain token semantic feature information corresponding to the real-time text data;
performing label semantic processing on the real-time text data based on a preset indication label to obtain a plurality of label semantic feature information corresponding to the real-time text data;
performing relevance calculation on each token semantic feature information and the plurality of label semantic feature information based on similarity to obtain label semantic feature information corresponding to the token semantic feature information;
and performing named entity recognition on the real-time text data according to the label semantic feature information corresponding to the token semantic feature information, so as to extract the named entity corresponding to the preset entity type.
2. The electronic medical record named entity identification method according to claim 1, wherein the processing the real-time text data by utilizing a contrast learning mode in a pre-trained language characterization model to obtain token semantic feature information corresponding to the real-time text data comprises:
pre-building a pre-trained language characterization model, wherein the pre-trained language characterization model comprises a Gaussian embedding layer;
inputting the real-time text data into the Gaussian embedding layer of the pre-trained language characterization model for contrastive processing to obtain the distribution distance between the tokens in the real-time text data;
and determining the token semantic feature information corresponding to the real-time text data according to the distribution distance between the tokens in the real-time text data.
3. The electronic medical record named entity identification method according to claim 1, wherein performing relevance calculation on each token semantic feature information and the plurality of label semantic feature information based on similarity comprises:
judging whether the similarity between each token semantic feature information and the plurality of label semantic feature information is greater than a preset similarity threshold;
and if the similarity between each token semantic feature information and the plurality of label semantic feature information is greater than the preset similarity threshold, performing relevance calculation on each token semantic feature information and the plurality of label semantic feature information.
4. The electronic medical record named entity identification method according to claim 3, wherein the performing relevance calculation on each token semantic feature information and the plurality of label semantic feature information comprises:
for each token semantic feature information, fusing that token semantic feature information with the plurality of label semantic feature information in advance to obtain different label semantic feature information corresponding to the token semantic feature information;
classifying and scoring, in turn, the different label semantic feature information corresponding to the token semantic feature information to obtain a target score set;
sorting the score values in the target score set from largest to smallest, and selecting the label semantic feature information with the maximum score as the one closest to the token semantic feature information;
and repeating the relevance calculation step for the remaining token semantic feature information until the label semantic feature information corresponding to all token semantic feature information has been determined.
5. The electronic medical record named entity identification method according to claim 1, wherein performing named entity recognition on the real-time text data to extract the named entity corresponding to the preset entity type comprises:
performing named entity recognition on the real-time text data according to the label semantic feature information corresponding to all token semantic feature information to obtain the entity attribute identifier corresponding to each token in the real-time text data;
judging whether the entity attribute identifier corresponding to each token in the real-time text data matches the preset entity type;
and if the entity attribute identifier corresponding to each token in the real-time text data matches the preset entity type, extracting the named entity corresponding to the preset entity type.
6. The method for identifying a named entity of an electronic medical record according to claim 5, wherein before extracting the named entity corresponding to the preset entity type, the method comprises:
pre-establishing a named entity index based on a plurality of tag semantic feature information;
and inputting the token semantic feature information into the named entity index, and directly extracting the named entity corresponding to the preset entity type.
7. An electronic medical record named entity recognition device, characterized by comprising:
the acquisition module is used for acquiring real-time text data of the electronic medical record, and processing the real-time text data by utilizing a contrast learning mode in a pre-trained language characterization model to obtain token semantic feature information corresponding to the real-time text data;
the processing module is used for performing label semantic processing on the real-time text data based on a preset indication label to obtain a plurality of label semantic feature information corresponding to the real-time text data;
the calculation module is used for performing relevance calculation on each token semantic feature information and the plurality of label semantic feature information based on similarity to obtain label semantic feature information corresponding to the token semantic feature information;
and the extraction module is used for performing named entity recognition on the real-time text data according to the label semantic feature information corresponding to the token semantic feature information, so as to extract the named entity corresponding to the preset entity type.
8. The electronic medical record named entity recognition device of claim 7, wherein the processing module comprises:
the building unit is used for pre-building a pre-trained language characterization model, wherein the pre-trained language characterization model comprises a Gaussian embedding layer;
the comparison unit is used for inputting the real-time text data into the Gaussian embedding layer of the pre-trained language characterization model for contrastive processing to obtain the distribution distance between the tokens in the real-time text data;
and the determining unit is used for determining the token semantic feature information corresponding to the real-time text data according to the distribution distance between the tokens in the real-time text data.
9. A computer device comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor implements the electronic medical record named entity recognition method of any one of claims 1 to 6 when executing the computer program.
10. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the electronic medical record named entity recognition method of any one of claims 1 to 6.
CN202311309766.3A 2023-10-10 2023-10-10 Electronic medical record named entity identification method, device, equipment and storage medium Pending CN117350291A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311309766.3A CN117350291A (en) 2023-10-10 2023-10-10 Electronic medical record named entity identification method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117350291A true CN117350291A (en) 2024-01-05

Family

ID=89364434

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311309766.3A Pending CN117350291A (en) 2023-10-10 2023-10-10 Electronic medical record named entity identification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117350291A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination