CN116956925A

CN116956925A - Electronic medical record named entity identification method and device, electronic equipment and storage medium

Info

Publication number: CN116956925A
Application number: CN202310929939.5A
Authority: CN
Inventors: 张兆
Original assignee: Ping An Life Insurance Company of China Ltd
Current assignee: Ping An Life Insurance Company of China Ltd
Priority date: 2023-07-26
Filing date: 2023-07-26
Publication date: 2023-10-27

Abstract

The embodiments of the application provide an electronic medical record named entity recognition method and device, an electronic device, and a storage medium, belonging to the technical field of digital healthcare. The method comprises: acquiring electronic medical record data, wherein the electronic medical record data comprises medical record text data; performing word segmentation on the medical record text data to obtain a medical record word sequence; performing feature extraction on the medical record word sequence through a first branch network of a named entity recognition model to obtain medical record text features, wherein the named entity recognition model further comprises a feature fusion network, a second branch network and a recognition network; fusing the medical record text features with vocabulary data based on the feature fusion network to obtain fused text features; performing entity extraction on the fused text features based on the second branch network to obtain fused text entity features; and performing entity recognition on the fused text entity features based on the recognition network to obtain the entity types of the fused text entity features. The embodiments of the application can improve the accuracy of named entity recognition.

Description

Electronic medical record named entity identification method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of digital medical technology, and in particular, to a method and apparatus for identifying a named entity of an electronic medical record, an electronic device, and a storage medium.

Background

Named entity recognition (Named Entity Recognition, NER) techniques identify specific entity information in text, such as person names, place names and organization names, and are widely used in information extraction, information retrieval, intelligent question answering, machine translation and other fields. Typically, the named entity recognition task is formulated as a sequence tagging task, in which entity boundaries and entity types are jointly predicted by predicting a tag for each character or each word.

In the field of digital healthcare, named entity recognition technology classifies the medical entity names contained in the text data of electronic medical records into preset categories, which can greatly advance medical information retrieval, question-answer dialogue in intelligent consultation, disease information extraction and the like.

At present, most electronic medical record named entity recognition methods recognize entity features in electronic medical record data by introducing a medical dictionary. Such methods often cannot select suitable vocabulary for different medical scenarios, which leads to errors in determining named entity boundaries and thus to low entity recognition accuracy.

Disclosure of Invention

The embodiment of the application mainly aims to provide a method and a device for identifying a named entity of an electronic medical record, electronic equipment and a storage medium, and aims to improve the accuracy of identifying the named entity of the electronic medical record.

To achieve the above object, a first aspect of an embodiment of the present application provides a method for identifying a named entity of an electronic medical record, where the method includes:

acquiring electronic medical record data; wherein the electronic medical record data comprises medical record text data;

word segmentation is carried out on the medical record text data to obtain a medical record word sequence;

extracting features of the medical record word sequence through a first branch network of a preset named entity recognition model to obtain medical record text features, wherein the named entity recognition model comprises a feature fusion network, a second branch network and a recognition network;

performing fusion processing on the medical record text characteristics and the pre-acquired vocabulary data based on the characteristic fusion network to obtain fusion text characteristics;

performing entity extraction on the fused text features based on the second branch network to obtain fused text entity features;

and carrying out entity recognition on the fused text entity characteristics based on the recognition network to obtain the entity type of the fused text entity characteristics.

In some embodiments, the feature extraction of the medical record word sequence through the first branch network of the preset named entity recognition model to obtain a medical record text feature includes:

performing word embedding processing on medical record words in the medical record word sequence through the first branch network to obtain word embedding vectors;

performing multi-head attention calculation on the word embedding vector to obtain a first attention calculation result;

and normalizing the first attention calculation result to obtain the medical record text characteristic.

In some embodiments, the fusing processing is performed on the medical record text feature and the pre-acquired vocabulary data based on the feature fusion network to obtain a fused text feature, including:

performing bilinear attention calculation on the medical record text characteristics and the vocabulary data based on the characteristic fusion network to obtain a second attention calculation result;

performing semantic enhancement on the second attention calculation result based on the medical record text characteristics to obtain intermediate text characteristics;

and carrying out standardization processing on the intermediate text features to obtain the fusion text features.
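The three fusion steps above can be illustrated with a small sketch. It is only one possible reading, under explicit assumptions that are not taken from the application: the bilinear score is taken as h_i^T W e_j, the "preset function" is assumed to be softmax, semantic enhancement is assumed to be a residual addition of the text feature, and standardization is assumed to be layer normalization without learned parameters. The names `bilinear_scores`, `layer_norm` and `fuse` are hypothetical, and the toy vectors are made up.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def bilinear_scores(h, w, e):
    """score[i][j] = h_i^T W e_j (assumed bilinear form)."""
    hw = [[sum(h_i[a] * w[a][b] for a in range(len(h_i))) for b in range(len(w[0]))]
          for h_i in h]
    return [[sum(x * y for x, y in zip(row, e_j)) for e_j in e] for row in hw]

def layer_norm(vec, eps=1e-6):
    """Standardize one vector to zero mean and (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def fuse(h, e, w):
    """h: text features (n x d), e: vocabulary embeddings (m x d), w: d x d matrix."""
    weights = [softmax(row) for row in bilinear_scores(h, w, e)]  # attention over words
    fused = []
    for h_i, a_i in zip(h, weights):
        # Attention-weighted sum of the vocabulary embeddings.
        context = [sum(a * e_j[k] for a, e_j in zip(a_i, e)) for k in range(len(h_i))]
        # Assumed "semantic enhancement": add the text feature back (residual).
        enhanced = [x + c for x, c in zip(h_i, context)]
        fused.append(layer_norm(enhanced))
    return fused, weights

# Toy inputs (illustrative values only).
h = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
e = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
fused, weights = fuse(h, e, w)
```

In this sketch each text position attends most strongly to the vocabulary entry it is most aligned with, which is one way the fusion could adapt the selected vocabulary to the context.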

In some embodiments, the performing bilinear attention calculation on the medical record text feature and the vocabulary data based on the feature fusion network to obtain a second attention calculation result includes:

performing feature multiplication on the medical record text features and the vocabulary data to obtain a first feature matrix;

performing sum pooling treatment on the first feature matrix to obtain a second feature matrix;

vectorizing the second feature matrix to obtain a bilinear feature vector;

normalizing the bilinear feature vector to obtain normalized features;

and performing attention calculation on the normalized features based on a preset function to obtain the second attention calculation result.

In some embodiments, the method further includes obtaining the vocabulary data, specifically including:

dividing the medical record text data to obtain text characters;

traversing a preset scanning dictionary, and selecting medical words containing the text characters in the scanning dictionary as candidate words;

and screening target words from the candidate words, and integrating the target words into the vocabulary data.

In some embodiments, the screening the target word from the plurality of candidate words includes:

acquiring part-of-speech categories of the candidate words;

and screening the target words from a plurality of candidate words based on the part-of-speech category.
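The vocabulary acquisition and screening steps above can be sketched as follows. This is a minimal illustration under assumptions: "containing the text characters" is read as "sharing at least one character with the medical record text", the scanning dictionary is modelled as a plain mapping from word to part-of-speech tag, and only nouns are kept. The function names, the toy dictionary and the tag set are all hypothetical.

```python
def split_characters(text):
    """Divide the medical record text into individual characters."""
    return [ch for ch in text if not ch.isspace()]

def select_candidates(text, scan_dictionary):
    """Traverse the dictionary and keep words that contain any text character."""
    chars = set(split_characters(text))
    return [word for word in scan_dictionary if chars & set(word)]

def screen_targets(candidates, pos_of, allowed_pos):
    """Keep only candidates whose part-of-speech category is allowed."""
    return [word for word in candidates if pos_of.get(word) in allowed_pos]

# Hypothetical scanning dictionary: word -> part-of-speech tag.
scan_dictionary = {
    "胃炎": "n",   # gastritis
    "胃痛": "n",   # stomach ache
    "服用": "v",   # to take (medicine)
    "头痛": "n",   # headache
    "痛苦": "a",   # painful
    "骨折": "n",   # fracture
}

text = "患者胃痛"  # "patient has stomach ache"
candidates = select_candidates(text, scan_dictionary)
vocabulary_data = screen_targets(candidates, scan_dictionary, allowed_pos={"n"})
```

Here `candidates` keeps every dictionary word that shares a character with the text, and the part-of-speech filter then drops the adjective, leaving only nouns in `vocabulary_data`.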

In some embodiments, the performing entity recognition on the fused text entity feature based on the recognition network to obtain an entity type of the fused text entity feature includes:

performing entity type scoring on the fused text entity characteristics based on the identification network to obtain type scoring data of the fused text entity characteristics;

and obtaining the entity type of the fused text entity characteristic according to the type scoring data.

To achieve the above object, a second aspect of an embodiment of the present application provides an electronic medical record named entity recognition device, including:

the data acquisition module is used for acquiring electronic medical record data, wherein the electronic medical record data comprises medical record text data;

the word segmentation module is used for carrying out word segmentation processing on the medical record text to obtain a medical record word sequence;

the feature extraction module is used for extracting features of the medical record word sequence through a first branch network of a preset named entity recognition model to obtain medical record text features, wherein the named entity recognition model comprises a feature fusion network, a second branch network and a recognition network;

the fusion module is used for carrying out fusion processing on the medical record text characteristics and the pre-acquired vocabulary data based on the characteristic fusion network to obtain fusion text characteristics;

the entity extraction module is used for performing entity extraction on the fused text features based on the second branch network to obtain the fused text entity features;

and the entity identification module is used for carrying out entity identification on the fused text entity characteristics based on the identification network to obtain the entity type of the fused text entity characteristics.

To achieve the above object, a third aspect of the embodiments of the present application provides an electronic device, which includes a memory, a processor, where the memory stores a computer program, and the processor implements the method described in the first aspect when executing the computer program.

To achieve the above object, a fourth aspect of the embodiments of the present application proposes a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of the first aspect.

According to the electronic medical record named entity recognition method and device, the electronic device and the storage medium provided by the application, electronic medical record data comprising medical record text data is acquired, and word segmentation is performed on the medical record text data to obtain a medical record word sequence, so that named entity recognition can be performed on the electronic medical record at the word level. Further, feature extraction is performed on the medical record word sequence through a first branch network of a preset named entity recognition model to obtain medical record text features, wherein the named entity recognition model further comprises a feature fusion network, a second branch network and a recognition network. The medical record text features and the pre-acquired vocabulary data are fused based on the feature fusion network to obtain fused text features. This introduces vocabulary knowledge into the named entity recognition process, which can greatly reduce the need for manual labeling, and allows the most suitable vocabulary to be adaptively selected and fused with the medical record text features according to different contexts and scenarios, improving the quality and comprehensiveness of the fused text features. Further, entity extraction is performed on the fused text features based on the second branch network to obtain fused text entity features, so that the fused text entity features can be extracted comprehensively. Finally, entity recognition is performed on the fused text entity features based on the recognition network to obtain the entity types of the fused text entity features, so that the specific type of each named entity feature can be accurately identified and the accuracy of electronic medical record named entity recognition is improved.

Drawings

FIG. 1 is a flowchart of a method for identifying named entities of an electronic medical record according to an embodiment of the present application;

fig. 2 is a flowchart of step S103 in fig. 1;

FIG. 3 is another flowchart of a method for identifying named entities of an electronic medical record according to an embodiment of the present application;

fig. 4 is a flowchart of step S303 in fig. 3;

fig. 5 is a flowchart of step S104 in fig. 1;

fig. 6 is a flowchart of step S501 in fig. 5;

fig. 7 is a flowchart of step S106 in fig. 1;

FIG. 8 is a schematic structural diagram of an electronic medical record named entity recognition device according to an embodiment of the present application;

fig. 9 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.

First, several terms involved in the present application are explained:

Artificial intelligence (artificial intelligence, AI): a new technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the nature of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence; research in this field includes robotics, language recognition, image recognition, natural language processing and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. Artificial intelligence is also the theory, method, technique and application system that uses a digital computer or a digital-computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results.

Natural language processing (natural language processing, NLP): NLP is a branch of artificial intelligence and an interdisciplinary field of computer science and linguistics, often referred to as computational linguistics; it is concerned with processing, understanding and applying human languages (e.g., Chinese, English, etc.). Natural language processing includes syntactic parsing, semantic analysis, discourse understanding, and the like. It is commonly used in machine translation, handwritten and printed character recognition, speech recognition and text-to-speech conversion, information intent recognition, information extraction and filtering, text classification and clustering, public opinion analysis and opinion mining, and involves data mining, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research, and linguistic research related to language computation.

Information extraction (Information Extraction, IE): a text processing technology that extracts factual information of specified types, such as entities, relations and events, from natural language text and outputs it as structured data. Information extraction is a technique for extracting specific information from text data. Text data is made up of specific units, such as sentences, paragraphs and chapters, and text information is made up of smaller specific units, such as words, phrases, sentences and paragraphs, or combinations of these units. Extracting noun phrases, person names, place names and the like from text data is text information extraction; the information extracted by text information extraction technology can of course be of various types.

BERT (Bidirectional Encoder Representations from Transformers): a language representation model (language representation model). BERT is built by stacking Transformer encoder blocks and is a typical bidirectional encoding model.

Conditional random field algorithm (conditional random field algorithm, CRF): a mathematical algorithm that combines the characteristics of the maximum entropy model and the hidden Markov model. It is an undirected graph model that has in recent years achieved good results in sequence labeling tasks such as word segmentation, part-of-speech tagging and named entity recognition. The conditional random field is a typical discriminative model whose joint probability can be written as a product of several potential functions, the linear-chain conditional random field being the most common form. Let $x = (x_1, x_2, \ldots, x_n)$ denote the observed input data sequence and $y = (y_1, y_2, \ldots, y_n)$ denote a state sequence. Given an input sequence, the linear-chain CRF model defines the joint conditional probability of the state sequence as

$$p(y \mid x) = \frac{1}{Z(x)} \exp\left\{ \sum_{i=1}^{n} \sum_{j} \lambda_j f_j(y_{i-1}, y_i, x, i) \right\} \qquad (2\text{-}14)$$

$$Z(x) = \sum_{y} \exp\left\{ \sum_{i=1}^{n} \sum_{j} \lambda_j f_j(y_{i-1}, y_i, x, i) \right\} \qquad (2\text{-}15)$$

wherein $Z(x)$ is the probability normalization factor conditioned on the observation sequence x, $f_j(y_{i-1}, y_i, x, i)$ is an arbitrary feature function, and $\lambda_j$ is the weight of the j-th feature function.
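As a worked illustration of the linear-chain CRF definition, the sketch below computes p(y|x) for a two-character input by enumerating every label sequence to obtain the normalization factor Z(x). It assumes the standard form in which each feature function f_j carries a weight; the two feature functions, their weights and the B/I/O label set are hypothetical, and brute-force enumeration is used only because the example is tiny.

```python
import itertools
import math

def sequence_score(y, x, feature_fns, weights):
    """Sum of weighted feature functions f_j(y_{i-1}, y_i, x, i) over all positions."""
    total = 0.0
    for i in range(len(x)):
        prev = y[i - 1] if i > 0 else None
        for weight, fn in zip(weights, feature_fns):
            total += weight * fn(prev, y[i], x, i)
    return total

def crf_probability(y, x, labels, feature_fns, weights):
    """p(y|x) = exp(score(y, x)) / Z(x), with Z(x) summed over all label sequences."""
    z = sum(
        math.exp(sequence_score(candidate, x, feature_fns, weights))
        for candidate in itertools.product(labels, repeat=len(x))
    )
    return math.exp(sequence_score(tuple(y), x, feature_fns, weights)) / z

# Hypothetical feature functions for a B/I/O tagging scheme.
def starts_entity(prev, cur, x, i):
    return 1.0 if cur == "B" and x[i] == "胃" else 0.0

def continues_entity(prev, cur, x, i):
    return 1.0 if prev == "B" and cur == "I" else 0.0

labels = ["B", "I", "O"]
x = ["胃", "痛"]
feature_fns = [starts_entity, continues_entity]
weights = [2.0, 1.5]

probabilities = {
    y: crf_probability(y, x, labels, feature_fns, weights)
    for y in itertools.product(labels, repeat=len(x))
}
best = max(probabilities, key=probabilities.get)
```

With these toy features the sequence ("B", "I") receives the highest probability, and the probabilities of all nine label sequences sum to one, as the normalization factor requires.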

In the field of digital healthcare, named entity recognition technology classifies the medical entity names contained in the text data of electronic medical records into preset categories, which can greatly advance medical information retrieval, question-answer dialogue in intelligent consultation, disease information extraction and the like. The preset categories include diseases, drugs, symptoms, treatments, and the like.

Based on the above, the embodiment of the application provides a method for identifying a named entity of an electronic medical record, a device for identifying the named entity of the electronic medical record, electronic equipment and a storage medium, aiming at improving the accuracy of identifying the named entity of the electronic medical record.

The method and device for identifying the named entities of the electronic medical record, the electronic equipment and the storage medium provided by the embodiment of the application are specifically described through the following embodiments, and the method for identifying the named entities of the electronic medical record in the embodiment of the application is described first.

The embodiment of the application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a digital-computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.

Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.

The embodiment of the application provides a method for identifying a named entity of an electronic medical record, and relates to the technical field of digital medical treatment. The electronic medical record naming entity identification method provided by the embodiment of the application can be applied to a terminal, a server and software running in the terminal or the server. In some embodiments, the terminal may be a smart phone, tablet, notebook, desktop, etc.; the server side can be configured as an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, and a cloud server for providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, basic cloud computing services such as big data and artificial intelligent platforms and the like; the software may be an application or the like that implements the electronic medical record named entity recognition method, but is not limited to the above form.

The application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

In the embodiments of the present application, when related processing is required according to data related to identity or characteristics of an object, such as object information, object behavior data, object history data, and object position information, permission or consent of the object is obtained first, and related laws and regulations and standards are complied with for collection, use, processing, and the like of the data. In addition, when the embodiment of the application needs to acquire the personal information of the object, the independent permission or independent agreement of the object is acquired through a popup window or a jump to a confirmation page and the like, and after the independent permission or independent agreement of the object is explicitly acquired, the necessary object related data for enabling the embodiment of the application to normally operate is acquired.

Fig. 1 is an optional flowchart of a method for identifying a named entity of an electronic medical record according to an embodiment of the present application, where the method in fig. 1 may include, but is not limited to, steps S101 to S106.

Step S101, acquiring electronic medical record data; wherein the electronic medical record data comprises medical record text data;

step S102, word segmentation processing is carried out on the medical record text data to obtain a medical record word sequence;

step S103, extracting features of a medical record word sequence through a first branch network of a preset named entity recognition model to obtain a text feature of the medical record, wherein the named entity recognition model comprises a feature fusion network, a second branch network and a recognition network;

step S104, fusion processing is carried out on the medical record text characteristics and the pre-acquired vocabulary data based on the characteristic fusion network, so as to obtain fusion text characteristics;

step S105, entity extraction is carried out on the fused text features based on the second branch network, and fused text entity features are obtained;

and step S106, carrying out entity recognition on the fused text entity characteristics based on the recognition network to obtain the entity types of the fused text entity characteristics.

In steps S101 to S106 of the embodiment of the application, electronic medical record data comprising medical record text data is acquired, and word segmentation is performed on the medical record text data to obtain a medical record word sequence, so that named entity recognition can be performed on the electronic medical record at the word level. Further, feature extraction is performed on the medical record word sequence through a first branch network of a preset named entity recognition model to obtain medical record text features, wherein the named entity recognition model further comprises a feature fusion network, a second branch network and a recognition network. The medical record text features and the pre-acquired vocabulary data are fused based on the feature fusion network to obtain fused text features. This introduces vocabulary knowledge into the named entity recognition process, which can greatly reduce the need for manual labeling, and allows the most suitable vocabulary to be adaptively selected and fused with the medical record text features according to different contexts and scenarios, improving the quality and comprehensiveness of the fused text features. Further, entity extraction is performed on the fused text features based on the second branch network to obtain fused text entity features, so that the fused text entity features can be extracted comprehensively. Finally, entity recognition is performed on the fused text entity features based on the recognition network to obtain the entity types of the fused text entity features, so that the specific type of each named entity feature can be accurately identified and the accuracy of electronic medical record named entity recognition is improved.

In step S101 of some embodiments, data extraction may be performed on a preset data source by a web crawler to obtain electronic medical record data of a target object, where the preset data source may be a medical cloud platform or a medical database, and the target object may be a common patient, a medical case, and the like, without limitation. The electronic medical record data may be a medical consultation record of the target object, where the electronic medical record data includes medical record text data, historical disease data, historical medication data, historical operation data, and the like of the target object, the historical disease data includes disease description information of the target object in a certain medical consultation, the historical medication data includes drug usage information of the target object in a certain medical consultation, and the historical operation data includes operation description information of the target object in a certain medical consultation. The medical record text data comprises information such as complaints, disease descriptions, diagnosis results and the like of the target object in a certain medical consultation.

In one possible implementation, the electronic medical record data is natural language text. The electronic medical record data may be an electronic healthcare record (Electronic Healthcare Record) or an electronic personal health record, comprising a series of electronic records worth preserving for future reference, such as medical records, electrocardiograms and medical images.

In step S102 of some embodiments, the Jieba word segmentation tool may be used to segment the medical record text data into a medical record word sequence. Specifically, the Jieba word segmentation tool first constructs a prefix dictionary from a preset statistical dictionary. Each sentence of the medical record text data is then scanned against the prefix dictionary to generate all possible word formations, yielding the segmentation results, and a directed acyclic graph is constructed from the segmentation positions indicated by the segmentation results. Further, the maximum-probability path is computed by a dynamic programming algorithm, and the segmentation corresponding to the maximum-probability path is taken as the final segmentation result, thereby obtaining the medical record word sequence. In this way, word segmentation can be performed efficiently and accurately.
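The prefix dictionary, directed acyclic graph and dynamic-programming steps can be sketched in plain Python as below. This is an illustrative re-implementation of the general idea rather than the Jieba tool's own code; the function names and the word frequencies in the toy statistical dictionary are hypothetical.

```python
import math

def build_prefix_dict(word_freq):
    """Map every word to its frequency and every proper prefix of a word to 0 if unseen."""
    prefix = {}
    for word, freq in word_freq.items():
        prefix[word] = freq
        for end in range(1, len(word)):
            prefix.setdefault(word[:end], 0)
    return prefix, sum(word_freq.values())

def build_dag(sentence, prefix):
    """For each start index, list the end indices of all dictionary words starting there."""
    dag = {}
    for start in range(len(sentence)):
        ends, end = [], start
        fragment = sentence[start]
        while end < len(sentence) and fragment in prefix:
            if prefix[fragment] > 0:
                ends.append(end)
            end += 1
            fragment = sentence[start:end + 1]
        dag[start] = ends or [start]  # fall back to a single character
    return dag

def best_route(sentence, dag, prefix, total):
    """Right-to-left dynamic programming over log-probabilities of the candidate words."""
    n = len(sentence)
    route = {n: (0.0, 0)}
    log_total = math.log(total)
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (math.log(prefix.get(sentence[start:end + 1]) or 1) - log_total
             + route[end + 1][0], end)
            for end in dag[start]
        )
    return route

def segment(sentence, word_freq):
    prefix, total = build_prefix_dict(word_freq)
    dag = build_dag(sentence, prefix)
    route = best_route(sentence, dag, prefix, total)
    words, start = [], 0
    while start < len(sentence):
        end = route[start][1]
        words.append(sentence[start:end + 1])
        start = end + 1
    return words

# Toy statistical dictionary with made-up frequencies.
word_freq = {"患者": 50, "胃": 20, "胃痛": 30, "痛": 20, "三": 10, "天": 20, "三天": 25}
words = segment("患者胃痛三天", word_freq)
```

On this toy input the maximum-probability path prefers the two-character words over their single-character alternatives, so the sentence is split into 患者 / 胃痛 / 三天.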

In the embodiment of the application, the named entity recognition model is constructed based on the BERT model. Specifically, the named entity recognition model comprises a first branch network, a feature fusion network, a second branch network and a recognition network, wherein the first branch network and the second branch network are both in a transformer structure, the feature fusion network comprises a bilinear attention layer and a standardization layer, and the recognition network is constructed based on a conditional random field algorithm (conditional random field, CRF).

Referring to fig. 2, in some embodiments, step S103 may include, but is not limited to, steps S201 to S203:

step S201, word embedding processing is carried out on medical record words in the medical record word sequence through a first branch network, so as to obtain word embedding vectors;

step S202, performing multi-head attention calculation on word embedding vectors to obtain a first attention calculation result;

step S203, the first attention calculation result is normalized to obtain the text characteristics of the medical record.

The following describes the above steps S201 to S203 in detail.

The first branch network includes an embedding layer, a bidirectional Transformer encoding layer, and an output layer. The embedding layer is the input of the first branch network; its output is the sum of the word embedding, the position embedding and the type embedding, which represent word information, position information and sentence-pair information respectively. The Transformer consists of a self-attention mechanism and a feedforward neural network. The self-attention mechanism computes the degree of association between the words in a text sequence and adjusts the weight coefficients according to that degree of association, which is calculated as shown in formula (1):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad \text{formula (1)}$$

wherein Q represents the query vector, K represents the key vector, and V represents the value vector; the scaling factor $\sqrt{d_k}$ is introduced so that the inner product of the query vector Q and the key vector K does not become excessively large, where $d_k$ represents the dimension of the input vectors.
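The association-degree calculation of formula (1) is the standard scaled dot-product attention, which can be sketched in plain Python as follows. The helper names and the toy matrices are illustrative assumptions; a real model would use a tensor library and learned parameters.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V, returning the output and the attention weights."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy query, key and value matrices (two positions, dimension two).
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
output, weights = attention(q, k, v)
```

Each row of `weights` sums to one, and a query attends most strongly to the key it is most aligned with; dividing by the square root of d_k keeps the scores from growing with the vector dimension.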

In step S201 of some embodiments, word embedding processing is performed by the embedding layer on the medical record words in the medical record word sequence to obtain the word information of the medical record words, yielding the word embedding vectors.
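As an illustrative sketch only, the embedding-layer input described above (the sum of word, position and type embeddings) might be computed as follows. The function name embed, the lookup tables and the embedding dimension of 2 are hypothetical and are not specified in the application.

```python
def embed(token_ids, type_ids, word_table, position_table, type_table):
    """Return one embedding vector per token: word + position + type."""
    vectors = []
    for position, (token_id, type_id) in enumerate(zip(token_ids, type_ids)):
        word_vec = word_table[token_id]
        pos_vec = position_table[position]
        type_vec = type_table[type_id]
        # Element-wise sum of the three embeddings.
        vectors.append([w + p + t for w, p, t in zip(word_vec, pos_vec, type_vec)])
    return vectors

# Hypothetical lookup tables with embedding dimension 2.
WORD_TABLE = {0: [0.5, 0.0], 1: [0.0, 0.5]}
POSITION_TABLE = [[0.25, 0.25], [0.5, 0.5]]
TYPE_TABLE = {0: [1.0, 1.0]}

embeddings = embed([1, 0], [0, 0], WORD_TABLE, POSITION_TABLE, TYPE_TABLE)
```

In a trained model the three tables would be learned parameters; here they are fixed by hand so that the summation is easy to follow.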

In step S202 of some embodiments, the Transformer encoding structure uses a multi-head attention mechanism: Q, K and V are passed through several different linear mappings, Attention(Q, K, V) is computed separately on each set of mapped Q, K and V, and the results are spliced together. Specifically, for each word embedding vector, the query vector Q, key vector K and value vector V corresponding to the word embedding vector are first generated by the bidirectional Transformer encoding layer. The query vector Q, key vector K and value vector V are then subjected to several different linear mappings, Attention(Q, K, V) is calculated on each newly obtained query vector, key vector and value vector to give the calculation result of each attention head, and all the calculation results are spliced to obtain the first attention calculation result. The first attention calculation result may be expressed as shown in formula (2):

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_i, \ldots, \mathrm{head}_n)\,W \tag{2}$$

Wherein MultiHead(Q, K, V) is the first attention calculation result, $\mathrm{head}_i$ is the calculation result of the i-th attention head, and W is a preset weight matrix.

Wherein, the calculation result of the i-th attention head can be expressed as shown in formula (3):

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \tag{3}$$

wherein $W_i^{Q}$ is the query vector weight of the i-th attention head, $W_i^{K}$ is the key vector weight of the i-th attention head, and $W_i^{V}$ is the value vector weight of the i-th attention head.
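As an illustrative sketch only, the splicing of attention heads described by formulas (2) and (3) might be written as follows, assuming each head applies its own query, key and value weight matrices before scaled dot-product attention. The function names, the two one-dimensional heads and the identity output matrix are assumptions made for this example.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for one head."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)

def multi_head(q, k, v, heads, w_out):
    """heads: list of (W_q, W_k, W_v) per head; w_out: the output matrix W."""
    outputs = [attention(matmul(q, wq), matmul(k, wk), matmul(v, wv))
               for wq, wk, wv in heads]
    # Splice (concatenate) the per-head results along the feature axis.
    concat = [sum((head[i] for head in outputs), []) for i in range(len(q))]
    return matmul(concat, w_out)

# Toy self-attention input: 2 words, dimension 2; each head keeps one dimension.
X = [[1.0, 0.0], [0.0, 1.0]]
HEADS = [
    ([[1.0], [0.0]], [[1.0], [0.0]], [[1.0], [0.0]]),
    ([[0.0], [1.0]], [[0.0], [1.0]], [[0.0], [1.0]]),
]
W_OUT = [[1.0, 0.0], [0.0, 1.0]]
result = multi_head(X, X, X, HEADS, W_OUT)
```

Because each head sees a different linear projection of the same input, the spliced result combines association patterns that a single head would have to average together.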

In step S203 of some embodiments, when the first attention calculation result is normalized, a multiplication operation may be performed on the first attention calculation result by using a preset matrix parameter and a bias vector, so as to obtain a medical record text feature. Specifically, the process of normalizing the first attention calculation result may be expressed as shown in formula (4):

$$M = \max(0,\ W_1 Z + b_1)\,W_2 + b_2 \tag{4}$$

Wherein M is the medical record text feature, Z is the first attention calculation result, $W_1$ and $W_2$ are preset matrix parameters determined according to actual conditions, and $b_1$ and $b_2$ are preset bias vectors determined according to actual conditions.
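As an illustrative sketch only, formula (4) might be evaluated as follows. The sketch treats each row of Z as one word and multiplies in the order Z·W1, which is the usual row-vector convention and an assumption about the shapes intended by formula (4); the parameter values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add_bias(matrix, bias):
    """Add the bias vector to every row of the matrix."""
    return [[x + b for x, b in zip(row, bias)] for row in matrix]

def relu(matrix):
    """Element-wise max(0, x)."""
    return [[max(0.0, x) for x in row] for row in matrix]

def feed_forward(z, w1, b1, w2, b2):
    """M = max(0, Z W1 + b1) W2 + b2, applied to every row of Z."""
    hidden = relu(add_bias(matmul(z, w1), b1))
    return add_bias(matmul(hidden, w2), b2)

# Hypothetical parameters: one word with 2 features, projected to 1 output feature.
Z = [[1.0, -2.0]]
W1 = [[1.0, 0.0], [0.0, 1.0]]
B1 = [0.0, 0.0]
W2 = [[2.0], [3.0]]
B2 = [0.5]
M = feed_forward(Z, W1, B1, W2, B2)
```

In this toy case the negative hidden value is clipped to zero by the max operation, so only the first feature contributes to M.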

Through the steps S201 to S203, the first branch network can conveniently extract the word information of the medical record words in the medical record word sequence, so that the semantic content of the medical record word sequence is effectively identified and medical record text features representing the text feature information of the electronic medical record data are obtained.

Referring to fig. 3, before step S104 of some embodiments, the electronic medical record named entity recognition method may include obtaining vocabulary data, including but not limited to steps S301 to S303:

step S301, segmentation processing is carried out on the medical record text data to obtain text characters;

step S302, traversing a preset scanning dictionary, and selecting medical words containing text characters in the scanning dictionary as candidate words;

step S303, screening target words from the plurality of candidate words, and integrating the target words into vocabulary data.

The following describes the above steps S301 to S303 in detail.

In step S301 of some embodiments, common text segmentation software may be used to segment the medical record text data, converting it from sentence level to character level and splitting each sentence in the medical record text data into a plurality of characters to obtain the text characters.

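For example, the sentence "subject had hypertension" in the medical record text data is segmented character by character. Each character of the original sentence becomes one text character, and the resulting text characters, rendered here character by character, are "subject", "object", "subject", "patient", "high", "blood", "pressure", "symptom".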

In step S302 of some embodiments, the preset scan dictionary includes a plurality of medical terms configured according to medical knowledge and terms commonly used in the medical field. Specifically, traversing a scanning dictionary, querying medical words containing text characters in the scanning dictionary, and taking the queried medical words as candidate words.

In step S303 of some embodiments, all candidate words may be used as target words, or only some of the candidate words may be selected as target words according to a screening condition; this is not limited. The target words are then integrated into one set to obtain the vocabulary data.

Through the steps S301 to S303, vocabulary expansion can be performed according to text characters in the medical record text data, more vocabulary knowledge is introduced, so that the named entity recognition process can be suitable for complex medical scenes, ambiguity caused by different contexts on the use of phrases is reduced, and accuracy of recognition of named entities of electronic medical records is improved.
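As an illustrative sketch only, steps S301 to S303 might be implemented as follows, assuming that a dictionary word qualifies as a candidate when it shares at least one character with the medical record text. The sample sentence, the dictionary entries and the function names are hypothetical.

```python
def segment_characters(text):
    """S301: split the text into individual characters, skipping whitespace."""
    return [ch for ch in text if not ch.isspace()]

def find_candidates(characters, scan_dictionary):
    """S302: traverse the dictionary and keep words containing any text character."""
    char_set = set(characters)
    return [word for word in scan_dictionary if char_set & set(word)]

def build_vocabulary(candidates, keep=lambda word: True):
    """S303: screen target words and collect them, dropping duplicates in order."""
    return list(dict.fromkeys(word for word in candidates if keep(word)))

# Hypothetical scan dictionary: hypertension, blood pressure, diabetes, hyperlipidemia.
SCAN_DICTIONARY = ["高血压", "血压", "糖尿病", "高血脂"]

# Hypothetical medical record sentence ("the subject has hypertension").
characters = segment_characters("受试者患高血压症")
candidates = find_candidates(characters, SCAN_DICTIONARY)
vocabulary = build_vocabulary(candidates)
```

The keep argument defaults to accepting every candidate, which corresponds to using all candidate words as target words; a stricter predicate can be passed in to apply a screening condition.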

If all candidate words are used as target words, the number of target words is large, which tends to increase the amount of computation of the named entity recognition model and is not conducive to the efficiency of named entity recognition. Therefore, the embodiment of the application screens out some of the candidate words as target words according to the part-of-speech category of the candidate words or the scene characteristics of different medical scenes, so that the number of target words is effectively reduced, the calculation efficiency of the model is improved, and the efficiency of named entity recognition is further improved.

Referring to fig. 4, in some embodiments, step S303 may include, but is not limited to, steps S401 to S402:

step S401, acquiring part-of-speech categories of candidate words;

step S402, screening target words from a plurality of candidate words based on part-of-speech categories.

The following describes the above steps S401 to S402 in detail.

In step S401 of some embodiments, the part-of-speech categories include nouns, adjectives, verbs, adverbs, and the like. Specifically, a long short-term memory (LSTM) model may be used to identify the part of speech of each candidate word. First, the LSTM model performs word vectorization on the candidate words to obtain candidate word vectors; then, the LSTM model performs part-of-speech recognition on the word sequence formed from the candidate word vectors and outputs the part-of-speech category of each candidate word.

In step S402 of some embodiments, candidate words with a part-of-speech category meeting the requirements are selected as target words according to the part-of-speech category of each candidate word. Specifically, in order to improve accuracy of named entity recognition, candidate words with part of speech class being nouns are generally selected as target words.

Through the steps S401 to S402, candidate words can be screened according to the part-of-speech category of the candidate words and the scene characteristics of the medical scene of specific application, and a part of the candidate words are used as target words, so that the number of the target words is effectively reduced, the calculation efficiency of the model is improved, and the recognition efficiency of the named entities is further improved.
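As an illustrative sketch only, the screening of steps S401 to S402 might look as follows. A hypothetical lookup table stands in for the LSTM part-of-speech tagger of step S401, so this sketch shows only the filtering logic, not the tagging model; all names and entries are assumptions made for this example.

```python
# Hypothetical part-of-speech table standing in for the LSTM tagger of step S401.
POS_LOOKUP = {
    "hypertension": "noun",
    "severe": "adjective",
    "diagnose": "verb",
    "insulin": "noun",
}

def get_pos(word, lookup=POS_LOOKUP):
    """S401: return the part-of-speech category of a candidate word."""
    return lookup.get(word, "unknown")

def screen_targets(candidates, allowed=("noun",)):
    """S402: keep only candidates whose part of speech is in the allowed set."""
    return [word for word in candidates if get_pos(word) in allowed]

targets = screen_targets(["hypertension", "severe", "diagnose", "insulin", "tremor"])
```

The allowed tuple defaults to nouns, matching the preference stated above; other categories can be admitted for a specific medical scene by passing a different tuple.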

Referring to fig. 5, in some embodiments, step S104 may include, but is not limited to, steps S501 to S503:

step S501, bilinear attention calculation is carried out on the text characteristics and the vocabulary data of the medical record based on the characteristic fusion network, and a second attention calculation result is obtained;

step S502, carrying out semantic enhancement on the second attention calculation result based on the medical record text characteristics to obtain intermediate text characteristics;

step S503, the intermediate text features are standardized to obtain the fusion text features.

The following describes the above steps S501 to S503 in detail.

In step S501 of some embodiments, bilinear attention calculation is performed on the medical record text features and the vocabulary data based on the feature fusion network, so that attention is computed between the medical record text features and each target word in the vocabulary data, a different weight is assigned to each target word, and the medical record text features and the target words are fused in a weighted manner to obtain the second attention calculation result.

In step S502 of some embodiments, when the second attention calculation result is semantically enhanced based on the medical record text features, the medical record text features and the second attention calculation result are feature-fused, so that the semantic information contained in the medical record text features is added to the second attention calculation result, thereby obtaining the intermediate text features.

In step S503 of some embodiments, when the intermediate text features are standardized, the intermediate text features are scaled so that the scaled text features lie in a fixed feature interval, which standardizes the features and yields fused text features with a more standardized form of expression.

Through the steps S501 to S503, vocabulary knowledge can be introduced in the named entity recognition process, the requirement on manual labeling can be greatly reduced, and the most suitable vocabulary and medical record text features can be adaptively selected to be fused according to different contexts and different scenes in the named entity recognition process, so that the feature quality and feature content comprehensiveness of the fused text features are improved, and the accuracy of named entity recognition of the electronic medical record is further improved.
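As an illustrative sketch only, steps S502 and S503 might be realized as follows. Two assumptions are made that the application does not state: the semantic enhancement is taken to be an element-wise addition of the medical record text features and the second attention calculation result, and the standardization is taken to be layer normalization (zero mean and unit variance per feature vector). The function names and values are hypothetical.

```python
import math

def enhance(text_features, attended):
    """S502 (assumed): element-wise addition of text features and attention result."""
    return [[t + a for t, a in zip(t_row, a_row)]
            for t_row, a_row in zip(text_features, attended)]

def layer_norm(rows, eps=1e-6):
    """S503 (assumed): scale each feature vector to zero mean and unit variance."""
    normed = []
    for row in rows:
        mean = sum(row) / len(row)
        var = sum((x - mean) ** 2 for x in row) / len(row)
        normed.append([(x - mean) / math.sqrt(var + eps) for x in row])
    return normed

# Hypothetical features for a single word with 3 dimensions.
text_features = [[1.0, 2.0, 3.0]]
second_attention = [[0.5, 0.5, 2.0]]
intermediate = enhance(text_features, second_attention)
fused = layer_norm(intermediate)
```

Adding the original text features back in keeps the word-level semantics available even when the vocabulary contributes little, and the scaling keeps the fused features in a comparable range for the second branch network.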

Referring to fig. 6, in some embodiments, step S501 includes, but is not limited to, steps S601 to S605:

step S601, performing feature multiplication on the text features of the medical records and the vocabulary data to obtain a first feature matrix;

step S602, performing sum pooling treatment on the first feature matrix to obtain a second feature matrix;

step S603, vectorizing the second feature matrix to obtain bilinear feature vectors;

step S604, carrying out normalization processing on bilinear feature vectors to obtain normalized features;

step S605, attention calculation is performed on the normalized features based on a preset function, and a second attention calculation result is obtained.

The following describes the above steps S601 to S605 in detail.

In step S601 of some embodiments, when feature multiplication is performed on the medical record text features and the vocabulary data, the medical record text features are first transposed to obtain a transposed result, and the transposed result is then multiplied with each target word in the vocabulary data to obtain a plurality of first feature matrices.

In step S602 of some embodiments, each first feature matrix is first pooled to obtain the pooling result of that first feature matrix; the pooling results of all the first feature matrices are then summed to obtain the second feature matrix.

In step S603 of some embodiments, the second feature matrix is vectorized by mapping it to a preset vector space to obtain a bilinear feature vector with a fixed dimension.

In step S604 of some embodiments, the bilinear feature vector is normalized so that it is limited to a fixed value range, thereby obtaining the normalized features.

In step S605 of some embodiments, the preset function may be a softmax function. The specific process of performing attention calculation on the normalized features based on the preset function to obtain the second attention calculation result is similar to the attention calculation of formula (1) above and, for brevity, is not repeated here.

Through the steps S601 to S605, the most suitable vocabulary can be adaptively selected for fusion with the medical record text features according to different contexts and scenes in the named entity recognition process, thereby improving the accuracy of feature fusion.
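As an illustrative sketch only, one possible reading of steps S601 to S605 for a single word is given below. The sketch forms the outer product of the text feature and each target-word vector (S601), reduces each product to a score by weighted sum pooling (S602), collects the scores into a vector (S603), applies L2 normalization (S604) and a softmax (S605), and then weights the target-word vectors. It keeps one pooled score per target word rather than summing across words, so that the softmax has one entry per word; this choice, the weight matrix, the L2 normalization and all names are assumptions and not details given in the application.

```python
import math

def outer(h, e):
    """S601: first feature matrix for one target word, the outer product of h and e."""
    return [[hi * ej for ej in e] for hi in h]

def sum_pool(matrix, weights):
    """S602: weighted sum pooling of a feature matrix down to one score."""
    return sum(m * w for m_row, w_row in zip(matrix, weights)
               for m, w in zip(m_row, w_row))

def l2_normalize(vec, eps=1e-12):
    """S604: scale the vector to (approximately) unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]

def softmax(vec):
    m = max(vec)
    exps = [math.exp(x - m) for x in vec]
    total = sum(exps)
    return [e / total for e in exps]

def bilinear_attention(h, word_vectors, weights):
    """Return (attention weights over target words, weighted sum of word vectors)."""
    scores = [sum_pool(outer(h, e), weights) for e in word_vectors]  # S601 to S603
    attn = softmax(l2_normalize(scores))                             # S604 to S605
    fused = [sum(a * e[k] for a, e in zip(attn, word_vectors))
             for k in range(len(word_vectors[0]))]
    return attn, fused

# Hypothetical text feature, two target-word vectors, and an identity weight matrix.
h = [1.0, 0.0]
word_vectors = [[1.0, 0.0], [0.0, 1.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]
attn, fused = bilinear_attention(h, word_vectors, weights)
```

With the identity weight matrix the pooled score reduces to the dot product of h and each word vector, so the first target word, which points in the same direction as h, receives the larger attention weight.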

In step S105 of some embodiments, entity extraction is performed on the fused text features based on the second branch network to obtain the fused text entity features. The network structure of the second branch network is the same as that of the first branch network, so the specific process of entity extraction is similar to that of steps S201 to S203 above and, for brevity, is not repeated here.

Referring to fig. 7, in some embodiments, step S106 may include, but is not limited to, steps S701 to S702:

step S701, entity type scoring is carried out on the fused text entity characteristics based on the recognition network, and type scoring data of the fused text entity characteristics are obtained;

step S702, obtaining the entity type of the fused text entity characteristic according to the type grading data.

The following describes the above steps S701 to S702 in detail.

The recognition network is constructed based on a conditional random field algorithm. For an input feature vector and a predicted sequence corresponding to that feature vector, the recognition network scores the predicted sequence to obtain the probability of generating the predicted sequence, and outputs the predicted labeling sequence for which the likelihood function of that probability is maximal.

In step S701 of some embodiments, when the recognition network performs entity type scoring on the fused text entity features, a predicted sequence Y corresponding to the fused text entity features X is taken, and the recognition network scores the predicted sequence Y to obtain the probability of the predicted sequence Y. The scoring process of the recognition network is expressed as formula (5):

$$S(X, Y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i} \tag{5}$$

Wherein S(X, Y) represents the type scoring data of the predicted sequence Y, A represents the transition score matrix, n represents the number of words in the medical record text data, $y_i$ represents the i-th entity type label of the predicted sequence Y, $A_{y_i, y_{i+1}}$ represents the score of the transition from entity type label $y_i$ to entity type label $y_{i+1}$, P is the score matrix output by the preceding layer, and $P_{i, y_i}$ represents the score of entity type label $y_i$ for the i-th word. The probability of generating the predicted sequence is expressed as shown in formula (6):

$$p(Y \mid X) = \frac{e^{S(X, Y)}}{\sum_{\tilde{Y} \in Y_X} e^{S(X, \tilde{Y})}} \tag{6}$$

wherein $S(X, \tilde{Y})$ represents the type scoring data of the predicted sequence $\tilde{Y}$; $\tilde{Y}$ is a true labeling sequence, and $Y_X$ represents all possible annotation sequences.

In step S702 of some embodiments, the predicted sequence is screened according to the type scoring data, and the predicted sequence with the highest type scoring data is used as the optimal predicted tag sequence, so that the entity type in the predicted tag sequence is used as the entity type of the fused text entity feature.

Through the steps S701 to S702, named entity recognition can be performed by quantitative scoring: the predicted sequence with the highest type scoring data is taken as the optimal predicted tag sequence, and the entity types in that predicted tag sequence are taken as the entity types of the fused text entity features, so that the accuracy of named entity recognition can be improved.
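As an illustrative sketch only, the scoring of formula (5), the probability of formula (6) and the selection of the highest-scoring sequence in step S702 might be written as follows. The sketch assumes explicit start and end boundary tags for the transitions, enumerates every tag sequence to compute the normalizing sum (feasible only for toy sizes), and uses Viterbi dynamic programming for the selection; the label names, scores and function names are hypothetical.

```python
import math
from itertools import product

START, END = "<start>", "<end>"  # hypothetical boundary tags

def score(emissions, transitions, tags):
    """Formula (5): transition scores along the path plus per-word emission scores."""
    path = [START] + list(tags) + [END]
    trans = sum(transitions[(a, b)] for a, b in zip(path, path[1:]))
    emit = sum(emissions[i][t] for i, t in enumerate(tags))
    return trans + emit

def sequence_probability(emissions, transitions, tags, labels):
    """Formula (6): exp(S) normalized over every possible tag sequence (toy sizes only)."""
    all_scores = [score(emissions, transitions, cand)
                  for cand in product(labels, repeat=len(emissions))]
    m = max(all_scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in all_scores))
    return math.exp(score(emissions, transitions, tags) - log_z)

def viterbi(emissions, transitions, labels):
    """S702: highest-scoring tag sequence and its score, by dynamic programming."""
    best = {t: (transitions[(START, t)] + emissions[0][t], [t]) for t in labels}
    for emit in emissions[1:]:
        new_best = {}
        for t in labels:
            prev = max(labels, key=lambda p: best[p][0] + transitions[(p, t)])
            new_best[t] = (best[prev][0] + transitions[(prev, t)] + emit[t],
                           best[prev][1] + [t])
        best = new_best
    last = max(labels, key=lambda t: best[t][0] + transitions[(t, END)])
    return best[last][1], best[last][0] + transitions[(last, END)]

# Hypothetical labels and scores for a two-word input.
LABELS = ["O", "B-DIS"]
TRANSITIONS = {(a, b): 0.0 for a in [START] + LABELS for b in LABELS + [END]}
TRANSITIONS[("B-DIS", "B-DIS")] = -1.0  # discourage two consecutive entity starts
EMISSIONS = [{"O": 0.1, "B-DIS": 2.0}, {"O": 1.5, "B-DIS": 0.2}]

best_tags, best_score = viterbi(EMISSIONS, TRANSITIONS, LABELS)
best_prob = sequence_probability(EMISSIONS, TRANSITIONS, best_tags, LABELS)
```

Exhaustive enumeration grows exponentially with sequence length, so practical CRF layers compute the normalizing sum with the forward algorithm instead; the enumeration is used here only to keep the correspondence with formula (6) visible.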

The embodiment of the application discloses a method for identifying named entities in an electronic medical record. The method obtains electronic medical record data, wherein the electronic medical record data comprises medical record text data, and performs word segmentation processing on the medical record text data to obtain a medical record word sequence, so that named entity recognition can be performed on the electronic medical record at the word level. Further, feature extraction is performed on the medical record word sequence through a first branch network of a preset named entity recognition model to obtain medical record text features, wherein the named entity recognition model further comprises a feature fusion network, a second branch network and a recognition network. The medical record text features and the pre-acquired vocabulary data are fused based on the feature fusion network to obtain fused text features. In this way, vocabulary knowledge can be introduced into the named entity recognition process, the need for manual labeling can be greatly reduced, and the most suitable vocabulary can be adaptively selected for fusion with the medical record text features according to different contexts and scenes, which improves the feature quality and the comprehensiveness of the feature content of the fused text features. Further, entity extraction is performed on the fused text features based on the second branch network to obtain fused text entity features, so that the fused text entity features can be extracted relatively comprehensively. Finally, entity recognition is performed on the fused text entity features based on the recognition network to obtain the entity types of the fused text entity features, so that the specific types of the named entity features can be accurately identified and the accuracy of named entity recognition for electronic medical records is improved.

Referring to fig. 8, an embodiment of the present application further provides an electronic medical record named entity identifying device, which may implement the above electronic medical record named entity identifying method, where the device includes:

a data obtaining module 801, configured to obtain electronic medical record data, where the electronic medical record data includes medical record text data;

the word segmentation module 802 is configured to perform word segmentation processing on the medical record text data to obtain a medical record word sequence;

the feature extraction module 803 is configured to perform feature extraction on a medical record word sequence through a first branch network of a preset named entity recognition model to obtain a text feature of a medical record, where the named entity recognition model includes a feature fusion network, a second branch network, and a recognition network;

the fusion module 804 is configured to perform fusion processing on the medical record text feature and the pre-acquired vocabulary data based on the feature fusion network, so as to obtain a fused text feature;

the entity extraction module 805 is configured to perform entity extraction on the fused text feature based on the second branch network, so as to obtain a fused text entity feature;

the entity recognition module 806 is configured to perform entity recognition on the fused text entity feature based on the recognition network, so as to obtain an entity type of the fused text entity feature.

The specific implementation manner of the electronic medical record named entity recognition device is basically the same as the specific implementation manner of the electronic medical record named entity recognition method, and is not repeated here.
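As an illustrative sketch only, the cooperation of modules 802 to 806 might be expressed as a chain of interchangeable callables, with the data obtained by module 801 passed in as an argument. The class name NerPipeline, the field names and the trivial stand-in functions are hypothetical and only show the order in which the modules hand data to one another; they do not implement the networks themselves.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class NerPipeline:
    segment: Callable[[str], List[str]]            # word segmentation module 802
    extract_features: Callable[[List[str]], Any]   # feature extraction module 803
    fuse: Callable[[Any, List[str]], Any]          # fusion module 804
    extract_entities: Callable[[Any], Any]         # entity extraction module 805
    recognize: Callable[[Any], List[str]]          # entity recognition module 806

    def run(self, record_text: str, vocabulary: List[str]) -> List[str]:
        """Pass the medical record text through the modules in order."""
        words = self.segment(record_text)
        features = self.extract_features(words)
        fused = self.fuse(features, vocabulary)
        entity_features = self.extract_entities(fused)
        return self.recognize(entity_features)

# Trivial stand-ins so the data flow can be exercised end to end.
pipeline = NerPipeline(
    segment=str.split,
    extract_features=lambda words: [len(w) for w in words],
    fuse=lambda feats, vocab: [f + len(vocab) for f in feats],
    extract_entities=lambda feats: feats,
    recognize=lambda feats: ["ENTITY" if f > 8 else "O" for f in feats],
)
labels = pipeline.run("patient has hypertension", ["hypertension"])
```

Keeping each module behind a plain callable mirrors the statement that the modules are logical divisions: any stage can be replaced or relocated without changing the others.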

The embodiment of the application also provides electronic equipment, which comprises a memory, a processor, a program stored in the memory and executable on the processor, and a data bus for connection and communication between the processor and the memory, wherein the program, when executed by the processor, implements the electronic medical record named entity recognition method described above. The electronic equipment can be any intelligent terminal, including a tablet computer, a vehicle-mounted computer and the like.

Referring to fig. 9, fig. 9 illustrates a hardware structure of an electronic device according to another embodiment, the electronic device includes:

the processor 901 may be implemented by a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, etc., and is used for executing related programs to implement the technical solution provided by the embodiments of the present application;

the memory 902 may be implemented in the form of read-only memory (ROM), static storage, dynamic storage, or random access memory (RAM). The memory 902 may store an operating system and other application programs. When the technical solution provided in the embodiments of the present disclosure is implemented by software or firmware, the relevant program code is stored in the memory 902 and is invoked by the processor 901 to execute the electronic medical record named entity recognition method of the embodiments of the present disclosure;

an input/output interface 903 for inputting and outputting information;

the communication interface 904 is configured to implement communication interaction between the device and other devices, and may implement communication in a wired manner (e.g. USB, network cable, etc.), or may implement communication in a wireless manner (e.g. mobile network, WIFI, bluetooth, etc.);

a bus 905 that transfers information between the various components of the device (e.g., the processor 901, the memory 902, the input/output interface 903, and the communication interface 904);

wherein the processor 901, the memory 902, the input/output interface 903 and the communication interface 904 are communicatively coupled to each other within the device via a bus 905.

The embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to realize the electronic medical record named entity identification method.

The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The embodiment of the application provides a method for identifying a named entity of an electronic medical record, a device for identifying the named entity of the electronic medical record, electronic equipment and a computer readable storage medium, wherein the method comprises the steps of acquiring data of the electronic medical record; wherein the electronic medical record data comprises medical record text data; and performing word segmentation processing on the medical record text data to obtain a medical record word sequence, so that named entity recognition can be performed on the electronic medical record on a word level. Further, feature extraction is carried out on the medical record word sequence through a first branch network of a preset named entity recognition model to obtain the text feature of the medical record, wherein the named entity recognition model comprises a feature fusion network, a second branch network and a recognition network; the medical record text features and the pre-acquired vocabulary data are fused based on the feature fusion network to obtain the fused text features, vocabulary knowledge can be introduced in the named entity recognition process, the requirement for manual labeling can be greatly reduced, the most suitable vocabulary and the medical record text features can be adaptively selected for fusion according to different contexts and different scenes in the named entity recognition process, and the feature quality and feature content comprehensiveness of the fused text features are improved. 
Further, entity extraction is performed on the fused text features based on the second branch network to obtain fused text entity features, so that the fused text entity features can be extracted relatively comprehensively. Finally, entity recognition is performed on the fused text entity features based on the recognition network to obtain the entity types of the fused text entity features, so that the specific types of the named entity features can be accurately identified and the accuracy of named entity recognition for electronic medical records is improved.

The embodiments described in the embodiments of the present application are for more clearly describing the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application, and those skilled in the art can know that, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.

It will be appreciated by those skilled in the art that the solutions shown in fig. 1-7 are not limiting on the embodiments of the application and may include more or fewer steps than shown, or certain steps may be combined, or different steps.

The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.

The terms "first," "second," "third," "fourth," and the like in the description of the application and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "And/or" describes the association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A, only B, or both A and B, wherein A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of" or a similar expression means any combination of these items, including any combination of single items or plural items. For example, at least one of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b and c may be single or plural.

In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the above-described division of units is merely a logical function division, and there may be another division manner in actual implementation, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including multiple instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method of the various embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing a program.

The preferred embodiments of the present application have been described above with reference to the accompanying drawings, and are not thereby limiting the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims

1. The method for identifying the named entity of the electronic medical record is characterized by comprising the following steps:

2. The method for identifying a named entity of an electronic medical record according to claim 1, wherein the feature extraction of the sequence of medical record words through the first branch network of the preset named entity identification model to obtain text features of the medical record comprises:

3. The method for identifying a named entity of an electronic medical record according to claim 1, wherein performing fusion processing on the medical record text features and the pre-acquired vocabulary data based on the feature fusion network to obtain the fused text features comprises:

4. The method for identifying a named entity of an electronic medical record according to claim 3, wherein performing bilinear attention calculation on the medical record text features and the vocabulary data based on the feature fusion network to obtain a second attention calculation result comprises:

vectorizing the second feature matrix to obtain a bilinear feature vector;

normalizing the bilinear feature vector to obtain normalized features;

5. The method for identifying a named entity of an electronic medical record according to any one of claims 1 to 4, further comprising obtaining the vocabulary data, specifically comprising:

dividing the medical record text data to obtain text characters;

6. The method for identifying a named entity of an electronic medical record according to claim 5, wherein screening the target term from the plurality of candidate terms comprises:

acquiring part-of-speech categories of the candidate terms;

7. The method for identifying a named entity of an electronic medical record according to any one of claims 1 to 4, wherein performing entity recognition on the fused text entity features based on the recognition network to obtain the entity type of the fused text entity features comprises:

8. An electronic medical record named entity recognition device, characterized in that the device comprises:

9. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the method for identifying a named entity of an electronic medical record according to any one of claims 1 to 7.

10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method for identifying a named entity of an electronic medical record according to any one of claims 1 to 7.