CN113191151A - Medical named entity one-word multi-label recognition method and device and electronic equipment - Google Patents


Info

Publication number
CN113191151A
CN113191151A (Application CN202110617009.7A)
Authority
CN
China
Prior art keywords
word
grained
fine
named entity
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110617009.7A
Other languages
Chinese (zh)
Inventor
张瀚之
刘升平
梁家恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202110617009.7A priority Critical patent/CN113191151A/en
Publication of CN113191151A publication Critical patent/CN113191151A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/103Formatting, i.e. changing of presentation of documents
    • G06F40/117Tagging; Marking up; Designating a block; Setting of attributes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a medical named entity one-word multi-label recognition method, apparatus and electronic device in the technical field of deep learning. The method comprises the following steps: performing fine-grained word segmentation on a text to be recognized; performing information fusion on the fine-grained segments to obtain encoded word vectors; and outputting recognition information corresponding to the text to be recognized according to the encoded word vectors. The method enables Chinese fine-grained words to fuse context information and can solve the one-word multi-label problem of named entities in the medical field.

Description

Medical named entity one-word multi-label recognition method and device and electronic equipment
Technical Field
The embodiments of the present disclosure relate to the technical field of deep learning, and in particular to a medical named entity one-word multi-label recognition method, a corresponding apparatus, and an electronic device.
Background
Sequence labeling is a basic problem in NLP, and the quality of the predicted labels has a crucial influence on downstream tasks. Many natural language processing tasks can also be cast as sequence labeling problems, such as named entity recognition, word segmentation, part-of-speech tagging, and the like.
However, the one-word multi-label problem is frequently encountered when recognizing named entities in the medical field. Take the word "glucose": in "glucose solution was injected to treat the patient's hypoglycemia" it is labeled, together with "solution", as a drug; in "patient blood test: glucose 5.73 mmol/L" it is labeled as an examination index; and in "the patient drank glucose water on his own after exercise" it is not labeled at all. The meaning of the word itself barely changes, yet its label differs under different contexts.
Although existing LSTM models for sequence labeling can handle character ambiguity to some extent, they encode at the character level: the minimum coding unit is a single character and word-vector encoding is performed character by character, so a conventional LSTM model cannot properly resolve the word-level one-word multi-label problem in the medical field.
Disclosure of Invention
An object of the embodiments of the present disclosure is to provide a medical named entity one-word multi-label recognition method and apparatus, and an electronic device, so as to solve the problem that existing named entity recognition models handle word-level one-word multi-label cases poorly.
According to a first aspect of the present disclosure, there is provided a medical named entity one-word multi-label recognition method, including: performing fine-grained word segmentation on a text to be recognized; performing information fusion on the fine-grained segments to obtain encoded word vectors; and outputting recognition information corresponding to the text to be recognized according to the encoded word vectors.
Further, the fine-grained word segmentation of the text to be recognized includes: performing fine-grained word segmentation on the text to be recognized by using a bigram model and a named-entity lexicon, splitting the text to be recognized into a plurality of unit phrases; wherein each unit phrase comprises at least one character.
Further, the named-entity lexicon comprises a medical professional knowledge base.
Further, performing information fusion on the fine-grained segments to obtain encoded word vectors includes: mapping each fine-grained segment to a vector space by using a character-level language model to obtain the encoded word vector corresponding to each fine-grained segment.
Further, mapping the fine-grained segments to the vector space, i.e., fusing their information, includes: concatenating the forward hidden state of the last character and the backward hidden state of the first character of each unit phrase, so as to fuse each fine-grained segment's information from the context with its own internal information.
Further, outputting the recognition information corresponding to the text to be recognized according to the encoded word vectors includes: labeling the encoded word vectors with a sequence labeling model, and outputting a label prediction for each encoded word vector.
According to a second aspect of the present disclosure, there is also provided a medical named entity one-word multi-label recognition apparatus, including: a word segmentation module, configured to perform fine-grained word segmentation on a text to be recognized; an encoded word vector module, configured to perform information fusion on the fine-grained segments to obtain encoded word vectors; and a recognition module, configured to output recognition information corresponding to the text to be recognized according to the encoded word vectors.
According to a third aspect of the present disclosure, there is also provided an electronic device comprising a memory for storing a computer program and a processor; the processor is adapted to execute the computer program to implement the method according to the first aspect of the present disclosure.
According to a fourth aspect of the present disclosure, there is also provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method according to the first aspect of the present disclosure.
The beneficial effect is that a bigram model fused with a medical professional knowledge base performs fine-grained word segmentation on the text, and information fusion of the fine-grained segments yields encoded word vectors, so that Chinese fine-grained words fuse both context information and the internal information of the phrase, which solves the one-word multi-label problem in the medical field.
Other features of embodiments of the present disclosure and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which is to be read in connection with the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the embodiments of the disclosure.
FIG. 1 is a flow chart of a medical named entity one-word multi-label recognition method;
FIG. 2 is a schematic diagram of the difficulty of migrating the prior art from English to Chinese and of the solution of this embodiment;
FIG. 3 is a schematic diagram of the word-vector encoding process according to the present embodiment;
FIG. 4 is an overall architectural diagram of the named entity recognition process according to the present embodiment;
FIG. 5 is a schematic structural diagram of a medical named entity one-word multi-label recognition apparatus;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
One application scenario of the embodiments of the present disclosure is the one-word multi-label situation for named entities in the medical field. For example, in "glucose solution was injected to treat the patient's hypoglycemia" the word "glucose" is labeled, together with "solution", as a drug; in "patient blood test: glucose 5.73 mmol/L" it is an examination index; and in "the patient drank glucose water on his own after exercise" it is not labeled, even though the meaning of the word itself barely changes across contexts. Similarly, "24h" is labeled as a time in "the patient's symptoms were relieved within 24h", while "24h electrocardiogram" is labeled as examination content in "a 24h electrocardiogram detected an abnormality"; the meanings differ under different contexts. The prior art, however, performs word-vector encoding at the character level; it can handle character ambiguity to a certain extent but cannot solve word-level one-word multi-label cases, especially the boundary problem.
Referring to fig. 2(a), fig. 2(a) is a schematic diagram of word-vector encoding performed on an English sentence by a conventional model. In English, words are separated by spaces and a single word can express a complete meaning; moreover, the interior of an English word changes with context (tense, singular/plural, and the like), so the technique can fuse context information into the word representation. As shown in fig. 2(a), by recognizing "George Washington cut the tree", a fairly accurate label can be predicted for each word.
Chinese, however, is structured differently: characters are composed of strokes/radicals, characters form words, words form phrases, and phrases form sentences, which makes such a model difficult to apply in the Chinese domain. Referring to fig. 2(b), a schematic diagram of a conventional model recognizing a Chinese sentence: if a Chinese character is treated as an English word, with a stroke or pinyin letter corresponding to a letter, the interior of the Chinese character does not change with context the way an English word does, so the model cannot fuse context and internal word information into the word vector as it does in the English domain. The prior-art BiLSTM maps word vectors character by character and can only predict a label for a single character; it cannot predict labels well when one word carries multiple labels, so conventional models recognize one-word multi-label cases in the Chinese medical field poorly.
In view of the above technical problems, the inventors propose a medical named entity one-word multi-label recognition method. Referring to fig. 1, the method includes the following steps S1-S3.
S1, performing fine-grained word segmentation on the text to be recognized;
In this embodiment, a bigram model fused with a named-entity lexicon is adopted. Fine-grained word segmentation is first performed on the text to be recognized, splitting it into a plurality of unit phrases according to the named-entity lexicon.
In a feasible example, assuming the text to be recognized is "self-drinking glucose solution", it is split into the four unit phrases "self, drinking, glucose, solution" according to the named-entity lexicon.
Each unit phrase contains at least one character; that is, a unit phrase can be a single character or a word composed of several characters.
In this embodiment, the named-entity lexicon includes a medical professional knowledge base; when the method is applied in other fields, the lexicon may be another professional knowledge base, which is not limited here.
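As a minimal sketch of how the fine-grained segmentation of step S1 could be implemented (assuming a pre-built named-entity lexicon and a bigram log-probability table; the names segment, bigram_logp and oov_logp are illustrative and not taken from the patent), the lexicon-constrained bigram search can be written as a Viterbi-style dynamic program over candidate unit phrases:

```python
import math
from typing import Dict, List, Tuple

def segment(text: str,
            lexicon: set,
            bigram_logp: Dict[Tuple[str, str], float],
            max_word_len: int = 8,
            oov_logp: float = -12.0) -> List[str]:
    """Split `text` into unit phrases: multi-character candidates must come
    from the named-entity lexicon, single characters are always allowed, and
    the split maximizing the sum of bigram log-probabilities wins."""
    n = len(text)
    best: List[Tuple[float, List[str]]] = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for i in range(n):
        score_i, seg_i = best[i]
        if score_i == -math.inf:
            continue
        prev = seg_i[-1] if seg_i else "<s>"      # previous unit phrase
        for j in range(i + 1, min(i + max_word_len, n) + 1):
            cand = text[i:j]
            if len(cand) > 1 and cand not in lexicon:
                continue
            step = bigram_logp.get((prev, cand), oov_logp)
            if score_i + step > best[j][0]:
                best[j] = (score_i + step, seg_i + [cand])
    return best[n][1]

# e.g. segment("自饮葡萄糖溶液", {"葡萄糖", "溶液"}, {})
#      -> ['自', '饮', '葡萄糖', '溶液']  ("self, drinking, glucose, solution")
```

With an empty bigram table the search simply prefers the split with the fewest units, so lexicon entries such as "葡萄糖" are kept whole rather than broken into single characters.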
S2, performing information fusion on the fine-grained segments to obtain encoded word vectors;
In the present embodiment, each fine-grained segment obtained in step S1 is mapped to a vector space by a character-level language model, so as to obtain the encoded word vector corresponding to each fine-grained segment.
The character-level language model in this embodiment can be realized with a Bi-LSTM model. Because the Bi-LSTM fuses the semantic information of the context, it maps phrases to the vector space dynamically, so the word-ambiguity problem can be handled well. Moreover, the trained character-level language model encodes word vectors dynamically for different inputs and can be conveniently embedded into downstream tasks.
Specifically, the character-level language model can be trained as a sub-network, where the joint probability of the character sequence is defined as the product of the conditional probabilities of the individual characters:

P(x_{0:T}) = \prod_{t=0}^{T} P(x_t \mid x_{0:t-1})

where:

P(x_t \mid x_{0:t-1}) = \mathrm{softmax}(V h_t + b)

h_t(x_{0:t-1}) = f_h(x_{t-1}, h_{t-1}, c_{t-1}; \theta)

c_t(x_{0:t-1}) = f_c(x_{t-1}, h_{t-1}, c_{t-1}; \theta)

The probability distribution of each character is determined by the softmax function, where \theta denotes the model parameters, i.e., the parameters V and b of the softmax, and h_t and c_t are the hidden state and cell state output by the LSTM for each character.
From the above formulas, the hidden state h_t of each character is obtained. Referring to fig. 3, when forming the encoded word vector, the forward hidden state of the last character and the backward hidden state of the first character of each unit phrase are concatenated, so that each fine-grained segment fuses its information from the context with its own internal information. This handles the common one-word multi-label problem in the medical field well, and also copes better with out-of-vocabulary and rare words arising from physicians' shorthand, new drugs, and the like.
Specifically, the forward hidden state \overrightarrow{h}_{e_i} corresponding to the last character of each unit phrase is concatenated with the backward hidden state \overleftarrow{h}_{b_i} corresponding to the first character of that unit phrase to obtain

w_i = [\overrightarrow{h}_{e_i} ; \overleftarrow{h}_{b_i}]

where \overrightarrow{h} denotes a forward hidden state, \overleftarrow{h} denotes a backward hidden state, and w_i denotes the fused vector. The semantic information implied by each unit phrase confirms that the phrase fits the context, thereby ensuring that the meaning of each unit phrase within the whole sentence is accurate.
As shown in the example of fig. 4, the character-level language model concatenates the last-character forward hidden state and the first-character backward hidden state of each unit phrase in "self, drinking, glucose, solution", so as to obtain the encoded word vector of each fine-grained phrase in this sentence.
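A sketch of this boundary splicing (tensor shapes and names are illustrative assumptions; h_fwd and h_bwd would come from the forward and backward character language models above):

```python
import torch

def word_vectors_from_char_states(h_fwd: torch.Tensor,
                                  h_bwd: torch.Tensor,
                                  spans: list) -> torch.Tensor:
    """h_fwd, h_bwd: (seq_len, hidden) character hidden states of the forward
    and backward language models; spans: [(start, end), ...] character ranges
    (end exclusive) of the unit phrases from the fine-grained segmentation.
    The forward state of the last character is concatenated with the backward
    state of the first character of each phrase."""
    vecs = [torch.cat([h_fwd[end - 1], h_bwd[start]], dim=-1)
            for start, end in spans]
    return torch.stack(vecs)                 # (num_phrases, 2 * hidden)

# "自 | 饮 | 葡萄糖 | 溶液" over 7 characters -> spans [(0, 1), (1, 2), (2, 5), (5, 7)]
```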
S3, outputting recognition information corresponding to the text to be recognized according to the encoded word vectors.
This embodiment labels the encoded word vectors from S2 using a sequence labeling model and outputs a label prediction for each encoded word vector. The sequence labeling model may be an LSTM+CRF model.
Referring to fig. 4, when the sequence labeling model of this embodiment recognizes "self-drinking glucose solution", the character-level language model first performs information fusion on each fine-grained phrase in "self, drinking, glucose, solution" to obtain the encoded word vectors of these phrases in the sentence. The sequence labeling model then processes the encoded word vector of each unit phrase, recognizing "glucose" as B-drug, "solution" as E-drug, and "self" and "drinking" as O, and outputs the result as the label prediction for each fine-grained segment. In this way, the one-word multi-label problem in the medical field can be solved.
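A possible shape of such a sequence labeling stage, sketched with the third-party pytorch-crf package standing in for the CRF layer (the package choice, layer sizes and names are assumptions; the patent only specifies LSTM+CRF):

```python
import torch.nn as nn
from torchcrf import CRF   # pytorch-crf, one possible CRF implementation

class SequenceTagger(nn.Module):
    """BiLSTM over the encoded word vectors followed by a CRF layer."""

    def __init__(self, input_dim: int, num_tags: int, hidden_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, bidirectional=True,
                            batch_first=True)
        self.emit = nn.Linear(2 * hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, word_vecs, tags=None):
        # word_vecs: (batch, num_phrases, input_dim) encoded word vectors
        emissions = self.emit(self.lstm(word_vecs)[0])
        if tags is not None:                    # training: negative log-likelihood
            return -self.crf(emissions, tags)
        return self.crf.decode(emissions)       # inference: best tag sequence per sentence

# Tags such as O / B-drug / E-drug are mapped to integer ids beforehand.
```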
The embodiment of the present invention further provides a medical named entity one-word multi-label recognition apparatus 200, referring to fig. 5, including:
the word segmentation module 201 is configured to perform fine-grained word segmentation on a text to be recognized, and is configured to solve the method in S1 in the foregoing embodiment, and details are not described here again to avoid repetition.
The encoding word vector module 202 is configured to perform information fusion on the fine-grained word segmentation to obtain an encoding word vector, and is used to solve the method related to S2 in the foregoing embodiment, and details are not described here again to avoid repetition.
The identifying module 203 is configured to output identification information corresponding to the text to be identified according to the encoding word vector, and is configured to solve the method in S3 in the foregoing embodiment, and details are not repeated here to avoid repetition.
Referring to fig. 2(c), in the embodiment of the present invention a bigram model fused with a medical professional knowledge base performs fine-grained word segmentation on the text. A Chinese character is mapped to an English letter, a fine-grained word to an English word, the phrase formed by combining several fine-grained words to an English phrase, and the phrases to a sentence. Because the interior of a Chinese fine-grained word now exhibits character changes driven by context, the fine-grained word acquires the characteristics that English words have in the original technique, achieving the effect of fusing context information with the phrase's own internal information, so the one-word multi-label problem in the Chinese medical field can be solved.
An electronic device 400 according to an embodiment of the present invention is provided. Referring to fig. 6, the electronic device includes a memory 402 and a processor 401; the memory 402 is used to store a computer program, and the processor 401 is adapted to execute the computer program to implement the medical named entity one-word multi-label recognition method.
The modules of the electronic device may be implemented in this embodiment by the processor executing a computer program stored in the memory, or by other circuit structures, which is not limited here.
The embodiment of the invention also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the medical named entity one-word multi-label recognition method.
A skilled person can design a computer program according to the solution of the embodiments of the present disclosure. How the computer program controls the processor to operate is well known in the art and will not be described in detail here.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of computer-readable program instructions, which can execute the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (9)

1. A medical named entity one-word multi-label recognition method is characterized by comprising the following steps:
performing fine-grained word segmentation on a text to be recognized;
performing information fusion on the fine-grained segments to obtain encoded word vectors;
and outputting recognition information corresponding to the text to be recognized according to the encoded word vectors.
2. The medical named entity one-word multi-label recognition method according to claim 1, wherein the fine-grained word segmentation of the text to be recognized comprises:
performing fine-grained word segmentation on the text to be recognized by using a bigram model and a named-entity lexicon, splitting the text to be recognized into a plurality of unit phrases;
wherein each unit phrase comprises at least one character.
3. The method according to claim 2, wherein the named-entity lexicon comprises a medical professional knowledge base.
4. The medical named entity one-word multi-label recognition method according to claim 1, wherein performing information fusion on the fine-grained segments to obtain encoded word vectors comprises:
mapping each fine-grained segment to a vector space by using a character-level language model to obtain the encoded word vector corresponding to each fine-grained segment.
5. The medical named entity one-word multi-label recognition method according to claim 4, wherein performing information fusion on the fine-grained segments to obtain encoded word vectors further comprises:
concatenating the forward hidden state of the last character and the backward hidden state of the first character of each unit phrase, so as to fuse each fine-grained segment's information from the context with its own internal information.
6. The medical named entity one-word multi-label recognition method according to claim 1, wherein outputting the recognition information corresponding to the text to be recognized according to the encoded word vectors comprises:
labeling the encoded word vectors with a sequence labeling model, and outputting a label prediction for each encoded word vector.
7. A medical named entity word multi-label recognition device, comprising:
a word segmentation module, configured to perform fine-grained word segmentation on a text to be recognized;
an encoded word vector module, configured to perform information fusion on the fine-grained segments to obtain encoded word vectors;
and a recognition module, configured to output recognition information corresponding to the text to be recognized according to the encoded word vectors.
8. An electronic device comprising a memory and a processor, the memory for storing a computer program; the processor is adapted to execute the computer program to implement the method according to any of claims 1-6.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-6.
CN202110617009.7A 2021-06-02 2021-06-02 Medical named entity one-word multi-label recognition method and device and electronic equipment Pending CN113191151A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110617009.7A CN113191151A (en) 2021-06-02 2021-06-02 Medical named entity one-word multi-label recognition method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110617009.7A CN113191151A (en) 2021-06-02 2021-06-02 Medical named entity one-word multi-label recognition method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN113191151A true CN113191151A (en) 2021-07-30

Family

ID=76975846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110617009.7A Pending CN113191151A (en) 2021-06-02 2021-06-02 Medical named entity one-word multi-label recognition method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113191151A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109872775A (en) * 2019-02-21 2019-06-11 北京迈迪培尔信息技术有限公司 A kind of document mask method, device, equipment and computer-readable medium
US20200251097A1 (en) * 2018-08-30 2020-08-06 Boe Technology Group Co., Ltd. Named entity recognition method, named entity recognition equipment and medium
CN112632981A (en) * 2019-09-24 2021-04-09 北京京东尚科信息技术有限公司 New word discovery method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200251097A1 (en) * 2018-08-30 2020-08-06 Boe Technology Group Co., Ltd. Named entity recognition method, named entity recognition equipment and medium
CN109872775A (en) * 2019-02-21 2019-06-11 北京迈迪培尔信息技术有限公司 A kind of document mask method, device, equipment and computer-readable medium
CN112632981A (en) * 2019-09-24 2021-04-09 北京京东尚科信息技术有限公司 New word discovery method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张婧: "面向中文社交媒体评论的词法分析" [Lexical Analysis for Chinese Social Media Comments], vol. 1, Dongbei University of Finance and Economics Press, pages 8-9 *
申站: "基于神经网络的中文电子病历命名实体识别" [Named Entity Recognition in Chinese Electronic Medical Records Based on Neural Networks], China Master's Theses Full-text Database, no. 10, pages 6-7 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination