WO2019015369A1 - Method and apparatus for identifying medical entities in medical text - Google Patents

Method and apparatus for identifying medical entities in medical text

Info

Publication number
WO2019015369A1
Authority
WO
WIPO (PCT)
Prior art keywords
tag
label
target word
word
words
Prior art date
Application number
PCT/CN2018/084214
Other languages
English (en)
French (fr)
Inventor
张振中
Original Assignee
京东方科技集团股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 京东方科技集团股份有限公司
Priority to EP18826482.4A (EP3657359A4)
Priority to JP2018567241A (JP7043429B2)
Priority to US16/316,468 (US11586809B2)
Publication of WO2019015369A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/169 Annotation, e.g. comment data or footnotes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/268 Morphological analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • The present disclosure relates to the field of medical data processing and, in particular, to methods and apparatus for identifying medical entities in medical text.
  • Medical entities typically contain words related to drugs, problems (including diseases and symptoms), examinations and treatments. Medical entities include continuous medical entities (medical entities consisting of consecutive words) and non-continuous medical entities (medical entities consisting of non-continuous words).
  • Embodiments described herein provide a method and apparatus for identifying medical entities in medical text.
  • A method for identifying a medical entity in a medical text is provided.
  • The medical text is divided into a plurality of words.
  • Each of the plurality of words is used in turn as the target word.
  • A local annotation feature and a global annotation feature of the target word are determined, wherein the local annotation feature includes the target word, and the global annotation feature includes the relationship between the target word and the identified medical entities.
  • Based on the local annotation feature and the global annotation feature, the tag of the target word is determined from a plurality of candidate tags.
  • Based on the tag of the target word, a combination relationship between the target word and the words located before the target word is obtained, the combination relationship being either combination or non-combination.
  • The combined words are then identified as a medical entity based on the combination relationship.
  • The plurality of candidate tags include a first tag, a second tag, a third tag, a fourth tag, and a fifth tag.
  • The first tag indicates that the word is a shared beginning portion of a medical entity.
  • The second tag indicates that the word is a non-shared beginning portion of a medical entity.
  • The third tag indicates that the word is a continuous part of a medical entity.
  • The fourth tag indicates that the word is part of a non-medical entity and that the operation of identifying the medical entity is to be performed.
  • The fifth tag indicates that the word is part of a non-medical entity and that the operation of identifying the medical entity is not performed.
  • Based on the local annotation feature and the global annotation feature of the target word, a probability that the candidate tag is the tag of the target word is calculated for each of the plurality of candidate tags. The candidate tag having the greatest probability is then determined as the tag of the target word.
  • The probability is calculated using a maximum entropy model.
  • If the tag is the first tag, the tag is not combined with its previous tag; if the tag is the third tag, the tag is combined with the previous first tag, second tag, or third tag.
  • If the tag is the second tag, then: a combined feature of the target word and a combined feature of the word, preceding the target word, that has a first tag or a second tag are determined, wherein the combined feature includes the morphemes contained in the corresponding word; and a tag combination probability and a tag non-combination probability are calculated based on these two combined features.
  • If the tag combination probability is greater than the tag non-combination probability, the second tag is combined with the previous first tag or second tag.
  • If the tag combination probability is not greater than the tag non-combination probability, the second tag is not combined with the previous first tag or second tag.
  • The local annotation feature further includes the X words preceding the target word and the X words following the target word, where X is a natural number.
  • The local annotation feature further includes the part of speech of the target word, the parts of speech of the X words preceding the target word, and the parts of speech of the X words following the target word.
  • The global annotation feature further includes the relationship between the Y words preceding the target word and the identified medical entities and the relationship between the Y words following the target word and the identified medical entities, where Y is a natural number.
  • The global annotation feature includes: whether the target word is included in an identified medical entity, whether the Y words preceding the target word are included in an identified medical entity, and whether the Y words following the target word are included in an identified medical entity.
  • The combined feature further includes the morphemes contained in the Z words preceding the corresponding word and the morphemes contained in the Z words following the corresponding word, where Z is a natural number.
  • The combined feature includes: the morphemes contained in the corresponding word, the Z words preceding the corresponding word, the Z words following the corresponding word, the morphemes contained in the Z preceding words, and the morphemes contained in the Z following words.
  • The label combination probability and the label non-combination probability are calculated using a maximum entropy model.
  • The maximum entropy model is obtained by training with an optimization algorithm.
  • An apparatus for identifying a medical entity in a medical text includes at least one processor and at least one memory storing a computer program.
  • When executed by the at least one processor, the computer program causes the apparatus to: divide the medical text into a plurality of words; and, using each of the plurality of words in turn as a target word, perform the following operations on the target word: determine a local annotation feature and a global annotation feature of the target word, wherein the local annotation feature includes the target word and the global annotation feature includes the relationship between the target word and the identified medical entities; determine a tag of the target word from a plurality of candidate tags based on the local annotation feature and the global annotation feature; obtain, based on the tag of the target word, a combination relationship between the target word and the words before the target word, the combination relationship including combination and non-combination; and identify the combined words as a medical entity according to the combination relationship.
  • A computer-readable storage medium storing a computer program is provided.
  • The computer program, when executed by a processor, implements the steps described above for identifying a medical entity in a medical text.
  • FIG. 1 is a flowchart of a method for identifying a medical entity in a medical text, in accordance with an embodiment of the present disclosure;
  • FIG. 2 is an exemplary flowchart of a process of determining a tag of the target word from a plurality of candidate tags in the embodiment shown in FIG. 1;
  • FIG. 3 is an exemplary flowchart specifically describing a process of determining a combination relationship of words in the embodiment shown in FIG. 1;
  • FIG. 4 is a schematic block diagram of an apparatus for identifying a medical entity in a medical text, in accordance with an embodiment of the present disclosure.
  • Embodiments of the present disclosure describe the method of medical entity identification using an English medical text as an example. However, those skilled in the art should understand that methods and apparatus for identifying medical entities in medical text in other languages (e.g., Chinese) according to embodiments of the present disclosure also fall within the scope of the present disclosure.
  • FIG. 1 illustrates a flow chart of a method for identifying a medical entity in a medical text, in accordance with an embodiment of the present disclosure.
  • The medical text is divided into a plurality of words.
  • Medical text can be divided into multiple words based on the spaces between words.
  • The input medical text can be lexically analyzed using natural language processing techniques to divide the medical text into a plurality of words.
  • Lexical analysis algorithms and tools, such as a conditional random field algorithm and the word segmentation tool stanford-segmenter provided by Stanford University, may be employed to pre-process the medical text.
  • Punctuation marks are also treated as words.
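The space-based splitting described above, with punctuation kept as separate words, can be sketched in a few lines. This is a minimal illustration only: the function name `split_into_words` and the regular expression are assumptions, not part of the disclosure, and a real system might instead use a lexical analysis tool such as the one named above.

```python
import re


def split_into_words(text: str) -> list[str]:
    """Split text on whitespace and keep each punctuation mark as its own word."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


words = split_into_words(
    "EGD showed hiatal hernia and laceration in distal esophagus."
)
print(words)
```

Applied to the example sentence used later in the description, this yields ten words, the last of which is the full stop.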
  • In step S104, each of the plurality of words is used in turn as the target word, and for each target word, the local annotation feature and the global annotation feature are determined.
  • The local annotation feature may include the target word.
  • The local annotation feature may also include the X words preceding the target word and the X words following the target word.
  • The local annotation feature may further include the part of speech of the target word, the parts of speech of the X words preceding the target word, and the parts of speech of the X words following the target word.
  • X is a natural number.
  • For example, the local annotation feature includes: the target word, the three words preceding the target word, the three words following the target word, the part of speech of the target word, the parts of speech of the three preceding words, and the parts of speech of the three following words.
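A window-based local annotation feature of this kind might be assembled as follows. This is a sketch under stated assumptions: the `NULL` marker, the dictionary layout, the key names, and the part-of-speech values are all hypothetical; the description only requires that missing positions be represented by a null symbol.

```python
NULL = "<null>"  # hypothetical null symbol for positions that do not exist


def local_annotation_feature(words, pos_tags, t, X=3):
    """Collect the target word, its X neighbours on each side, and their parts of speech."""
    def at(seq, i):
        return seq[i] if 0 <= i < len(seq) else NULL

    feature = {"word": words[t], "pos": pos_tags[t]}
    for k in range(1, X + 1):
        feature[f"word-{k}"] = at(words, t - k)
        feature[f"word+{k}"] = at(words, t + k)
        feature[f"pos-{k}"] = at(pos_tags, t - k)
        feature[f"pos+{k}"] = at(pos_tags, t + k)
    return feature


words = ["EGD", "showed", "hiatal", "hernia", "and",
         "laceration", "in", "distal", "esophagus", "."]
# Illustrative part-of-speech tags only; a real tagger would supply these.
pos_tags = ["NN", "VBD", "JJ", "NN", "CC", "NN", "IN", "JJ", "NN", NULL]

feature = local_annotation_feature(words, pos_tags, t=0)
print(feature["word-1"], feature["word+1"])
```

For the first word of a sentence, every "preceding" slot is filled with the null symbol, which keeps the feature the same size for every target word.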
  • The global annotation feature can include the relationship of the target word to the identified medical entities.
  • The identified medical entities may include medical entities that have been identified in the current medical text, and may also include medical entities that have been identified in other medical texts.
  • The global annotation feature may further include the relationship between the Y words preceding the target word and the identified medical entities and the relationship between the Y words following the target word and the identified medical entities.
  • Y is a natural number.
  • For example, the global annotation feature includes whether the target word is included in an identified medical entity, whether the word preceding the target word is included in an identified medical entity, and whether the word following the target word is included in an identified medical entity.
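The membership checks that make up the global annotation feature can be illustrated as below. The representation of identified entities as tuples of words, the case-insensitive comparison, and all names are assumptions made for this sketch.

```python
NULL = "<null>"  # hypothetical null symbol for positions that do not exist


def global_annotation_feature(words, t, identified_entities, Y=1):
    """Record whether the target word and its Y neighbours occur in an identified entity."""
    # Assumption: a word "is included" if it appears in any identified entity.
    known = {w.lower() for entity in identified_entities for w in entity}

    def in_entity(i):
        if not 0 <= i < len(words):
            return NULL
        return words[i].lower() in known

    feature = {"word": in_entity(t)}
    for k in range(1, Y + 1):
        feature[f"word-{k}"] = in_entity(t - k)
        feature[f"word+{k}"] = in_entity(t + k)
    return feature


words = ["EGD", "showed", "hiatal", "hernia", "and",
         "laceration", "in", "distal", "esophagus", "."]
identified = [("hiatal", "hernia")]  # entities found so far

feature = global_annotation_feature(words, t=3, identified_entities=identified)
print(feature)
```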
  • In step S106, the label of the target word is determined from the plurality of candidate tags based on the local annotation feature and the global annotation feature of the target word.
  • The plurality of candidate tags may include, for example, a first tag HB, a second tag DB, a third tag I, a fourth tag OY, and a fifth tag ON.
  • The first tag HB is used to indicate that the word is a shared beginning portion of a medical entity.
  • The second tag DB is used to indicate that the word is a non-shared beginning portion of the medical entity.
  • The third tag I is used to indicate that the word is a continuous part of the medical entity.
  • The fourth tag OY is used to indicate that the word is part of a non-medical entity and that the operation of identifying the medical entity is to be performed.
  • The fifth tag ON is used to indicate that the word is part of a non-medical entity and that the operation of identifying the medical entity is not performed.
  • The probability of each of the plurality of candidate tags may be obtained by a maximum entropy model based on the local annotation feature and the global annotation feature of the target word.
  • The maximum entropy model can be expressed, for example, as formula (1), where:
  • $w_i$ denotes an N-dimensional column vector of parameters, $1 \le i \le K$, $K$ denotes the total number of tags, $x$ denotes the N-dimensional feature vector of the target word, $c_i$ denotes the i-th tag (in the present embodiment, $c_1$ denotes the first tag, $c_2$ the second tag, and so on), and $p(c_i \mid x)$ denotes the probability that the tag of the target word is $c_i$ given $x$.
  • $x$ includes the local annotation features and the global annotation features of the target word. The number of parameters in the local annotation feature and the global annotation feature determines the dimension N of $x$. Where one or more parameters in the local annotation feature and the global annotation feature do not exist, they are represented by a null symbol.
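As an illustration of how such a model assigns probabilities, the sketch below assumes the common multinomial (softmax) form of a maximum entropy classifier, $p(c_i \mid x) = \exp(w_i^\top x) / \sum_{k=1}^{K} \exp(w_k^\top x)$. The exact form of formula (1), the numeric encoding of the features into $x$, and the toy values of `W` and `x` are assumptions rather than details taken from the disclosure.

```python
import math

TAGS = ["HB", "DB", "I", "OY", "ON"]  # c_1 .. c_5


def maxent_probabilities(W, x):
    """Return p(c_i | x) for every row w_i of W, assuming a softmax form."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in W]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy parameters: K = 5 tags, N = 3 feature dimensions.
W = [
    [0.2, 0.1, 0.0],
    [1.5, 0.3, 0.5],
    [0.1, 0.9, 0.2],
    [0.0, 0.0, 0.1],
    [0.3, 0.2, 0.1],
]
x = [1.0, 0.0, 1.0]

probs = maxent_probabilities(W, x)
best_tag = TAGS[probs.index(max(probs))]
print(best_tag)
```

The tag with the largest probability is the one that would be assigned to the target word.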
  • $w_i$ can be trained by minimizing the objective function (2), where:
  • $W \in \mathbb{R}^{K \times N}$ denotes the parameter matrix;
  • the i-th row vector of $W$ is denoted $w_i$;
  • $g_j$ denotes the label corresponding to the j-th training word;
  • $p(g_j)$ denotes the probability of the label $g_j$ for the j-th training word;
  • $M$ is the number of training words;
  • $\lambda$ is the coefficient of the L2 regularization term, $\lambda > 0$.
  • A parameter matrix $W$ with initial values may be set in advance, and the $x$ corresponding to the M training words may be substituted into formula (1) to obtain $p(g_j)$ for the M training words. An optimization algorithm is then used to obtain the updated $W$. The process of updating $W$ is repeated until the values of the elements of $W$ become stable, at which point the training of $w_i$ ends. The $W$ thus obtained is used in formula (1).
  • The optimization algorithm may be the mini-batch AdaGrad algorithm.
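The training loop can be sketched with a small pure-Python mini-batch AdaGrad. Several things here are assumptions: the objective is taken to be the mean negative log-likelihood plus $(\lambda/2)\lVert W \rVert^2$, the model is taken to be a softmax, and the hyperparameters (`lr`, `batch_size`, `epochs`, `seed`) and toy data are purely illustrative.

```python
import math
import random


def probs(W, x):
    """Softmax probabilities for each row of W (assumed form of the model)."""
    s = [sum(w * v for w, v in zip(row, x)) for row in W]
    top = max(s)
    e = [math.exp(v - top) for v in s]
    z = sum(e)
    return [v / z for v in e]


def objective(W, X, g, lam):
    """Assumed objective: mean negative log-likelihood + (lam / 2) * ||W||^2."""
    nll = -sum(math.log(probs(W, x)[gj]) for x, gj in zip(X, g)) / len(X)
    return nll + 0.5 * lam * sum(w * w for row in W for w in row)


def train_adagrad(X, g, K, lam=0.01, lr=0.5, batch_size=2, epochs=50, seed=0):
    rng = random.Random(seed)
    N = len(X[0])
    W = [[0.0] * N for _ in range(K)]      # initial parameter matrix
    hist = [[0.0] * N for _ in range(K)]   # accumulated squared gradients
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            # Gradient of the regulariser, then of the batch log-likelihood.
            grad = [[lam * W[i][n] for n in range(N)] for i in range(K)]
            for j in batch:
                p = probs(W, X[j])
                for i in range(K):
                    err = (p[i] - (1.0 if i == g[j] else 0.0)) / len(batch)
                    for n in range(N):
                        grad[i][n] += err * X[j][n]
            # AdaGrad: scale each step by the root of the gradient history.
            for i in range(K):
                for n in range(N):
                    hist[i][n] += grad[i][n] ** 2
                    W[i][n] -= lr * grad[i][n] / (math.sqrt(hist[i][n]) + 1e-8)
    return W


# Toy training set: M = 4 training words, N = 2 features, K = 2 labels.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
g = [0, 1, 0, 1]
W0 = [[0.0, 0.0], [0.0, 0.0]]
W = train_adagrad(X, g, K=2)
print(objective(W0, X, g, 0.01), objective(W, X, g, 0.01))
```

In practice the loop would stop once the elements of `W` stabilise, as described above, rather than after a fixed number of epochs.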
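  • In step S204, the probability $p(c_i \mid x)$ is calculated for each of the plurality of candidate tags.
  • In step S206, the candidate tag $c_i$ having the maximum probability $p(c_i \mid x)$ is determined as the tag of the target word.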
  • In step S108, based on the tag of the target word, the combination relationship between the target word and the words located before the target word is obtained.
  • The combination relationship includes combination and non-combination.
  • Fig. 3 is a more specific example of the embodiment shown in Fig. 1, showing in detail a process of determining a combination relationship of words (S108).
  • In step S302, it is determined whether the current tag is the fourth tag OY.
  • The fourth tag OY indicates that the target word is part of a non-medical entity and that the operation of identifying the medical entity is to be performed. If the tag of the target word indicates this operation, the combination of the tags preceding the tag of the target word, each of which indicates that its word can be part of a medical entity, is determined. The tags indicating that the corresponding word can be part of a medical entity include, for example, the first tag HB, the second tag DB, and the third tag I. If it is determined that the current tag is not the fourth tag OY (NO in step S302), the process returns to step S106 to continue determining the tag of the next word.
  • If it is determined that the current tag is the fourth tag OY (YES in step S302), the combination of the tags before the fourth tag OY is determined.
  • In step S304, the previous label of the current label is taken as the current label. For example, if the current label is the fourth label OY, the previous label of the fourth label OY is taken as the current label.
  • In step S306, it is determined whether the current tag is the third tag I.
  • The third label I indicates that its corresponding word is a continuous part of the medical entity, so if it is determined that the current label is the third label I (YES in step S306), the current label and the previous label of the current label are combined in step S308.
  • The previous label is one of the first label HB, the second label DB, and the third label I.
  • The first label HB indicates that its corresponding word is the first part of the medical entity. Only one first tag HB may exist in a medical entity.
  • In step S320, it is determined whether the tag combined with the current tag is the first tag HB, in which case the tag combination can be considered complete. Therefore, if it is determined that the tag combined with the current tag is the first tag HB (YES in step S320), then in step S324 the combination relationship of the words corresponding to the tags is determined according to the combination of the tags. If it is determined that the tag combined with the current tag is not the first tag HB (NO in step S320), the process returns to step S304 to continue processing the previous tag.
  • If it is determined that the current tag is not the third tag I (NO in step S306), the process proceeds to step S310 to determine whether the current tag is the second tag DB. If it is determined that the current tag is not the second tag DB (NO in step S310), the process returns to step S304 to continue processing the previous tag.
  • The second tag DB indicates that its corresponding word is the non-shared beginning of the medical entity.
  • Such a word may be the first part of the medical entity (i.e., it does not need to be combined with a previous first label HB or second label DB), or it may not be the first part of the medical entity (i.e., it needs to be combined with a previous first label HB or second label DB). Therefore, if the tag is the second tag DB, it is necessary to determine whether to combine the second tag DB with the previous first tag HB or second tag DB.
  • In step S312, the combined feature of the word corresponding to the second tag DB (i.e., the target word) and the combined feature of the word corresponding to the previous first tag HB or second tag DB are determined.
  • The combined feature may include the morphemes contained in the corresponding word.
  • The morphemes here refer to Chinese characters.
  • The combined feature may further include the morphemes contained in the Z words preceding the corresponding word and the morphemes contained in the Z words following the corresponding word.
  • Z is a natural number.
  • For example, the combined feature includes: the morphemes contained in the corresponding word, the three words preceding the corresponding word, the three words following the corresponding word, the morphemes contained in the three preceding words, and the morphemes contained in the three following words.
  • In step S314, the label combination probability and the label non-combination probability are calculated based on the combined feature of the word corresponding to the second tag DB and the combined feature of the word corresponding to the previous first tag HB or second tag DB.
  • The label combination probability and the label non-combination probability may be calculated with the maximum entropy model (i.e., using formula (1)), based on the combined feature of the word corresponding to the second tag DB and the combined feature of the word corresponding to the previous first tag HB or second tag DB.
  • Here, $w_i$ denotes an N-dimensional column vector of parameters (the elements of $W$, its number of rows $K$, and its number of columns $N$ may differ between the tagging process and the tag combination process).
  • $x$ denotes the N-dimensional feature vector corresponding to the target word, which includes the combined feature of the target word and the combined feature of the preceding word that has the first tag HB or the second tag DB. If one or more parameters in the combined feature do not exist, they are represented by a null symbol.
  • $c_i$ indicates whether label combination is performed (in the present embodiment, $c_1$ indicates that label combination is performed and $c_2$ that it is not; in an alternative embodiment, $c_1$ may indicate that label combination is not performed and $c_2$ that it is). $K$ is 2.
  • $p(c_i \mid x)$ denotes the probability that label combination is or is not performed in the state corresponding to $x$. In the present embodiment, $p(c_1 \mid x)$ is the label combination probability and $p(c_2 \mid x)$ is the label non-combination probability.
  • The parameters used in the maximum entropy model for calculating the label combination probability and the label non-combination probability may also be trained by minimizing the objective function (2).
  • The difference from formula (2) as used in the process of label determination is that $g_j$ here indicates whether the j-th training word is combined.
  • In step S316, it is determined whether the label combination probability is greater than the label non-combination probability. If it is determined that the label combination probability is not greater than the label non-combination probability (NO in step S316), the second label DB is not combined with the previous first label HB or second label DB. The process then proceeds to step S322, where it is determined whether the tag considered for combination with the current tag is the first tag HB or is empty.
  • If it is determined that the label combination probability is greater than the label non-combination probability (YES in step S316), then in step S318 the second label DB is combined with the previous first label HB or second label DB. After the operation of step S318 is performed, the process proceeds to step S322.
  • If it is determined that the tag considered for combination with the current tag is the first tag HB or is empty (YES in step S322), then in step S324 the combination relationship of the words corresponding to the tags is determined according to the combination of the tags. If it is determined that this tag is neither the first tag HB nor empty (NO in step S322), the process returns to step S304 to continue processing the previous tag.
  • The process then proceeds to step S110, in which the words of the medical text are combined or not combined according to the determined combination relationship, and the combined words are recognized as medical entities.
  • A medical entity identified from the words corresponding to the combination of the first label HB and the third label I, or to the combination of the second label DB and the third label I, is a continuous medical entity.
  • A medical entity identified from the words corresponding to the combination of the first label HB and the second label DB is a non-continuous medical entity.
  • In the example shown in FIG. 3, it is first determined in step S306 whether the current tag is the third tag I, and then in step S310 whether the current tag is the second tag DB.
  • In an alternative example, it may first be determined whether the current label is the second label DB, and then whether it is the third label I.
  • In the example above, the combination of the labels is determined by combining the labels from back to front (i.e., from the fourth label OY forward).
  • Alternatively, the combination of the labels may be determined by combining the labels from front to back (i.e., starting from the first label HB preceding the fourth label OY).
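The back-to-front combination of FIG. 3 can be approximated by the simplified sketch below. It is not a step-for-step rendering of the flowchart: it returns a single entity per OY tag, stops at the previous OY tag, and takes a caller-supplied `should_combine` function as a stand-in for the comparison of the label combination probability with the label non-combination probability. The sentence fragment and its tags follow the worked example in the description, except that treating "and" and "." as OY is an assumption.

```python
def collect_entity(words, tags, oy_index, should_combine):
    """Walk backward from the OY tag at oy_index and return the entity's words."""
    picked = []
    i = oy_index - 1
    while i >= 0 and tags[i] != "OY":
        tag = tags[i]
        if tag == "I":
            picked.append(i)          # continuous part: always kept
        elif tag == "HB":
            picked.append(i)          # shared beginning: combination is complete
            break
        elif tag == "DB":
            picked.append(i)
            # Look for the previous HB or DB tag before deciding to go on.
            j = i - 1
            while j >= 0 and tags[j] not in ("HB", "DB", "OY"):
                j -= 1
            has_previous_start = j >= 0 and tags[j] != "OY"
            if not has_previous_start or not should_combine(words[j], words[i]):
                break
        # ON tags fall through and are skipped.
        i -= 1
    return [words[k] for k in reversed(picked)]


words = ["hiatal", "hernia", "and", "laceration", "in", "distal", "esophagus", "."]
tags = ["DB", "I", "OY", "HB", "ON", "DB", "I", "OY"]

always = lambda previous_word, word: True   # stand-in for P(combine) > P(not combine)
never = lambda previous_word, word: False

first = collect_entity(words, tags, 2, always)
second = collect_entity(words, tags, 7, always)
second_alone = collect_entity(words, tags, 7, never)
print(first, second, second_alone)
```

With `always`, the second call skips the ON-tagged "in" and joins "laceration" to "distal esophagus", giving a non-continuous entity; with `never`, only the continuous entity "distal esophagus" is returned.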
  • The determination of the tags and the combination of the tags can be implemented by a transfer model.
  • The transfer model is a model for describing state transitions and includes, but is not limited to, a Markov model, a hidden Markov model, an N-gram model, a neural network model, and the like.
  • The transfer model moves from one state to another through actions.
  • The state in the transfer model is <L, E>.
  • L represents the sequence formed by the tags.
  • E represents the identified medical entities.
  • The actions in the transfer model are, for example, {HB, DB, I, OY, ON}. In the case where the action is OY, medical entity recognition is also performed.
  • The probability that each candidate tag is the tag of the target word is calculated by the maximum entropy model, and the tag of the target word is determined by finding the maximum probability.
  • This tag indicates the action to be performed in the current state. Based on the current state and this action, the model moves to the next state.
  • In the case where the action indicates identifying the medical entity, the tag combination probability and the tag non-combination probability are calculated by the maximum entropy model, and the combination of the tags is determined by comparing the tag combination probability with the tag non-combination probability.
  • The medical entity identified from the combined tags serves as a parameter in the state of the transfer model and helps to determine the next action to be performed.
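The interplay of state and action can be sketched as a small loop over the words. Everything concrete here is a placeholder: `State`, `run_transfer_model`, the table-driven `choose_action` (standing in for the maximum entropy tagger), and the simplified `identify_entity` (standing in for the probability-based tag combination) are hypothetical names and behaviours chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    L: list = field(default_factory=list)  # tag sequence so far
    E: list = field(default_factory=list)  # medical entities identified so far


def run_transfer_model(words, choose_action, identify_entity):
    """Apply one action per word; on OY, also run entity identification."""
    state = State()
    for t in range(len(words)):
        action = choose_action(state, words, t)   # may inspect state.E
        state.L.append(action)
        if action == "OY":
            entity = identify_entity(words, state.L, t)
            if entity:
                state.E.append(entity)
    return state


def identify_entity(words, L, oy_index):
    """Simplified stand-in: gather HB/DB/I words back to the previous OY."""
    picked = []
    i = oy_index - 1
    while i >= 0 and L[i] != "OY":
        if L[i] in ("HB", "DB", "I"):
            picked.append(words[i])
        i -= 1
    return picked[::-1]


words = ["hiatal", "hernia", "and", "laceration", "in", "distal", "esophagus", "."]
fixed_tags = ["DB", "I", "OY", "HB", "ON", "DB", "I", "OY"]  # assumed tagger output
choose_action = lambda state, words, t: fixed_tags[t]

final_state = run_transfer_model(words, choose_action, identify_entity)
print(final_state.E)
```

Because `choose_action` receives the whole state, a real tagger can use the entities already in `E` when computing the global annotation feature for the next word.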
  • The method for identifying a medical entity in a medical text is capable of identifying both continuous medical entities and non-continuous medical entities.
  • The defect of error propagation in the pipeline approach can be avoided, thereby achieving higher accuracy of medical entity recognition.
  • The process of medical entity identification is illustrated below using the sentence "EGD showed hiatal hernia and laceration in distal esophagus." as an example.
  • The example sentence "EGD showed hiatal hernia and laceration in distal esophagus." is divided into a plurality of words in order.
  • The words include punctuation. Therefore, the example sentence can be divided into ten words: "EGD", "showed", "hiatal", "hernia", "and", "laceration", "in", "distal", "esophagus", and ".". Each of these ten words is then annotated.
  • The part of speech of "." is represented by a null symbol.
  • "EGD" has no preceding words, so the words before "EGD" are represented by null symbols.
  • A transfer model is employed to model and implement the determination of the tag sequence and the combination of tags.
  • The state in the transfer model is <L, E>.
  • L represents the sequence formed by the tags.
  • E represents the identified medical entities.
  • The set of actions in the transfer model is {HB, DB, I, OY, ON}.
  • The actions in the transfer model represent what needs to be done in the current state (such as labeling the next word or identifying a medical entity) in order to reach the next state.
  • Table 1 shows the relationship between state and action in the transfer model (where <EOS> indicates the end of the transfer process). In Table 1, the serial numbers are for illustrative purposes only and are not part of the transfer model.
  • $c_1$ denotes the first label HB.
  • $c_2$ denotes the second label DB.
  • $c_3$ denotes the third label I.
  • $c_4$ denotes the fourth label OY.
  • $c_5$ denotes the fifth label ON.
  • $x$ represents an N-dimensional feature vector including the local annotation feature and the global annotation feature of "EGD".
  • The label I of "hernia" represents a continuous part of the medical entity, so it is combined with the label DB of "hiatal", giving the label combination DB, I.
  • The word combination "hiatal hernia" corresponding to the label combination DB, I is identified as a medical entity (a continuous medical entity).
  • The label ON of "in", the label DB of "distal", and the label I of "esophagus" are then obtained.
  • Its label is judged as OY (that is, the action corresponding to state S7 is OY).
  • The transfer model then performs the action of medical entity recognition.
  • The label HB of "laceration" represents the first part of the medical entity, so it is not combined with its previous label.
  • The label ON of "in" indicates that the target word is part of a non-medical entity and that the operation of identifying the medical entity is not performed, so it can be directly determined that it is not combined with the label before it.
  • $x$ denotes the N-dimensional feature vector corresponding to "distal", which includes the combined feature of "distal" and the combined feature of the preceding word having a first tag or a second tag (here, the combined feature of "laceration").
  • P1 is larger than P2
  • The combined feature relates to the morphemes contained in the corresponding word and in its preceding and following words. For example, in the case where "laceration" is misspelled as "lacerasion", the combination probability of the morpheme "lacera" and "distal" can still help raise the combination probability of "lacerasion" and "distal".
  • The label I of "esophagus" represents a continuous part of the medical entity, so it is combined with the label DB of "distal".
  • FIG. 4 shows a schematic block diagram of an apparatus 400 for identifying a medical entity in a medical text, in accordance with an embodiment of the present disclosure.
  • the apparatus 400 can include a processor 410 and a memory 420 that stores a computer program.
  • When the computer program is executed by the processor 410, the apparatus 400 is enabled to perform the steps of the method for identifying a medical entity in a medical text as shown in FIG. 1. That is, the apparatus 400 can divide the medical text into a plurality of words. Each of the plurality of words is taken in turn as the target word.
  • For the target word, the local annotation feature and the global annotation feature of the target word are determined, wherein the local annotation feature includes the target word, and the global annotation feature includes the relationship between the target word and identified medical entities.
  • The label of the target word is then determined from the plurality of candidate labels based on the local annotation feature and the global annotation feature of the target word. Then, based on the label of the target word, a combination relationship between the target word and the words preceding the target word is obtained, the combination relationship being either combination or non-combination. The combined words are then identified as a medical entity according to the combination relationship.
  • processor 410 may be, for example, a central processing unit CPU, a microprocessor, a digital signal processor (DSP), a multi-core based processor architecture processor, or the like.
  • Memory 420 can be any type of memory implemented using text storage technology, including but not limited to random access memory, read only memory, semiconductor based memory, flash memory, magnetic disk memory, and the like.
  • device 400 may also include an input device 430, such as a keyboard, mouse, etc., for inputting medical text. Additionally, device 400 can also include an output device 440, such as a display or the like, for outputting the identified medical entity.
  • The apparatus 400 determines the label of the target word from the plurality of candidate labels, based on the local annotation feature and the global annotation feature of the target word, by the following operations: based on the local annotation feature and the global annotation feature of the target word, it calculates, for each of the plurality of candidate labels, the probability that the candidate label is the label of the target word. The candidate label with the largest probability is then determined as the label of the target word.
  • The apparatus 400 obtains the combination relationship between the target word and the words preceding the target word, based on the label of the target word, by the following operations. If the label is the first label HB, the label is not combined with the label before it. If the label is the third label I, the label is combined with the preceding first label HB, second label DB, or third label I. If the label is the second label DB, the combination feature of the target word and the combination feature of the nearest preceding word that has the first label HB or the second label DB are determined, wherein a combination feature includes the morphemes contained in the corresponding word.
  • The label combination probability and the label non-combination probability are then calculated based on the combination feature of the target word and the combination feature of the nearest preceding word that has the first label HB or the second label DB.
  • In response to the label combination probability being greater than the label non-combination probability, the second label DB is combined with the preceding first label HB or second label DB.
  • In response to the label combination probability being not greater than the label non-combination probability, the second label DB is not combined with the preceding first label HB or second label DB.
  • If the label is the fourth label OY, the label is not combined with the label before it, and the operation of identifying a medical entity is performed.
  • A computer-readable storage medium storing a computer program is also provided, wherein the computer program, when executed by a processor, implements the steps of the method for identifying a medical entity in a medical text as shown in FIG. 1.

Abstract

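A method and an apparatus for identifying a medical entity in a medical text. In the method, the medical text is divided into a plurality of words (S102). Each of the plurality of words is taken in turn as a target word. For the target word, a local annotation feature and a global annotation feature of the target word are determined (S104), wherein the local annotation feature includes the target word, and the global annotation feature includes the relationship between the target word and identified medical entities. Next, a label of the target word is determined from a plurality of candidate labels based on the local annotation feature and the global annotation feature of the target word (S106). Then, based on the label of the target word, a combination relationship between the target word and the words preceding the target word is obtained (S108), the combination relationship being either combination or non-combination. The combined words are then identified as a medical entity according to the combination relationship (S110).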

Description

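Method and apparatus for identifying a medical entity in a medical text
Cross-reference to related applications
This application claims priority to Chinese Patent Application No. 201710594503.X, filed on July 20, 2017, the disclosure of which is incorporated herein by reference in its entirety as a part of this application.
Technical field
The present disclosure relates to the technical field of medical data processing, and in particular to a method and an apparatus for identifying a medical entity in a medical text.
Background
With the development of medical information technology, a large amount of electronic medical text (for example, electronic medical records and physical examination reports) has become available. These medical texts are used to support clinical decision systems. However, because most electronic medical text is written in natural language, the useful information in it cannot be used directly by clinical decision systems that rely on structured data. To make full use of electronic medical text, natural language processing techniques that can extract structured data from natural language have received wide attention in clinical medicine. As a basic task of clinical natural language processing, the identification of medical entities has long attracted the attention of the medical community. Medical entities usually contain words related to drugs, problems (including diseases and symptoms), examinations, and treatments. Medical entities include continuous medical entities (composed of consecutive words) and discontinuous medical entities (composed of non-consecutive words).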
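Summary
The embodiments described herein provide a method and an apparatus for identifying a medical entity in a medical text.
According to a first aspect of the present disclosure, a method for identifying a medical entity in a medical text is provided. In the method, the medical text is divided into a plurality of words. Each of the plurality of words is taken in turn as a target word. For the target word, a local annotation feature and a global annotation feature of the target word are determined, wherein the local annotation feature includes the target word, and the global annotation feature includes the relationship between the target word and identified medical entities. Next, a label of the target word is determined from a plurality of candidate labels based on the local annotation feature and the global annotation feature of the target word. Then, based on the label of the target word, a combination relationship between the target word and the words preceding the target word is obtained, the combination relationship being either combination or non-combination. The combined words are then identified as a medical entity according to the combination relationship.
In an embodiment of the present disclosure, the plurality of candidate labels includes a first label, a second label, a third label, a fourth label, and a fifth label. The first label is configured to indicate that the word is a shared beginning part of a medical entity. The second label is configured to indicate that the word is an unshared beginning part of a medical entity. The third label is configured to indicate that the word is a continuing part of a medical entity. The fourth label is configured to indicate that the word is part of a non-medical entity and that the operation of identifying a medical entity is to be performed. The fifth label is configured to indicate that the word is part of a non-medical entity and that the operation of identifying a medical entity is not to be performed.
In an embodiment of the present disclosure, in the step of determining the label of the target word from the plurality of candidate labels based on the local annotation feature and the global annotation feature of the target word, the probability that each candidate label is the label of the target word is calculated for each of the plurality of candidate labels based on the local annotation feature and the global annotation feature of the target word. The candidate label with the largest probability is then determined as the label of the target word.
In an embodiment of the present disclosure, the probability is calculated using a maximum entropy model.
In an embodiment of the present disclosure, in the step of obtaining the combination relationship between the target word and the words preceding the target word based on the label of the target word: if the label is the first label, the label is not combined with the label before it; if the label is the third label, the label is combined with the preceding first label, second label, or third label; if the label is the second label, then: the combination feature of the target word and the combination feature of the nearest preceding word that has the first label or the second label are determined, wherein a combination feature includes the morphemes contained in the corresponding word; a label combination probability and a label non-combination probability are calculated based on the combination feature of the target word and the combination feature of the nearest preceding word that has the first label or the second label; in response to the label combination probability being greater than the label non-combination probability, the second label is combined with the preceding first label or second label; in response to the label combination probability being not greater than the label non-combination probability, the second label is not combined with the preceding first label or second label; if the label is the fourth label, the label is not combined with the label before it, and the operation of identifying a medical entity is performed; if the label is the fifth label, the label is not combined with the label before it; and the combination relationship of the words corresponding to the labels is determined according to the combination of the labels.
In an embodiment of the present disclosure, the local annotation feature further includes the X words before the target word and the X words after the target word, where X is a natural number.
In an embodiment of the present disclosure, the local annotation feature further includes the part of speech of the target word, the parts of speech of the X words before the target word, and the parts of speech of the X words after the target word.
In an embodiment of the present disclosure, the global annotation feature further includes the relationship between the Y words before the target word and identified medical entities and the relationship between the Y words after the target word and identified medical entities, where Y is a natural number.
In an embodiment of the present disclosure, the global annotation feature includes: whether the target word is contained in an identified medical entity, whether the Y words before the target word are contained in an identified medical entity, and whether the Y words after the target word are contained in an identified medical entity.
In an embodiment of the present disclosure, the combination feature further includes the morphemes contained in the Z words before the corresponding word and the morphemes contained in the Z words after the corresponding word, where Z is a natural number.
In an embodiment of the present disclosure, the combination feature includes: the morphemes contained in the corresponding word, the Z words before the corresponding word, the Z words after the corresponding word, the morphemes contained in the Z words before the corresponding word, and the morphemes contained in the Z words after the corresponding word.
In an embodiment of the present disclosure, the label combination probability and the label non-combination probability are calculated using a maximum entropy model.
In an embodiment of the present disclosure, the maximum entropy model is obtained by training with an optimization algorithm.
According to a second aspect of the present disclosure, an apparatus for identifying a medical entity in a medical text is provided. The apparatus includes at least one processor and at least one memory storing a computer program. When executed by the at least one processor, the computer program causes the apparatus to: divide the medical text into a plurality of words; take each of the plurality of words in turn as a target word and perform the following operations on the target word: determine a local annotation feature and a global annotation feature of the target word, wherein the local annotation feature includes the target word, and the global annotation feature includes the relationship between the target word and identified medical entities; determine a label of the target word from a plurality of candidate labels based on the local annotation feature and the global annotation feature of the target word; obtain, based on the label of the target word, a combination relationship between the target word and the words preceding the target word, the combination relationship being either combination or non-combination; and identify the combined words as a medical entity according to the combination relationship.
According to a third aspect of the present disclosure, a computer-readable storage medium storing a computer program is provided. When executed by a processor, the computer program implements the steps of the above method for identifying a medical entity in a medical text.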
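Brief description of the drawings
To explain the technical solutions of the embodiments of the present disclosure more clearly, the drawings of the embodiments are briefly described below. It should be understood that the drawings described below relate only to some embodiments of the present disclosure and do not limit the present disclosure, in which:
FIG. 1 is a flowchart of a method for identifying a medical entity in a medical text according to an embodiment of the present disclosure;
FIG. 2 is an exemplary flowchart of the process of determining the label of the target word from the plurality of candidate labels in the embodiment shown in FIG. 1;
FIG. 3 is an exemplary flowchart that describes in detail the process of determining the combination relationship of words in the embodiment shown in FIG. 1;
FIG. 4 is a schematic block diagram of an apparatus for identifying a medical entity in a medical text according to an embodiment of the present disclosure.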
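Detailed description
To make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are evidently some, not all, of the embodiments of the present disclosure. All other embodiments that a person skilled in the art obtains from the described embodiments without creative effort also fall within the scope of protection of the present disclosure.
Unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by a person skilled in the art to which the subject matter of the present disclosure belongs. It will be further understood that terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the specification and the related art, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
At present, methods for identifying medical entities focus mainly on continuous medical entities. However, discontinuous medical entities also occur in real medical text. For example, in the medical text "EGD showed hiatal hernia and laceration in distal esophagus.", "hiatal hernia" and "laceration distal esophagus" are two medical entities. "hiatal hernia" is a continuous medical entity, whereas "laceration distal esophagus" is a discontinuous medical entity.
The embodiments of the present disclosure use English medical text as an example to explain the method of medical entity identification. A person skilled in the art will appreciate, however, that methods and apparatuses that use the method according to the embodiments of the present disclosure to identify medical text in other languages (for example, Chinese) also fall within the scope of protection of the present disclosure.
FIG. 1 shows a flowchart of a method for identifying a medical entity in a medical text according to an embodiment of the present disclosure.
As shown in FIG. 1, in step S102, the medical text is divided into a plurality of words. English medical text can be divided into words at the spaces between words. For Chinese medical text, in an embodiment of the present invention, natural language processing techniques can be used to perform lexical analysis on the input medical text in order to divide it into a plurality of words. In an embodiment of the present disclosure, lexical analysis algorithms and tools such as the conditional random field algorithm or the stanford-segmenter word segmentation tool provided by Stanford University can be used to preprocess the medical text. In some embodiments, if the medical text contains punctuation marks, the punctuation marks are also treated as words.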
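Step S102 for English text can be sketched as follows. This is a minimal illustration, not the tokenizer used in the disclosure: the function name `split_into_words` and the regular expression are assumptions, and the sketch simply treats every punctuation mark as its own word, as the paragraph above allows.

```python
import re

def split_into_words(text: str) -> list[str]:
    """Split English text into words; each punctuation mark becomes a word."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "EGD showed hiatal hernia and laceration in distal esophagus."
words = split_into_words(sentence)
print(words)
```

For the example sentence this yields ten words, the last of which is the full stop. Hyphenated terms would be split by this pattern, so a real system would likely need a more careful rule.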
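In step S104, each of the plurality of words is taken in turn as the target word, and for each target word, the local annotation feature and the global annotation feature of the target word are determined.
In an embodiment of the present disclosure, the local annotation feature may include the target word. The local annotation feature may also include the X words before the target word and the X words after the target word. Further, the local annotation feature may include the part of speech of the target word, the parts of speech of the X words before the target word, and the parts of speech of the X words after the target word. Here, X is a natural number. For example, the local annotation feature includes: the target word, the three words before the target word, the three words after the target word, the part of speech of the target word, the parts of speech of the three words before the target word, and the parts of speech of the three words after the target word.
In an embodiment of the present disclosure, the global annotation feature may include the relationship between the target word and identified medical entities. Identified medical entities may include medical entities already identified in the current medical text, and may also include medical entities already identified in other medical texts. Further, the global annotation feature may include the relationship between the Y words before the target word and identified medical entities and the relationship between the Y words after the target word and identified medical entities. Here, Y is a natural number. For example, the global annotation feature includes: whether the target word is contained in an identified medical entity, whether the word before the target word is contained in an identified medical entity, and whether the word after the target word is contained in an identified medical entity.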
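The two feature groups can be sketched as plain dictionaries. Everything named here is hypothetical: the `EMPTY` placeholder stands in for the null symbol used when a neighbouring word does not exist, the feature keys are invented for illustration, and "contained in an identified medical entity" is approximated by membership in a set of known entity words. The defaults X = 3 and Y = 1 follow the examples in the text.

```python
EMPTY = "<NULL>"  # hypothetical null symbol for positions outside the sentence

def window(items, i, offset):
    """Return items[i + offset], or EMPTY when that position does not exist."""
    j = i + offset
    return items[j] if 0 <= j < len(items) else EMPTY

def local_features(words, pos_tags, i, x=3):
    """Target word, its part of speech, and the same for X neighbours each side."""
    feats = {"w0": words[i], "p0": pos_tags[i]}
    for k in range(1, x + 1):
        feats[f"w-{k}"] = window(words, i, -k)
        feats[f"w+{k}"] = window(words, i, k)
        feats[f"p-{k}"] = window(pos_tags, i, -k)
        feats[f"p+{k}"] = window(pos_tags, i, k)
    return feats

def global_features(words, i, known_entity_words, y=1):
    """Whether the target word and its Y neighbours occur in identified entities."""
    feats = {"in_entity0": words[i] in known_entity_words}
    for k in range(1, y + 1):
        feats[f"in_entity-{k}"] = window(words, i, -k) in known_entity_words
        feats[f"in_entity+{k}"] = window(words, i, k) in known_entity_words
    return feats
```

A real system would still have to map these symbolic features to the numeric N-dimensional vector x, for example by one-hot encoding; that step is not shown.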
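In step S106, the label of the target word is determined from the plurality of candidate labels based on the local annotation feature and the global annotation feature of the target word.
In an embodiment of the present disclosure, the plurality of candidate labels may include, for example: a first label HB, a second label DB, a third label I, a fourth label OY, and a fifth label ON. The first label HB indicates that the word is a shared beginning part of a medical entity. The second label DB indicates that the word is an unshared beginning part of a medical entity. The third label I indicates that the word is a continuing part of a medical entity. The fourth label OY indicates that the word is part of a non-medical entity and that the operation of identifying a medical entity is to be performed. The fifth label ON indicates that the word is part of a non-medical entity and that the operation of identifying a medical entity is not to be performed.
More specifically, FIG. 2 illustrates an exemplary process of determining the label of the target word from the plurality of candidate labels in the embodiment shown in FIG. 1. In the example shown in FIG. 2, in step S204, the probability that a candidate label is the label of the target word is calculated for each of the plurality of candidate labels. In an embodiment of the present disclosure, the probability of each of the plurality of candidate labels (for example, the first to fifth labels) can be obtained through a maximum entropy model based on the local annotation feature and the global annotation feature of the target word. The maximum entropy model can be expressed, for example, as follows: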
$$p(c_i \mid x) = \frac{\exp(w_i^{\top} x)}{\sum_{k=1}^{K} \exp(w_k^{\top} x)} \tag{1}$$
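where w_i denotes an N-dimensional column vector of parameters, 1≤i≤K, K denotes the total number of labels, x denotes the N-dimensional feature vector corresponding to the target word, c_i denotes the i-th label (in this embodiment, c_1 denotes the first label, c_2 denotes the second label, and so on), and p(c_i|x) denotes the probability that the label is determined to be c_i in the state corresponding to x. x includes the local annotation feature and the global annotation feature of the target word. The number of parameters in the local and global annotation features determines the dimension N of x. When one or more parameters of the local and global annotation features do not exist, the missing parameters are represented by a null symbol.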
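The label probabilities can be sketched in a few lines, assuming the maximum entropy model takes the usual multinomial-logistic (softmax) form over the scores w_i·x. The parameter matrix `W` and the feature vector `x` below are made-up toy values, not trained parameters, and subtracting the maximum score is only a standard numerical-stability trick.

```python
import math

LABELS = ["HB", "DB", "I", "OY", "ON"]  # c_1 ... c_5

def label_probabilities(W, x):
    """Return p(c_i | x) for every row w_i of W, assuming a softmax form."""
    scores = [sum(w_n * x_n for w_n, x_n in zip(w_i, x)) for w_i in W]
    top = max(scores)                      # subtract the max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy parameters (K = 5 labels, N = 3 features), purely illustrative.
W = [
    [0.1, 0.0, 0.0],  # HB
    [0.0, 0.2, 0.0],  # DB
    [0.0, 0.0, 0.1],  # I
    [0.0, 0.3, 0.0],  # OY
    [1.0, 0.0, 0.5],  # ON
]
x = [1.0, 0.0, 1.0]
probs = label_probabilities(W, x)
best = LABELS[probs.index(max(probs))]     # step S206: take the most probable label
```

With these toy numbers the score for ON is the largest, so `best` is `"ON"`.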
In an embodiment of the present disclosure, w_i can be trained by minimizing the following objective function (2):
$$\min_{W}\; -\sum_{j=1}^{M} \log p(g_j) + \frac{\lambda}{2}\,\lVert W \rVert_2^2 \tag{2}$$
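where W ∈ R^{K×N} denotes the parameter matrix, the i-th row vector of W is denoted w_i, g_j denotes the label of the j-th training word, p(g_j) denotes the probability that the label of the j-th training word is g_j, M denotes the number of training words, and λ denotes the coefficient of the L2 regularization term, λ>0.
To train w_i, a parameter matrix W with initial values can be set in advance, and the x of each of the M training words is substituted into formula (1) to obtain p(g_j) for the M training words. An optimization algorithm is then used to obtain an updated W. The update of W is repeated until the values of the elements of W stabilize, at which point the training of w_i ends. The W obtained in this way is used in formula (1). Optionally, the optimization algorithm can be the Mini-batched AdaGrad algorithm.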
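The training loop can be sketched as below. It is an illustration under assumptions, not the procedure of the disclosure: it assumes the objective is the negative log-likelihood plus an L2 penalty (λ/2)·‖W‖², it uses a fixed number of epochs instead of checking that W has stabilized, and the learning rate, batch size, and function names are all invented. The update rule is the standard AdaGrad step applied per mini-batch.

```python
import math
import random

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train(X, g, K, lam=0.01, lr=0.5, epochs=50, batch_size=2, seed=0):
    """Fit W (K x N) with mini-batched AdaGrad; X are feature vectors, g label indices."""
    rng = random.Random(seed)
    N = len(X[0])
    W = [[0.0] * N for _ in range(K)]      # initial parameter matrix
    G = [[0.0] * N for _ in range(K)]      # running sum of squared gradients
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            # Gradient of the L2 term (lambda / 2) * ||W||^2 is lambda * W.
            grad = [[lam * W[i][n] for n in range(N)] for i in range(K)]
            for j in batch:
                p = softmax([sum(W[i][n] * X[j][n] for n in range(N)) for i in range(K)])
                for i in range(K):
                    err = p[i] - (1.0 if g[j] == i else 0.0)
                    for n in range(N):
                        grad[i][n] += err * X[j][n]
            for i in range(K):
                for n in range(N):
                    G[i][n] += grad[i][n] ** 2
                    W[i][n] -= lr * grad[i][n] / (math.sqrt(G[i][n]) + 1e-8)
    return W

def predict(W, x):
    scores = [sum(w_n * x_n for w_n, x_n in zip(w_i, x)) for w_i in W]
    return scores.index(max(scores))
```

On a tiny separable data set such as `X = [[1, 0], [0, 1], [1, 0], [0, 1]]` with labels `[0, 1, 0, 1]`, the learned W classifies every training vector correctly.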
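In step S204, the probability p(c_i|x) is calculated for each value of i.
In step S206, the candidate label c_i with the largest probability p(c_i|x) is determined as the label of the target word.
Returning to FIG. 1, in step S108, the combination relationship between the target word and the words preceding the target word is obtained based on the label of the target word. The combination relationship is either combination or non-combination.
FIG. 3 is a more specific example of the embodiment shown in FIG. 1 and shows in detail the process of determining the combination relationship of words (S108).
After the label of the target word is determined in step S106, step S302 determines whether the current label is the fourth label OY. The fourth label OY indicates that the target word is part of a non-medical entity and that the operation of identifying a medical entity is to be performed. If the label of the target word indicates that the identification operation is to be performed, the combination of the labels that precede the label of the target word and that indicate that the corresponding words can be part of a medical entity is determined. Such labels include, for example, the first label HB, the second label DB, and the third label I. If the current label is not the fourth label OY ("No" in step S302), the process returns to step S106 and continues by determining the label of the next word.
If the current label is the fourth label OY ("Yes" in step S302), the combination of the labels preceding the fourth label OY is determined.
In step S304, the label preceding the current label is taken as the current label. For example, if the current label is the fourth label OY, the label preceding the fourth label OY becomes the current label.
In step S306, it is determined whether the current label is the third label I. The third label I indicates that the corresponding word is a continuing part of a medical entity. Therefore, if the current label is the third label I ("Yes" in step S306), the current label is combined in step S308 with the label before it, which is one of the first label HB, the second label DB, and the third label I. Step S320 then determines whether the label combined with the current label is the first label HB. The first label HB indicates that the corresponding word is the very beginning of a medical entity. A medical entity can contain only one first label HB. Once the combination of labels includes a first label HB, the label combination can be considered complete. Therefore, if the label combined with the current label is the first label HB ("Yes" in step S320), the combination relationship of the words corresponding to the labels is determined in step S324 according to the combination of the labels. If the label combined with the current label is not the first label HB ("No" in step S320), the process returns to step S304 and continues with the preceding label.
If the current label is not the third label I ("No" in step S306), the process proceeds to step S310, which determines whether the current label is the second label DB. If the current label is not the second label DB ("No" in step S310), the process returns to step S304 and continues with the preceding label.
The second label DB indicates that the corresponding word is an unshared beginning part of a medical entity. The word may be the very beginning of the medical entity (that is, it does not need to be combined with a preceding first label HB or second label DB), or it may not be the very beginning of the medical entity (that is, it needs to be combined with a preceding first label HB or second label DB). Therefore, if the label is the second label DB, it must be decided whether to combine the second label DB with the preceding first label HB or second label DB.
In the embodiment shown in FIG. 3, if the current label is the second label DB ("Yes" in step S310), step S312 determines the combination feature of the word corresponding to the second label DB (that is, the target word) and the combination feature of the word corresponding to the nearest first label HB or second label DB before the second label DB. If there is no first label HB or second label DB before the second label DB, the label before the second label DB is defined as empty.
In an embodiment of the present disclosure, the combination feature may include the morphemes contained in the corresponding word. For Chinese medical text, a morpheme here means a Chinese character. Further, the combination feature may include the morphemes contained in the Z words before the corresponding word and the morphemes contained in the Z words after the corresponding word. Here, Z is a natural number. In this way, even when the medical text contains spelling mistakes (for Chinese medical text, wrongly written characters), the combination feature can still provide information for label combination from the correct morphemes in the corresponding word, in the words before it, and in the words after it. For example, the combination feature includes: the morphemes contained in the corresponding word, the three words before the corresponding word, the three words after the corresponding word, the morphemes contained in the three words before the corresponding word, and the morphemes contained in the three words after the corresponding word.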
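Why sub-word units help with misspellings can be shown with a small sketch. The disclosure does not say how English morphemes are obtained, so this example substitutes character n-grams as a rough stand-in for morphemes; the function name `char_ngrams` and the choice n = 6 are assumptions made only for illustration.

```python
def char_ngrams(word, n=6):
    """Character n-grams of a word, used here as a stand-in for its morphemes."""
    if len(word) <= n:
        return {word}
    return {word[i:i + n] for i in range(len(word) - n + 1)}

correct = char_ngrams("laceration")
misspelt = char_ngrams("lacerasion")
shared = correct & misspelt
print(shared)
```

The only unit the two spellings share is `lacera`, so a feature built on that unit still fires for the misspelled word and can carry over what was learned from the correct spelling.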
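In step S314, the label combination probability and the label non-combination probability are calculated based on the combination feature of the word corresponding to the second label DB and the combination feature of the word corresponding to the nearest first label HB or second label DB before the second label DB. In an embodiment of the present disclosure, these two probabilities can be calculated from those combination features through a maximum entropy model (that is, using formula (1)). In this case, in formula (1), w_i denotes an N-dimensional column vector of parameters (the elements of W may differ between the labeling process and the label combination process, and the number of rows K and the number of columns N of W may also differ), with 1≤i≤K. x denotes the N-dimensional feature vector corresponding to the target word, which includes the combination feature of the target word and the combination feature of the nearest preceding word that has the first label HB or the second label DB. If one or more parameters of the combination feature do not exist, the missing parameters are represented by a null symbol. c_i indicates whether label combination is performed (in this embodiment, c_1 denotes performing label combination and c_2 denotes not performing label combination; in an alternative embodiment, c_1 can denote not performing label combination and c_2 can denote performing label combination). K is 2. p(c_i|x) denotes the probability of whether label combination is performed in the state corresponding to x. In this embodiment, p(c_1|x) denotes the probability of performing label combination in the state corresponding to x, and p(c_2|x) denotes the probability of not performing label combination in the state corresponding to x. In an alternative embodiment, p(c_1|x) can denote the probability of not performing label combination in the state corresponding to x, and p(c_2|x) the probability of performing label combination.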
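For the two-class case it may help to write the model out. Assuming formula (1) has the usual softmax form and that c_1 denotes combination, the case K = 2 reduces to a logistic function of the difference of the two parameter vectors:

```latex
p(c_1 \mid x)
  = \frac{\exp(w_1^{\top} x)}{\exp(w_1^{\top} x) + \exp(w_2^{\top} x)}
  = \frac{1}{1 + \exp\bigl(-(w_1 - w_2)^{\top} x\bigr)},
\qquad
p(c_2 \mid x) = 1 - p(c_1 \mid x).
```

Under that assumption, the label combination probability exceeds the label non-combination probability exactly when (w_1 - w_2)^T x > 0, so the decision is a linear test on the combination features.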
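In an embodiment of the present disclosure, the parameters of the maximum entropy model used to calculate the label combination probability and the label non-combination probability can also be trained by minimizing objective function (2). Unlike formula (2) as used in label determination, here g_j indicates whether the j-th training word is combined.
Step S316 determines whether the label combination probability is greater than the label non-combination probability. If the label combination probability is not greater than the label non-combination probability ("No" in step S316), the second label DB is not combined with the preceding first label HB or second label DB. The process then proceeds to step S322, which determines whether the label that was considered for combination with the current label is the first label HB or is empty.
If the label combination probability is greater than the label non-combination probability ("Yes" in step S316), the second label DB is combined in step S318 with the preceding first label HB or second label DB. After step S318, the process proceeds to step S322.
If the label that was considered for combination with the current label is the first label HB or is empty ("Yes" in step S322), the combination relationship of the words corresponding to the labels is determined in step S324 according to the combination of the labels. If that label is neither the first label HB nor empty ("No" in step S322), the process returns to step S304 and continues with the preceding label.
After step S324, the process proceeds to step S110, in which the words of the medical text are combined or not combined according to the determined combination relationship, and the combined words are identified as a medical entity. In step S110, a medical entity identified from the words corresponding to a combination of the first label HB with the third label I, or of the second label DB with the third label I, is a continuous medical entity. A medical entity identified from the words corresponding to a combination of the first label HB with the second label DB is a discontinuous medical entity.
In the embodiment shown in FIG. 3, step S306 first determines whether the current label is the third label I, and step S310 then determines whether the current label is the second label DB. Alternatively, it is possible to determine first whether the current label is the second label DB and then whether it is the third label I. In addition, the embodiment shown in FIG. 3 determines the combination of labels by combining them from back to front (that is, backward from the fourth label OY). A person skilled in the art will appreciate that, in an alternative embodiment, the combination of labels can also be determined from front to back (that is, forward from the nearest first label HB before the fourth label OY).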
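The backward combination of FIG. 3 can be sketched as follows. This is a simplified reading, not the full flowchart: it builds a single chain of labels before one OY and stops as soon as a DB is not combined, whereas FIG. 3 may continue with earlier labels. The function names are invented, and `should_combine` is a hypothetical callable standing in for the comparison of the label combination probability with the label non-combination probability.

```python
ENTITY_LABELS = {"HB", "DB", "I"}

def previous_index(labels, start, wanted):
    """Index of the nearest label before `start` that is in `wanted`, else None."""
    for j in range(start - 1, -1, -1):
        if labels[j] in wanted:
            return j
    return None

def combine_before(labels, oy_index, should_combine):
    """Walk backward from an OY label; return the word indices of one entity."""
    current = previous_index(labels, oy_index, ENTITY_LABELS)
    chain = []
    while current is not None:
        chain.append(current)
        label = labels[current]
        if label == "HB":
            break  # an HB is the very beginning of an entity
        if label == "I":
            # An I always joins the nearest preceding HB, DB or I.
            current = previous_index(labels, current, ENTITY_LABELS)
        else:
            # A DB joins the nearest preceding HB or DB only if the model says so.
            candidate = previous_index(labels, current, {"HB", "DB"})
            if candidate is None or not should_combine(candidate, current):
                break
            current = candidate
    return sorted(chain)

words = ["EGD", "showed", "hiatal", "hernia", "and",
         "laceration", "in", "distal", "esophagus", "."]
labels = ["ON", "ON", "DB", "I", "OY", "HB", "ON", "DB", "I", "OY"]
always = lambda earlier, later: True   # stand-in for "P1 > P2"
first = [words[i] for i in combine_before(labels, 4, always)]
second = [words[i] for i in combine_before(labels, 9, always)]
```

With the label sequence from the worked example, `first` is the continuous entity "hiatal hernia" and `second` is the discontinuous entity "laceration distal esophagus".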
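In an embodiment of the present disclosure, the determination of labels and the combination of labels can be implemented by a transition model. A transition model is a model for describing state transitions, including but not limited to Markov models, hidden Markov models, N-gram models, and neural network models. A transition model moves from one state to another through an action. In an embodiment of the present disclosure, a state of the transition model is <L,E>, where L denotes the sequence formed by the labels and E denotes the identified medical entities. The actions of the transition model are, for example, {HB,DB,I,OY,ON}. When the action is OY, medical entity recognition is also performed.
Specifically, in the transition model, the maximum entropy model is used to calculate the probability that each candidate label is the label of the target word, and the label of the target word is determined by finding the largest probability. This label represents the action to be taken in the current state. The model moves to the next state based on the current state and the action to be taken. When the action indicates that a medical entity is to be identified, the maximum entropy model is used to calculate the label combination probability and the label non-combination probability, and the combination of labels is determined by comparing the two. The medical entity identified from the combined labels becomes one of the parameters of the state of the transition model and helps determine the next action to be taken.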
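One step of such a transition model can be sketched with a small state class. The names `State` and `transition` and the `recognize` callback are hypothetical; `recognize` stands in for whatever label combination procedure is used when the action is OY and is assumed to return the word indices of the entity found, or an empty value if there is none.

```python
from dataclasses import dataclass, field

@dataclass
class State:
    labels: list = field(default_factory=list)    # L: label sequence so far
    entities: list = field(default_factory=list)  # E: identified entities

def transition(state, action, recognize):
    """Apply one action from {HB, DB, I, OY, ON} and return the next state."""
    labels = state.labels + [action]
    entities = list(state.entities)
    if action == "OY":
        # OY labels the word and also triggers medical entity recognition.
        entity = recognize(labels, len(labels) - 1)
        if entity:
            entities.append(entity)
    return State(labels, entities)

def recognize_stub(labels, oy_index):
    """Toy recognizer: every HB/DB/I label before the OY forms one entity."""
    return tuple(i for i in range(oy_index) if labels[i] in {"HB", "DB", "I"})

state = State()
for action in ["ON", "DB", "I", "OY"]:
    state = transition(state, action, recognize_stub)
```

After the four actions the state holds the label sequence ON, DB, I, OY and one entity covering word positions 1 and 2. Because E is part of the state, later labeling decisions can use it as a global annotation feature.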
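The method for identifying a medical entity in a medical text according to the embodiments of the present disclosure can identify both continuous and discontinuous medical entities. In addition, the embodiments of the present disclosure take a joint approach (the local and global annotation features of a word are considered during labeling, and the interaction between labeling and combination is taken into account). This avoids the error propagation that affects the pipeline approach (in which labeling is completed using only local annotation features before label combination is performed) and achieves higher accuracy in medical entity identification.
The process of medical entity identification is illustrated below using "EGD showed hiatal hernia and laceration in distal esophagus." as an example.
First, the example sentence "EGD showed hiatal hernia and laceration in distal esophagus." is divided into words in order. In an embodiment of the present disclosure, punctuation marks count as words. The example sentence can therefore be divided into ten words: "EGD", "showed", "hiatal", "hernia", "and", "laceration", "in", "distal", "esophagus", and ".". Each of the ten words is then tagged with its part of speech. The part of speech of "." is represented by a null symbol. In addition, because there is no word before "EGD", for example, the word before "EGD" is represented by a null symbol.
The ten words are then processed in turn. In this example, a transition model is used to model and implement the determination of the label sequence and the combination of labels. A state of the transition model is <L,E>, where L denotes the sequence formed by the labels and E denotes the identified medical entities. The set of actions of the transition model is {HB,DB,I,OY,ON}. An action represents what the transition model needs to do in the current state (for example, labeling the next word or identifying a medical entity) in order to reach the next state. Table 1 shows the relationship between states and actions in the transition model (where <EOS> indicates the end of the transition process). In Table 1, the serial numbers are for illustrative purposes only and are not part of the transition model.
Table 1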
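[Table 1 appears only as an image in the original publication (Figure PCTCN2018084214-appb-000003). It lists the states of the transition model and the action taken in each state.]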
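For "EGD", formula (1) is used to calculate the probabilities that the candidate labels are the label of "EGD": P1=p(c_1|x), P2=p(c_2|x), P3=p(c_3|x), P4=p(c_4|x), and P5=p(c_5|x). Here, c_1 denotes the first label HB, c_2 the second label DB, c_3 the third label I, c_4 the fourth label OY, and c_5 the fifth label ON. x denotes an N-dimensional feature vector that includes the local annotation feature and the global annotation feature of "EGD". In x, the morphemes in the local and global annotation features are converted into N corresponding numerical values. Comparing P1, P2, P3, P4, and P5 shows that P5 is the largest. The label of "EGD" is therefore determined to be the fifth label ON.
Similarly, the label ON of "showed", the label DB of "hiatal", and the label I of "hernia" are obtained. At this point, the transition model is in state S2.
For "and", the label is judged to be OY (that is, the action corresponding to state S2 is OY). When the transition model takes the action OY, it performs medical entity recognition. The label ON of "EGD" and "showed" indicates that the target word is part of a non-medical entity and that the operation of identifying a medical entity is not to be performed, so it can be decided directly that these labels are not combined with the labels before them. The label DB of "hiatal" represents an unshared beginning part of a medical entity. Because there is no HB label before "hiatal", this DB is not combined with the labels before it. The label I of "hernia" represents a continuing part of a medical entity, so the label I of "hernia" is combined with the label DB of "hiatal". This yields the label combination DB, I. The word combination "hiatal hernia" corresponding to the label combination DB, I is identified as a medical entity (a continuous medical entity).
Next, for "laceration", formula (1) is used to calculate P1=p(c_1|x), P2=p(c_2|x), P3=p(c_3|x), P4=p(c_4|x), and P5=p(c_5|x). Comparing P1, P2, P3, P4, and P5 shows that P1 is the largest. The label of "laceration" is therefore judged to be the first label HB.
Similarly, the label ON of "in", the label DB of "distal", and the label I of "esophagus" are obtained. For ".", the label is judged to be OY (that is, the action corresponding to state S7 is OY). When the transition model takes the action OY, it performs medical entity recognition. The label HB of "laceration" represents the very beginning of a medical entity, so it is not combined with the label before it. The label ON of "in" indicates that the target word is part of a non-medical entity and that the operation of identifying a medical entity is not to be performed, so it can be decided directly that it is not combined with the label before it. The label DB of "distal" represents an unshared beginning part of a medical entity, so it must be decided whether to combine the label DB of "distal" with the preceding first label HB or second label DB (here, the label HB of "laceration"). Formula (1) is again used to calculate the label combination probability P1=p(c_1|x) and the label non-combination probability P2=p(c_2|x), where c_1 denotes performing label combination and c_2 denotes not performing label combination. x denotes the N-dimensional feature vector corresponding to "distal", which includes the combination feature of "distal" and the combination feature of the nearest word before "distal" that has the first label or the second label (here, the combination feature of "laceration"). When P1 is greater than P2, it is decided that the label HB of "laceration" is to be combined with the label DB of "distal". Because the combination feature depends on the morphemes contained in the corresponding word and in the words before and after it, if "laceration" is misspelled as "lacerasion", for example, the combination probability of the morpheme "lacera" with "distal" can still help raise the combination probability of "lacerasion" with "distal".
The label I of "esophagus" represents a continuing part of a medical entity, so the label I of "esophagus" is combined with the label DB of "distal".
This yields the label combination HB, DB, I. The word combination "laceration distal esophagus" corresponding to the label combination HB, DB, I is identified as a medical entity (a discontinuous medical entity).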
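FIG. 4 shows a schematic block diagram of an apparatus 400 for identifying a medical entity in a medical text according to an embodiment of the present disclosure. As shown in FIG. 4, the apparatus 400 may include a processor 410 and a memory 420 storing a computer program. When the computer program is executed by the processor 410, the apparatus 400 is enabled to perform the steps of the method for identifying a medical entity in a medical text as shown in FIG. 1. That is, the apparatus 400 can divide the medical text into a plurality of words. Each of the plurality of words is taken in turn as a target word. For the target word, the local annotation feature and the global annotation feature of the target word are determined, wherein the local annotation feature includes the target word, and the global annotation feature includes the relationship between the target word and identified medical entities. Next, the label of the target word is determined from a plurality of candidate labels based on the local annotation feature and the global annotation feature of the target word. Then, based on the label of the target word, a combination relationship between the target word and the words preceding the target word is obtained, the combination relationship being either combination or non-combination. The combined words are then identified as a medical entity according to the combination relationship.
In an embodiment of the present disclosure, the processor 410 may be, for example, a central processing unit (CPU), a microprocessor, a digital signal processor (DSP), or a processor based on a multi-core processor architecture. The memory 420 may be any type of memory implemented using text storage technology, including but not limited to random access memory, read-only memory, semiconductor-based memory, flash memory, and magnetic disk memory.
In addition, in an embodiment of the present disclosure, the apparatus 400 may also include an input device 430, such as a keyboard or a mouse, for inputting medical text. The apparatus 400 may further include an output device 440, such as a display, for outputting the identified medical entities.
In an embodiment of the present disclosure, the apparatus 400 determines the label of the target word from the plurality of candidate labels, based on the local annotation feature and the global annotation feature of the target word, by the following operations: based on the local annotation feature and the global annotation feature of the target word, the probability that a candidate label is the label of the target word is calculated for each of the plurality of candidate labels. The candidate label with the largest probability is then determined as the label of the target word.
In an embodiment of the present disclosure, the apparatus 400 obtains the combination relationship between the target word and the words preceding the target word, based on the label of the target word, by the following operations. If the label is the first label HB, the label is not combined with the label before it. If the label is the third label I, the label is combined with the preceding first label HB, second label DB, or third label I. If the label is the second label DB, the combination feature of the target word and the combination feature of the nearest preceding word that has the first label HB or the second label DB are determined, wherein a combination feature includes the morphemes contained in the corresponding word. The label combination probability and the label non-combination probability are then calculated based on the combination feature of the target word and the combination feature of the nearest preceding word that has the first label HB or the second label DB. In response to the label combination probability being greater than the label non-combination probability, the second label DB is combined with the preceding first label HB or second label DB. In response to the label combination probability being not greater than the label non-combination probability, the second label DB is not combined with the preceding first label HB or second label DB. If the label is the fourth label OY, the label is not combined with the label before it, and the operation of identifying a medical entity is performed. If the label is the fifth label ON, the label is not combined with the label before it. The combination relationship of the words corresponding to the labels is then determined according to the combination of the labels.
In other embodiments of the present disclosure, a computer-readable storage medium storing a computer program is also provided, wherein the computer program, when executed by a processor, implements the steps of the method for identifying a medical entity in a medical text as shown in FIG. 1.
Unless the context clearly indicates otherwise, the singular forms of the words used herein and in the appended claims include the plural, and vice versa. Thus, a reference to the singular generally includes the plural of the corresponding term. Similarly, the words "comprising" and "including" are to be interpreted inclusively rather than exclusively. Likewise, the terms "include" and "or" should be interpreted as inclusive unless such an interpretation is expressly prohibited herein. Where the term "example" is used herein, particularly when it follows a group of terms, the "example" is merely exemplary and illustrative and should not be considered exclusive or exhaustive.
Further aspects and areas of applicability will become apparent from the description provided herein. It should be understood that the various aspects of this application may be implemented alone or in combination with one or more other aspects. It should also be understood that the description and specific embodiments herein are intended for purposes of illustration only and are not intended to limit the scope of this application.
Several embodiments of the present disclosure have been described in detail above. A person skilled in the art can evidently make various modifications and variations to the embodiments of the present disclosure without departing from its spirit and scope. The scope of protection of the present disclosure is defined by the appended claims.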

Claims (27)

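  1. A method for identifying a medical entity in a medical text, comprising:
    dividing the medical text into a plurality of words;
    taking each of the plurality of words in turn as a target word and performing the following operations on the target word:
    determining a local annotation feature and a global annotation feature of the target word, wherein the local annotation feature includes the target word, and the global annotation feature includes a relationship between the target word and identified medical entities;
    determining a label of the target word from a plurality of candidate labels based on the local annotation feature and the global annotation feature of the target word;
    obtaining, based on the label of the target word, a combination relationship between the target word and words preceding the target word, the combination relationship including combination and non-combination;
    identifying the combined words as a medical entity according to the combination relationship.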
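  2. The method according to claim 1, wherein the plurality of candidate labels comprises:
    a first label configured to indicate that the word is a shared beginning part of a medical entity;
    a second label configured to indicate that the word is an unshared beginning part of a medical entity;
    a third label configured to indicate that the word is a continuing part of a medical entity;
    a fourth label configured to indicate that the word is part of a non-medical entity and that an operation of identifying a medical entity is to be performed; and
    a fifth label configured to indicate that the word is part of a non-medical entity and that the operation of identifying a medical entity is not to be performed.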
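  3. The method according to claim 1 or 2, wherein determining the label of the target word from the plurality of candidate labels based on the local annotation feature and the global annotation feature of the target word comprises:
    calculating, for each of the plurality of candidate labels, a probability that the candidate label is the label of the target word, based on the local annotation feature and the global annotation feature of the target word; and
    determining the candidate label with the largest probability as the label of the target word.
  4. The method according to claim 3, wherein the probability is calculated using a maximum entropy model.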
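  5. The method according to claim 2, wherein obtaining, based on the label of the target word, the combination relationship between the target word and the words preceding the target word comprises:
    if the label is the first label, not combining the label with the label before it;
    if the label is the third label, combining the label with the preceding first label, second label, or third label;
    if the label is the second label:
    determining a combination feature of the target word and a combination feature of the nearest preceding word that has the first label or the second label, wherein a combination feature includes the morphemes contained in the corresponding word;
    calculating a label combination probability and a label non-combination probability based on the combination feature of the target word and the combination feature of the nearest preceding word that has the first label or the second label;
    in response to the label combination probability being greater than the label non-combination probability, combining the second label with the preceding first label or second label;
    in response to the label combination probability being not greater than the label non-combination probability, not combining the second label with the preceding first label or second label;
    if the label is the fourth label, not combining the label with the label before it, and performing the operation of identifying a medical entity;
    if the label is the fifth label, not combining the label with the label before it; and
    determining the combination relationship of the words corresponding to the labels according to the combination of the labels.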
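  6. The method according to claim 1, wherein the local annotation feature further includes the X words before the target word and the X words after the target word, where X is a natural number.
  7. The method according to claim 6, wherein the local annotation feature further includes the part of speech of the target word, the parts of speech of the X words before the target word, and the parts of speech of the X words after the target word.
  8. The method according to claim 1, wherein the global annotation feature further includes the relationship between the Y words before the target word and identified medical entities and the relationship between the Y words after the target word and identified medical entities, where Y is a natural number.
  9. The method according to claim 8, wherein the global annotation feature includes: whether the target word is contained in an identified medical entity, whether the Y words before the target word are contained in an identified medical entity, and whether the Y words after the target word are contained in an identified medical entity.
  10. The method according to claim 5, wherein the combination feature further includes the morphemes contained in the Z words before the corresponding word and the morphemes contained in the Z words after the corresponding word, where Z is a natural number.
  11. The method according to claim 10, wherein the combination feature includes: the morphemes contained in the corresponding word, the Z words before the corresponding word, the Z words after the corresponding word, the morphemes contained in the Z words before the corresponding word, and the morphemes contained in the Z words after the corresponding word.
  12. The method according to claim 5, wherein the label combination probability and the label non-combination probability are calculated using a maximum entropy model.
  13. The method according to claim 4 or 12, wherein the maximum entropy model is obtained by training with an optimization algorithm.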
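  14. An apparatus for identifying a medical entity in a medical text, comprising:
    at least one processor; and
    at least one memory storing a computer program;
    wherein the computer program, when executed by the at least one processor, causes the apparatus to:
    divide the medical text into a plurality of words;
    take each of the plurality of words in turn as a target word and perform the following operations on the target word:
    determine a local annotation feature and a global annotation feature of the target word, wherein the local annotation feature includes the target word, and the global annotation feature includes a relationship between the target word and identified medical entities;
    determine a label of the target word from a plurality of candidate labels based on the local annotation feature and the global annotation feature of the target word;
    obtain, based on the label of the target word, a combination relationship between the target word and words preceding the target word, the combination relationship including combination and non-combination;
    identify the combined words as a medical entity according to the combination relationship.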
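  15. The apparatus according to claim 14, wherein the plurality of candidate labels comprises:
    a first label configured to indicate that the word is a shared beginning part of a medical entity;
    a second label configured to indicate that the word is an unshared beginning part of a medical entity;
    a third label configured to indicate that the word is a continuing part of a medical entity;
    a fourth label configured to indicate that the word is part of a non-medical entity and that an operation of identifying a medical entity is to be performed; and
    a fifth label configured to indicate that the word is part of a non-medical entity and that the operation of identifying a medical entity is not to be performed.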
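  16. The apparatus according to claim 14 or 15, wherein the computer program, when executed by the at least one processor, causes the apparatus to determine the label of the target word from the plurality of candidate labels, based on the local annotation feature and the global annotation feature of the target word, by the following operations:
    calculating, for each of the plurality of candidate labels, a probability that the candidate label is the label of the target word, based on the local annotation feature and the global annotation feature of the target word; and
    determining the candidate label with the largest probability as the label of the target word.
  17. The apparatus according to claim 16, wherein the probability is calculated using a maximum entropy model.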
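  18. The apparatus according to claim 15, wherein the computer program, when executed by the at least one processor, causes the apparatus to obtain, based on the label of the target word, the combination relationship between the target word and the words preceding the target word by the following operations:
    if the label is the first label, not combining the label with the label before it;
    if the label is the third label, combining the label with the preceding first label, second label, or third label;
    if the label is the second label:
    determining a combination feature of the target word and a combination feature of the nearest preceding word that has the first label or the second label, wherein a combination feature includes the morphemes contained in the corresponding word;
    calculating a label combination probability and a label non-combination probability based on the combination feature of the target word and the combination feature of the nearest preceding word that has the first label or the second label;
    in response to the label combination probability being greater than the label non-combination probability, combining the second label with the preceding first label or second label;
    in response to the label combination probability being not greater than the label non-combination probability, not combining the second label with the preceding first label or second label;
    if the label is the fourth label, not combining the label with the label before it, and performing the operation of identifying a medical entity;
    if the label is the fifth label, not combining the label with the label before it; and
    determining the combination relationship of the target words corresponding to the labels according to the combination of the labels.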
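  19. The apparatus according to claim 14, wherein the local annotation feature further includes the X words before the target word and the X words after the target word, where X is a natural number.
  20. The apparatus according to claim 19, wherein the local annotation feature further includes the part of speech of the target word, the parts of speech of the X words before the target word, and the parts of speech of the X words after the target word.
  21. The apparatus according to claim 14, wherein the global annotation feature further includes the relationship between the Y words before the target word and identified medical entities and the relationship between the Y words after the target word and identified medical entities, where Y is a natural number.
  22. The method according to claim 21, wherein the global annotation feature includes: whether the target word is contained in an identified medical entity, whether the Y words before the target word are contained in an identified medical entity, and whether the Y words after the target word are contained in an identified medical entity.
  23. The apparatus according to claim 18, wherein the combination feature further includes the morphemes contained in the Z words before the corresponding word and the morphemes contained in the Z words after the corresponding word, where Z is a natural number.
  24. The apparatus according to claim 23, wherein the combination feature includes: the morphemes contained in the corresponding word, the Z words before the corresponding word, the Z words after the corresponding word, the morphemes contained in the Z words before the corresponding word, and the morphemes contained in the Z words after the corresponding word.
  25. The apparatus according to claim 18, wherein the label combination probability and the label non-combination probability are calculated using a maximum entropy model.
  26. The apparatus according to claim 17 or 25, wherein the maximum entropy model is obtained by training with an optimization algorithm.
  27. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method for identifying a medical entity in a medical text according to any one of claims 1 to 13.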
PCT/CN2018/084214 2017-07-20 2018-04-24 用于识别医疗文本中的医疗实体的方法和装置 WO2019015369A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP18826482.4A EP3657359A4 (en) 2017-07-20 2018-04-24 METHOD AND DEVICE FOR IDENTIFYING A MEDICAL UNIT IN A MEDICAL TEXT
JP2018567241A JP7043429B2 (ja) 2017-07-20 2018-04-24 医療テキスト中の医療エンティティを識別するための方法、装置およびコンピュータ読取可能な記憶媒体
US16/316,468 US11586809B2 (en) 2017-07-20 2018-04-24 Method and apparatus for recognizing medical entity in medical text

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710594503.XA CN109284497B (zh) 2017-07-20 2017-07-20 用于识别自然语言的医疗文本中的医疗实体的方法和装置
CN201710594503.X 2017-07-20

Publications (1)

Publication Number Publication Date
WO2019015369A1 true WO2019015369A1 (zh) 2019-01-24

Family

ID=65015597

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/084214 WO2019015369A1 (zh) 2017-07-20 2018-04-24 用于识别医疗文本中的医疗实体的方法和装置

Country Status (5)

Country Link
US (1) US11586809B2 (zh)
EP (1) EP3657359A4 (zh)
JP (1) JP7043429B2 (zh)
CN (1) CN109284497B (zh)
WO (1) WO2019015369A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11755680B2 (en) * 2020-01-22 2023-09-12 Salesforce, Inc. Adaptive recognition of entities
US20220246257A1 (en) * 2021-02-03 2022-08-04 Accenture Global Solutions Limited Utilizing machine learning and natural language processing to extract and verify vaccination data
CN115983270A (zh) * 2022-12-02 2023-04-18 重庆邮电大学 一种电商商品属性智能抽取方法

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105894088A (zh) * 2016-03-25 2016-08-24 苏州赫博特医疗信息科技有限公司 基于深度学习及分布式语义特征医学信息抽取系统及方法
CN106202054A (zh) * 2016-07-25 2016-12-07 哈尔滨工业大学 一种面向医疗领域基于深度学习的命名实体识别方法
CN106776711A (zh) * 2016-11-14 2017-05-31 浙江大学 一种基于深度学习的中文医学知识图谱构建方法
CN106844351A (zh) * 2017-02-24 2017-06-13 黑龙江特士信息技术有限公司 一种面向多数据源的医疗机构组织类实体识别方法及装置

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7865356B2 (en) * 2004-07-15 2011-01-04 Robert Bosch Gmbh Method and apparatus for providing proper or partial proper name recognition
JP2007094459A (ja) * 2005-09-27 2007-04-12 Honda Motor Co Ltd 車両緊急事態対処システムおよびプログラム
US20090249182A1 (en) 2008-03-31 2009-10-01 Iti Scotland Limited Named entity recognition methods and apparatus
US8874432B2 (en) * 2010-04-28 2014-10-28 Nec Laboratories America, Inc. Systems and methods for semi-supervised relationship extraction
US8457950B1 (en) * 2012-11-01 2013-06-04 Digital Reasoning Systems, Inc. System and method for coreference resolution
US9129013B2 (en) * 2013-03-12 2015-09-08 Nuance Communications, Inc. Methods and apparatus for entity detection
JP6502016B2 (ja) * 2014-02-21 2019-04-17 キヤノンメディカルシステムズ株式会社 医療情報処理システムおよび医療情報処理装置
US10754925B2 (en) * 2014-06-04 2020-08-25 Nuance Communications, Inc. NLU training with user corrections to engine annotations
US10509814B2 (en) * 2014-12-19 2019-12-17 Universidad Nacional De Educacion A Distancia (Uned) System and method for the indexing and retrieval of semantically annotated data using an ontology-based information retrieval model
CN104965992B (zh) * 2015-07-13 2018-01-09 南开大学 一种基于在线医疗问答信息的文本挖掘方法
CN107515851B (zh) * 2016-06-16 2021-09-10 佳能株式会社 用于共指消解、信息提取以及相似文档检索的装置和方法
CN106446526B (zh) * 2016-08-31 2019-11-15 北京千安哲信息技术有限公司 电子病历实体关系抽取方法及装置
CN114817386A (zh) * 2016-09-28 2022-07-29 医渡云(北京)技术有限公司 一种结构化医疗数据生成方法及装置
CN106845061A (zh) * 2016-11-02 2017-06-13 百度在线网络技术(北京)有限公司 智能问诊系统和方法
CN106844723B (zh) * 2017-02-10 2019-09-10 厦门大学 基于问答系统的医学知识库构建方法
US10380259B2 (en) * 2017-05-22 2019-08-13 International Business Machines Corporation Deep embedding for natural language content based on semantic dependencies

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3657359A4 *

Also Published As

Publication number Publication date
CN109284497B (zh) 2021-01-12
JP7043429B2 (ja) 2022-03-29
US20220300710A9 (en) 2022-09-22
US11586809B2 (en) 2023-02-21
JP2020527762A (ja) 2020-09-10
CN109284497A (zh) 2019-01-29
US20210342539A1 (en) 2021-11-04
EP3657359A1 (en) 2020-05-27
EP3657359A4 (en) 2021-04-14

Similar Documents

Publication Publication Date Title
WO2021000676A1 (zh) 问答方法、问答装置、计算机设备及存储介质
US20210326751A1 (en) Adversarial pretraining of machine learning models
TWI738270B (zh) 將文句短語映射至知識分類表之方法及系統
WO2019007288A1 (zh) 一种风险地址识别方法、装置以及电子设备
JP4769031B2 (ja) 言語モデルを作成する方法、かな漢字変換方法、その装置、コンピュータプログラムおよびコンピュータ読み取り可能な記憶媒体
US20110314024A1 (en) Semantic content searching
WO2021012519A1 (zh) 基于人工智能的问答方法、装置、计算机设备及存储介质
WO2021139247A1 (zh) 医学领域知识图谱的构建方法、装置、设备及存储介质
CN110991168A (zh) 同义词挖掘方法、同义词挖掘装置及存储介质
CN110162771B (zh) 事件触发词的识别方法、装置、电子设备
JP7464800B2 (ja) 小サンプル弱ラベル付け条件での医療イベント認識方法及びシステム
WO2019015369A1 (zh) 用于识别医疗文本中的医疗实体的方法和装置
US20220147835A1 (en) Knowledge graph construction system and knowledge graph construction method
WO2022174496A1 (zh) 基于生成模型的数据标注方法、装置、设备及存储介质
US20180247016A1 (en) Systems and methods for providing assisted local alignment
JP2022109836A (ja) テキスト分類情報の半教師あり抽出のためのシステム及び方法
WO2019201024A1 (zh) 用于更新模型参数的方法、装置、设备和存储介质
CN115238026A (zh) 一种基于深度学习的医疗文本主题分割方法和装置
CN115713078A (zh) 知识图谱构建方法、装置、存储介质及电子设备
US11556706B2 (en) Effective retrieval of text data based on semantic attributes between morphemes
WO2023116572A1 (zh) 一种词句生成方法及相关设备
Fotseu et al. GenNER-A highly scalable and optimal NER method for text-based gene and protein recognition
CN111507109A (zh) 电子病历的命名实体识别方法及装置
JP2019074982A (ja) 情報検索装置、検索処理方法、およびプログラム
US20200125804A1 (en) Non-transitory computer readable recording medium, semantic vector generation method, and semantic vector generation device

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2018567241

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18826482

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2018826482

Country of ref document: EP

Effective date: 20200220