CN115270792A - Medical entity identification method and device - Google Patents


Info

Publication number
CN115270792A
CN115270792A (application CN202210795182.0A)
Authority
CN
China
Prior art keywords
representation
matrix
entity
text
calculating
Prior art date
Legal status
Pending
Application number
CN202210795182.0A
Other languages
Chinese (zh)
Inventor
王亦宁
刘升平
梁家恩
Current Assignee
Unisound Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd
Priority to CN202210795182.0A
Publication of CN115270792A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a medical entity recognition method, which comprises: acquiring an entity to be recognized, marking the entity and its entity label with a special symbol, and constructing an output template of a text generation model from the entity and the entity label; constructing the input and output of the text generation model, where the input is a text sequence to be recognized together with a first matrix obtained by preprocessing the text to be recognized, and the output is a recognition result together with a second matrix obtained by preprocessing the recognition result, the recognition result being displayed according to the output template; encoding the first matrix with an encoder to obtain an encoded representation of the text sequence to be recognized; calculating the encoded representation with a decoder to obtain a decoded representation; and training the text generation model on the encoded and decoded representations to obtain a final decoded representation.

Description

Medical entity identification method and device
Technical Field
The invention relates to the technical field of data processing, in particular to a medical entity identification method and device.
Background
Medical entity recognition generally uses a sequence labeling method: a BME label is defined for each character, marking respectively the beginning, middle, and end characters of an entity, while an O label marks characters outside any entity. A neural network model is then trained to fit the label of each element, the prediction results are post-processed, and the BME labels are merged to obtain the final extraction result.
The prior art has the following problems: when the sequence labeling method is used, the text granularity must be characters, and the method cannot handle the recognition of discontinuous medical entities or nested medical entities.
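The BME decoding step described above can be sketched as follows; the tag scheme, the entity-type abbreviations, and the example characters are illustrative assumptions, not taken from the patent:

```python
# Merge BME/O character tags into entity spans (illustrative tag scheme: "B-xxx",
# "M-xxx", "E-xxx" for entity begin/middle/end, "O" for non-entity characters).
def merge_bme(chars, tags):
    spans, buf, label = [], [], None
    for ch, tag in zip(chars, tags):
        if tag == "O":
            buf, label = [], None
            continue
        pos, lab = tag.split("-")
        if pos == "B":                      # entity begins: start a new buffer
            buf, label = [ch], lab
        elif pos in ("M", "E") and lab == label:
            buf.append(ch)
            if pos == "E":                  # entity ends: emit the merged span
                spans.append(("".join(buf), lab))
                buf, label = [], None
    return spans

# "发热三天伴腹泻" = "fever for three days with diarrhea"
result = merge_bme(list("发热三天伴腹泻"),
                   ["B-sym", "E-sym", "B-time", "E-time", "O", "B-sym", "E-sym"])
```

The sketch also illustrates the stated limitation: a contiguous B…M…E run cannot represent a discontinuous or nested entity.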
Disclosure of Invention
The invention aims to provide a medical entity recognition method and device to solve the problems in the prior art that, when a sequence labeling method is used, the text granularity must be characters and the method cannot handle the recognition of discontinuous medical entities or nested medical entities.
the invention provides a medical entity identification method in a first aspect, which comprises the following steps:
acquiring an entity to be identified, labeling the entity and an entity label through a special symbol, and constructing an output template of a text generation model according to the entity and the entity label;
constructing input and output of the text generation model; the input is a text sequence to be recognized and a first matrix, and the first matrix is obtained after preprocessing the text to be recognized; the output is an identification result and a second matrix, the second matrix is obtained after preprocessing the identification result, and the identification result is displayed according to the output template;
encoding the first matrix through an encoder to obtain an encoded representation of the text sequence to be recognized; calculating the encoded representation through a decoder to obtain a decoded representation;
and training the text generation model according to the coding representation and the decoding representation to obtain a final decoding representation.
In one possible implementation, the first matrix is determined according to the following method:
and preprocessing the text sequence to be recognized through a pre-training language model BART to obtain a first matrix.
In a possible implementation, encoding the text sequence to be recognized by the encoder to obtain the encoded representation of the text sequence to be recognized specifically comprises:
calculating the encoded representation of each word in the text sequence to be recognized by the formula

h_t^n = Self_enc(h^{n-1}, v_t)

where h_t^n denotes the encoded representation of the t-th word in the n-th layer.
in a possible implementation manner, the training the text generation model according to the encoded representation and the decoded representation to obtain a final decoded representation specifically includes:
calculating the decoding representation of each word through a first function to obtain a generation probability;
performing matrix transformation on the decoded representation to obtain a first matrix transformation result;
performing matrix transformation on the coded representation to obtain a second matrix transformation result;
calculating the score of a copying mechanism according to the first matrix conversion result and the second matrix conversion result;
calculating a balance factor according to the first matrix conversion result and the second matrix conversion result;
calculating a fusion score according to the balance factor, the score and the generation probability;
determining a word corresponding to the maximum probability as a generation result according to the fusion score;
sequentially combining the generated results of each word to obtain a final decoding representation;
and extracting the recognition result according to the special symbol.
In a possible implementation, calculating the decoded representation of each word through the first function to obtain the generation probability specifically comprises:
applying a linear transformation to the decoded representation through the first function to obtain a linear-transformation result;
and calculating the probability distribution from the linear-transformation result.
In a possible implementation, encoding the first matrix by the encoder to obtain the encoded representation of the text sequence to be recognized, and calculating the encoded representation by the decoder to obtain the decoded representation, specifically comprises:
the encoded representation is calculated by the formula

h_t^n = Self_enc(h^{n-1}, v_t)

where h_t^n denotes the encoded representation of the t-th word sequence in the n-th layer; the topmost encoded representation is h^N, which denotes the encoded representations of all words in the N-th layer, and v_t denotes the input to the encoder at time t;
the decoded representation is calculated by the formula

s_t^n = Self_dec(h^N, s^{n-1}, u_t)

where h^N denotes the hidden state obtained by the encoder, s_t^n is the decoded representation of the t-th word sequence in the n-th layer, and u_t denotes the input to the decoder at time t.
In a second aspect, the present invention provides a medical entity identification apparatus, the apparatus comprising:
the system comprises an acquisition module, a recognition module and a recognition module, wherein the acquisition module is used for acquiring an entity to be recognized and marking the entity and an entity label through a special symbol;
the output template building module is used for building an output template of a text generation model according to the entity and the entity label;
an input output construction module for constructing input and output of the text generation model; the input is a text sequence to be recognized and a first matrix, and the first matrix is obtained after preprocessing the text to be recognized; the output is an identification result and a second matrix, the second matrix is obtained after the identification result is preprocessed, and the identification result is displayed according to the output template;
the coding and decoding module is used for coding the first matrix through a coder to obtain the coded representation of the text sequence to be identified; calculating the encoded representation by a decoder to obtain a decoded representation;
and the model training module is used for training the text generation model according to the coding representation and the decoding representation to obtain a final decoding representation.
In a third aspect, the present invention provides a chip system comprising a processor coupled to a memory, the memory storing program instructions, which when executed by the processor implement the medical entity identification method of any one of the first aspect.
In a fourth aspect, the present invention provides a computer readable storage medium having a computer program stored thereon, the computer program being executed by a processor to perform the medical entity identification method of any one of the first aspect.
In a fifth aspect, the invention provides a computer program product for causing a computer to perform the method of identifying a medical entity according to any one of the first aspect when the computer program product is run on the computer.
By applying the entity recognition method provided by the invention, entity recognition is modeled as a text generation task through template construction, breaking the barrier of performing recognition as a sequence labeling task. The method also integrates the copy mechanism of a pointer network, so that entities in the original sentence can be copied directly into the template, which solves the recognition of discontinuous and nested medical entities.
Drawings
Fig. 1 is a schematic flow chart of a medical entity identification method according to an embodiment of the present invention;
FIG. 2 is a diagram of a source sentence and a result;
FIG. 3 is a flowchart of step 140 of FIG. 1;
fig. 4 is a schematic structural diagram of a medical entity identification apparatus according to a second embodiment of the present invention;
fig. 5 is a schematic structural diagram of a chip system according to a third embodiment of the present invention;
FIG. 6 is a schematic diagram of a computer-readable storage medium provided in a fourth embodiment of the present invention;
fig. 7 is a schematic diagram of a computer program product according to a fifth embodiment of the present invention.
Detailed Description
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments.
The terms "first," "second," and the like in the description and claims of the present application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may alternatively include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example one
Fig. 1 is a schematic flow chart of a medical entity identification method according to an embodiment of the present invention; the method is applied to scenarios in which a medical entity is to be identified and, as shown in Fig. 1, comprises the following steps:
step 110, acquiring an entity to be identified, labeling the entity and an entity label through a special symbol, and constructing an output template of a text generation model according to the entity and the entity label;
specifically, the entity to be identified is marked by using brackets, and tag information of the entity type is added inside the brackets. Referring to fig. 2, entities include symptoms and time, entity labels are fever, three days, and diarrhea in sequence, fever and diarrhea in the original sentence are marked as symptom labels in the output template, three days in the original sentence are marked as time labels, a special symbol is a bracket, and the labels can be marked using the bracket: the entities are put together so that the recognition result can be obtained by extracting the content in the brackets later.
Step 120, constructing the input and output of the text generation model; the input is a text sequence to be recognized and a first matrix, where the first matrix is obtained by preprocessing the text to be recognized; the output is a recognition result and a second matrix, where the second matrix is obtained by preprocessing the recognition result, and the recognition result is displayed according to the output template;
specifically, in step 110, the structure of the output template of the text generation model is described, and in step 120, the input and output of the text generation model are described.
For example, X = [x_1, x_2, ..., x_n] denotes the input word sequence of the text to be recognized, and V = [v_1, v_2, ..., v_n] denotes the first matrix obtained by preprocessing the word sequence of the text to be recognized with the generative pre-trained language model BART.
The output of the text generation model is the result part in Fig. 2 and can be denoted Y = [y_1, y_2, ..., y_m]; preprocessing Y with the pre-trained model BART yields the second matrix at the output end, U = [u_1, u_2, ..., u_m]. Since Y and X partially overlap, an output result can be expressed as entity_i = (name: x_i ... x_{i+k}), where x_i ... x_{i+k} are the words of X that appear in Y, and name denotes the entity name, i.e., the entity described above.
Step 130, encoding the first matrix through an encoder to obtain an encoded representation of the text sequence to be recognized; calculating the encoded representation through a decoder to obtain a decoded representation;
Specifically, an encoder is used to encode X to obtain an encoded representation of the input sequence information. Define Self_enc(·) as the encoder computation unit based on the self-attention mechanism; the encoded representation of each word after passing through the encoder can be calculated by the following formula:

h_t^n = Self_enc(h^{n-1}, v_t)

where h_t^n denotes the encoded representation of the t-th word sequence in the n-th layer produced by the encoder. The topmost encoded representation h^N can thus be obtained; h^N denotes the encoded representations of all words in the N-th layer, and v_t denotes the input to the encoder at time t, for example a vector in the first matrix.
The decoder network relies on h^N and an attention-mechanism module to obtain the decoded representation. Define Self_dec(·) as the decoder computation unit based on self-attention; the hidden state s_t^n output by the decoder at time t is calculated by the following formula:

s_t^n = Self_dec(h^N, s^{n-1}, u_t)

where h^N denotes the hidden state obtained by the encoder (a hidden state is either an encoded representation or a decoded representation: the hidden state obtained by the encoder is the encoded representation, and the hidden state obtained by the decoder is the decoded representation), s_t^n is the decoded representation of the t-th word sequence in the n-th layer, and u_t denotes the input to the decoder at time t, for example a vector in the second matrix.
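The layered encoder/decoder recursion described above can be sketched with plain NumPy. This is a toy single-head attention without learned projections, masking, or feed-forward blocks, so all shapes, layer counts, and names are illustrative assumptions, not the patent's actual network:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attn(q_states, kv_states):
    # Single-head scaled dot-product attention; weight matrices omitted for brevity.
    d = q_states.shape[-1]
    attn = softmax(q_states @ kv_states.T / np.sqrt(d))
    return attn @ kv_states

def encode(V, num_layers=2):
    # h^n = Self_enc(h^{n-1}, v_t): each layer attends over the previous layer.
    h = V
    for _ in range(num_layers):
        h = self_attn(h, h)
    return h  # topmost encoded representation h^N

def decode(hN, U, num_layers=2):
    # s^n = Self_dec(h^N, s^{n-1}, u_t): decoder states attend over encoder output.
    s = U
    for _ in range(num_layers):
        s = self_attn(s, hN)
    return s  # topmost decoded representation

rng = np.random.default_rng(0)
V = rng.normal(size=(5, 8))   # first matrix: 5 input words, hidden size 8
U = rng.normal(size=(3, 8))   # second matrix: 3 decoder inputs
hN = encode(V)
sN = decode(hN, U)
```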
Step 140, training the text generation model according to the coding representation and the decoding representation to obtain a final decoding representation.
Therein, referring to fig. 3, step 140 comprises the steps of:
Step 1401, calculating the decoded representation of each word through a first function to obtain the generation probability;
the decoded representation is linearly transformed through the first function to obtain a linear-transformation result, and the probability distribution is calculated from that result.
Specifically, the topmost hidden state output by the decoder, s_t^N, is passed through a layer of linear transformation:

O_t = Linear(s_t^N)

where O_t is the input representation of the softmax layer; the first function is the softmax function.
For the linear-transformation result O_t, the probability distribution over the target vocabulary set Z at each time t is output through the first function softmax. Here the target vocabulary set Z is the candidate set of all words the model may generate, and softmax computes the probability distribution over this candidate set:

Prob_gen = softmax(W · O_t + b)

where Prob_gen is the probability distribution, W and b are trainable parameters of the model, and the output dimension of W matches the size of the vocabulary set Z.
Step 1402, performing a matrix transformation on the decoded representation to obtain a first matrix-transformation result;
specifically, the current hidden-layer state of the decoder, s_t^N, is matrix-transformed to obtain the first matrix-transformation result, as shown in the following formula:

q_t = W_q · s_t^N

where q_t is the first matrix-transformation result and W_q is a trainable first parameter matrix used to perform the first matrix transformation.
Step 1403, performing a matrix transformation on the encoded representation to obtain a second matrix-transformation result;
specifically, the topmost hidden-layer state of the encoder, h^N, is matrix-transformed to obtain the second matrix-transformation result, as shown in the following formulas:

K = W_k · h^N
V = W_v · h^N

where K and V are the second matrix-transformation results, W_k is a trainable second parameter matrix used for the second matrix transformation, and W_v is a trainable third parameter matrix used for the second matrix transformation.
Step 1404, calculating the score of the copy mechanism according to the first and second matrix-transformation results;
specifically, the copy-mechanism score is calculated from the obtained q_t, K and V:

Prob_copy = softmax(q_t · K^T)

where Prob_copy is the score.
Step 1405, calculating the balance factor according to the first and second matrix-transformation results;
the balance factor is calculated by the following formula:

λ_t = sigmoid(W^T · (q_t + K + V))

where the q_t, K and V obtained from each calculation are summed, then multiplied by W^T, and the balance factor λ_t is obtained through the second function, sigmoid; W^T is a trainable transformation matrix.
Step 1406, calculating the fusion score according to the balance factor, the score and the generation probability;
specifically, the fusion score is calculated according to the following formula:

Prob_final = λ_t · Prob_gen + (1 − λ_t) · Prob_copy

where Prob_final is the fusion score, i.e., the fusion score finally obtained for each word sequence.
Step 1407, determining the word corresponding to the maximum probability as the generation result according to the fusion score;
specifically, the word corresponding to the maximum probability is selected as the generation result at time t, as shown in the following formula:

y_t = argmax(Prob_final)

where y_t is the generation result at time t.
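Steps 1401 to 1407 can be sketched end to end with NumPy. The patent renders its exact formulas only as image placeholders, so the linear layers, the dot-product copy score, the sum-based balance factor, and the scatter of copy mass onto vocabulary ids below are reconstructions from the prose, with illustrative dimensions and random parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
d, src_len, vocab = 8, 5, 20

s_tN = rng.normal(size=d)             # topmost decoder hidden state s_t^N
hN = rng.normal(size=(src_len, d))    # topmost encoder states h^N

W_o = rng.normal(size=(d, d))         # linear layer producing O_t
W = rng.normal(size=(vocab, d))       # vocabulary projection
b = np.zeros(vocab)
W_q = rng.normal(size=(d, d))         # first parameter matrix
W_k = rng.normal(size=(d, d))         # second parameter matrix
W_v = rng.normal(size=(d, d))         # third parameter matrix
w_bal = rng.normal(size=d)            # balance-factor transformation

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

O_t = W_o @ s_tN
prob_gen = softmax(W @ O_t + b)            # step 1401: generation probability

q_t = W_q @ s_tN                           # step 1402: first transformation result
K = hN @ W_k.T                             # step 1403: second transformation results
V = hN @ W_v.T
prob_copy = softmax(K @ q_t)               # step 1404: copy score over source words

# Step 1405: balance factor from the summed q_t, K, V through a sigmoid.
lam = 1.0 / (1.0 + np.exp(-w_bal @ (q_t + K.sum(axis=0) + V.sum(axis=0))))

# Step 1406: fuse generation and copy distributions; copy mass is scattered onto
# the (illustrative) vocabulary ids of the source words.
src_ids = rng.integers(0, vocab, size=src_len)
prob_final = lam * prob_gen
np.add.at(prob_final, src_ids, (1.0 - lam) * prob_copy)

y_t = int(np.argmax(prob_final))           # step 1407: generation result at time t
```

Because both component distributions sum to one, the fused distribution also sums to one regardless of the balance factor's value.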
Step 1408, combining the generated results of each word in turn to obtain a final decoded representation;
and step 1409, extracting the identification result according to the special symbol.
Specifically, following steps 1401 to 1407, the generation results of multiple words are obtained in sequence; combining these generation results yields the final decoded representation, which can be denoted Y = [y_1, y_2, ..., y_m], where y_1, y_2, ..., y_m are the generation results obtained in order according to steps 1401 to 1407. From this final decoded representation, the corresponding recognition result entity_i can be extracted through the brackets.
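The bracket extraction of step 1409 can be sketched with a regular expression; the "(label: entity)" surface form and the English example are illustrative assumptions, since the patent shows its template format only in figures:

```python
import re

# Pull (label, entity) pairs back out of the generated template string,
# using the brackets as the special symbol delimiting each recognition result.
def extract_entities(decoded):
    return re.findall(r"\(([^:()]+):\s*([^()]+)\)", decoded)

result = extract_entities("(symptom: fever) (time: three days) (symptom: diarrhea)")
```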
By applying the entity recognition method provided by the invention, entity recognition is modeled as a text generation task through template construction, breaking the barrier of performing recognition as a sequence labeling task. The method also integrates the copy mechanism of a pointer network, so that entities in the original sentence can be copied directly into the template, which solves the recognition of discontinuous and nested medical entities.
Example two
An embodiment of the present invention provides a medical entity identification apparatus, as shown in fig. 4, the apparatus includes: the system comprises an acquisition module 410, an output template construction module 420, an input and output construction module 430, a coding and decoding module 440 and a model training module 450.
The obtaining module 410 is configured to obtain an entity to be identified, and label the entity and an entity tag through a special symbol;
the output template building module 420 is used for building an output template of the text generation model according to the entity and the entity label;
the input and output construction module 430 is used for constructing the input and output of the text generation model; the input is a text sequence to be identified and a first matrix, and the first matrix is obtained after the text to be identified is preprocessed; the output is an identification result and a second matrix, the second matrix is obtained after the identification result is preprocessed, and the identification result is displayed according to an output template;
the encoding and decoding module 440 is configured to encode the first matrix through an encoder to obtain an encoded representation of the text sequence to be identified; calculating the encoded representation by a decoder to obtain a decoded representation;
the model training module 450 is configured to train the text generation model according to the encoded representation and the decoded representation to obtain a final decoded representation.
Further, the input output construction module 430 determines the first matrix according to the following method: and preprocessing the text sequence to be recognized through a pre-training language model BART to obtain a first matrix.
Further, the encoding and decoding module 440 encoding the text sequence to be recognized through the encoder to obtain the encoded representation of the text sequence to be recognized specifically comprises:
calculating the encoded representation of each word in the text sequence to be recognized by the formula

h_t^n = Self_enc(h^{n-1}, v_t)

where h_t^n denotes the encoded representation of the t-th word in the n-th layer.
further, the training of the text generation model by the model training module 450 according to the encoding representation and the decoding representation to obtain the final decoding representation specifically includes: calculating the decoding representation of each word through a first function to obtain a generation probability; performing matrix transformation on the decoding representation to obtain a first matrix transformation result; performing matrix transformation on the coded representation to obtain a second matrix transformation result; calculating the score of a copying mechanism according to the first matrix conversion result and the second matrix conversion result; calculating a balance factor according to the first matrix conversion result and the second matrix conversion result; calculating a fusion score according to the balance factor, the score and the generation probability; determining a word corresponding to the maximum probability as a generation result according to the fusion score; sequentially combining the generated results of each word to obtain a final decoding representation; and extracting the recognition result according to the special symbol.
Further, the model training module 450 calculating the decoded representation of each word through the first function to obtain the generation probability specifically comprises: applying a linear transformation to the decoded representation through the first function to obtain a linear-transformation result, and calculating the probability distribution from the linear-transformation result.
Further, the encoding and decoding module 440 encoding the first matrix through the encoder to obtain the encoded representation of the text sequence to be recognized, and calculating the encoded representation through the decoder to obtain the decoded representation, specifically comprises:
the encoded representation is calculated by the formula

h_t^n = Self_enc(h^{n-1}, v_t)

where h_t^n denotes the encoded representation of the t-th word sequence in the n-th layer; the topmost encoded representation is h^N, which denotes the encoded representations of all words in the N-th layer, and v_t denotes the input to the encoder at time t;
the decoded representation is calculated by the formula

s_t^n = Self_dec(h^N, s^{n-1}, u_t)

where h^N denotes the hidden state obtained by the encoder, s_t^n is the decoded representation of the t-th word sequence in the n-th layer, and u_t denotes the input to the decoder at time t.
The apparatus provided in the second embodiment of the present invention can execute the method steps in the first embodiment of the method, and the implementation principle and the technical effect are similar, which are not described herein again.
It should be noted that the division of the modules of the above apparatus is only a logical division, and the actual implementation may be wholly or partially integrated into one physical entity, or may be physically separated. And these modules can all be implemented in the form of software invoked by a processing element; or can be realized in a hardware mode completely; and part of the modules can be realized in the form of calling software by the processing element, and part of the modules can be realized in the form of hardware. For example, the determining module may be a processing element separately set up, or may be implemented by being integrated in a chip of the apparatus, or may be stored in a memory of the apparatus in the form of program code, and may be called by a processing element of the apparatus to execute the functions of the determining module. Other modules are implemented similarly. In addition, all or part of the modules can be integrated together or can be independently realized. The processing element described herein may be an integrated circuit having signal processing capabilities. In implementation, the steps of the above method or the above modules may be implemented by hardware integrated logic circuits in a processor element or instructions in software.
For example, the above modules may be one or more integrated circuits configured to implement the above methods, such as one or more Application-Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), or one or more Field-Programmable Gate Arrays (FPGAs). For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor that can call program code. As another example, these modules may be integrated together and implemented in the form of a System-on-a-Chip (SoC).
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions described in accordance with the embodiments of the application occur wholly or partially upon loading and execution of the computer program instructions on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, bluetooth, microwave, etc.) means.
EXAMPLE III
A third embodiment of the present invention provides a chip system, as shown in fig. 5, which includes a processor, where the processor is coupled to a memory, and the memory stores program instructions, and when the program instructions stored in the memory are executed by the processor, the chip system implements any one of the medical entity identification methods provided in the first embodiment.
Embodiment Four
A fourth embodiment of the present invention provides a computer-readable storage medium, as shown in Fig. 6, comprising a program or instructions; when the program or instructions are run on a computer, any of the medical entity identification methods provided in the first embodiment is implemented.
Embodiment Five
A fifth embodiment of the present invention provides a computer program product comprising instructions, as shown in Fig. 7, which, when run on a computer, cause the computer to perform any of the medical entity identification methods provided in the first embodiment.
Those skilled in the art will further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), flash memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above embodiments further describe in detail the objects, technical solutions, and advantages of the present invention. It should be understood that the above embodiments are merely examples of the present invention and are not intended to limit its scope; any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the present invention shall be included in the scope of the present invention.

Claims (10)

1. A medical entity identification method, the method comprising:
acquiring an entity to be identified, labeling the entity and an entity label through a special symbol, and constructing an output template of a text generation model according to the entity and the entity label;
constructing an input and an output of the text generation model, wherein the input is a text sequence to be recognized and a first matrix, the first matrix being obtained by preprocessing the text to be recognized, and the output is an identification result and a second matrix, the second matrix being obtained by preprocessing the identification result, the identification result being displayed according to the output template;
coding the first matrix through a coder to obtain a coded representation of a text sequence to be recognized; calculating the encoded representation by a decoder to obtain a decoded representation;
and training the text generation model according to the coding representation and the decoding representation to obtain a final decoding representation.
2. The method of claim 1, wherein the first matrix is determined according to the following method:
preprocessing the text sequence to be recognized through the pre-trained language model BART to obtain the first matrix.
3. The method according to claim 1, wherein encoding the text sequence to be recognized by the encoder to obtain the encoded representation of the text sequence to be recognized specifically comprises:
by the formula
Figure FDA0003735481790000011
Calculating the code representation of each word in the text sequence to be recognized;
wherein
Figure FDA0003735481790000012
denotes the encoded representation of the t-th word in the n-th layer.
4. The method of claim 1, wherein training the text generation model according to the encoded representation and the decoded representation to obtain the final decoded representation specifically comprises:
calculating the decoded representation of each word through a first function to obtain a generation probability;
performing a matrix transformation on the decoded representation to obtain a first matrix transformation result;
performing a matrix transformation on the encoded representation to obtain a second matrix transformation result;
calculating a copy-mechanism score according to the first matrix transformation result and the second matrix transformation result;
calculating a balance factor according to the first matrix transformation result and the second matrix transformation result;
calculating a fusion score according to the balance factor, the copy-mechanism score, and the generation probability;
determining the word corresponding to the maximum probability as the generation result according to the fusion score;
sequentially combining the generation results of all words to obtain the final decoded representation;
and extracting the identification result according to the special symbol.
5. The method of claim 4, wherein calculating the decoded representation of each word through the first function to obtain the generation probability specifically comprises:
applying a linear transformation to the decoded representation through the first function to obtain a linear transformation result;
and calculating a probability distribution according to the linear transformation result.
6. The method according to claim 1, wherein encoding the first matrix by the encoder to obtain the encoded representation of the text sequence to be recognized and calculating the encoded representation by the decoder to obtain the decoded representation specifically comprises:
the encoded representation is calculated by the formula
Figure FDA0003735481790000021
wherein
Figure FDA0003735481790000022
denotes the encoded representation of the t-th word in the n-th layer; hN denotes the top-layer encoded representation, i.e., the encoded representations of all words in the N-th layer, and vt denotes the input of the encoder at time t;
the decoded representation is calculated by the formula
Figure FDA0003735481790000023
wherein hN denotes the hidden state obtained by the encoder,
Figure FDA0003735481790000024
denotes the decoded representation of the t-th word in the n-th layer, and ut denotes the input of the decoder at time t.
7. A medical entity identification apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring an entity to be identified and labeling the entity and an entity label through a special symbol;
the output template construction module is used for constructing an output template of a text generation model according to the entity and the entity label;
an input-output construction module for constructing an input and an output of the text generation model, wherein the input is a text sequence to be recognized and a first matrix, the first matrix being obtained by preprocessing the text to be recognized, and the output is an identification result and a second matrix, the second matrix being obtained by preprocessing the identification result, the identification result being displayed according to the output template;
the coding and decoding module is used for coding the first matrix through a coder to obtain the coded representation of the text sequence to be identified; calculating the encoded representation by a decoder to obtain a decoded representation;
and the model training module is used for training the text generation model according to the coding representation and the decoding representation to obtain a final decoding representation.
8. A chip system comprising a processor coupled to a memory, the memory storing program instructions that, when executed by the processor, implement the medical entity identification method of any of claims 1-6.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, performs the medical entity identification method according to any one of claims 1-6.
10. A computer program product, characterized in that, when the computer program product is run on a computer, it causes the computer to carry out the medical entity identification method according to any one of claims 1-6.
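The generate-versus-copy fusion described in claims 4 and 5 can be illustrated numerically. The sketch below is a minimal numpy toy, not the patented implementation: the weight matrices are random stand-ins for learned parameters, the source-token vocabulary ids are arbitrary, and the sigmoid gate used for the balance factor is one plausible reading of the claim, since the patent does not publish the exact formula.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: hidden size d, vocabulary size V, source length L.
d, V, L = 8, 20, 5

dec_t = rng.normal(size=d)        # decoded representation of the current word
enc = rng.normal(size=(L, d))     # encoded representations of the source sequence

# Hypothetical learned parameters (random stand-ins for illustration).
W_gen = rng.normal(size=(V, d))   # "first function": linear map to vocabulary logits
W_dec = rng.normal(size=(d, d))   # matrix transformation of the decoded representation
W_enc = rng.normal(size=(d, d))   # matrix transformation of the encoded representation
w_gate = rng.normal(size=2 * d)   # parameters of the assumed balance-factor gate

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Claim 5: linear transformation of the decoded representation,
# then a probability distribution -> generation probability.
p_gen = softmax(W_gen @ dec_t)

# Claim 4: copy-mechanism score from the two matrix transformation results.
q = W_dec @ dec_t                  # first matrix transformation result
k = enc @ W_enc.T                  # second matrix transformation result
copy_scores = softmax(k @ q)       # attention-style score over source positions

# Scatter the copy scores onto the vocabulary ids of the source words.
src_ids = rng.integers(0, V, size=L)
p_copy = np.zeros(V)
np.add.at(p_copy, src_ids, copy_scores)

# Balance factor: a sigmoid gate over the two transformation results
# (an assumption; the patent only says it is computed from both results).
lam = 1.0 / (1.0 + np.exp(-w_gate @ np.concatenate([q, k.mean(axis=0)])))

# Fusion score, then the word with maximum probability is the generation result.
p_fused = lam * p_gen + (1.0 - lam) * p_copy
best_word = int(np.argmax(p_fused))
```

Because both `p_gen` and the scattered `p_copy` each sum to one, the gated mixture `p_fused` remains a valid probability distribution, which is the usual motivation for this pointer-generator-style fusion.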
CN202210795182.0A 2022-07-07 2022-07-07 Medical entity identification method and device Pending CN115270792A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210795182.0A CN115270792A (en) 2022-07-07 2022-07-07 Medical entity identification method and device

Publications (1)

Publication Number Publication Date
CN115270792A true CN115270792A (en) 2022-11-01

Family

ID=83763546

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210795182.0A Pending CN115270792A (en) 2022-07-07 2022-07-07 Medical entity identification method and device

Country Status (1)

Country Link
CN (1) CN115270792A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116738985A (en) * 2023-08-11 2023-09-12 北京亚信数据有限公司 Standardized processing method and device for medical text
CN116738985B (en) * 2023-08-11 2024-01-26 北京亚信数据有限公司 Standardized processing method and device for medical text

Similar Documents

Publication Publication Date Title
CN111985239B (en) Entity identification method, entity identification device, electronic equipment and storage medium
CN110866401A (en) Chinese electronic medical record named entity identification method and system based on attention mechanism
CN112329465A (en) Named entity identification method and device and computer readable storage medium
CN111767718B (en) Chinese grammar error correction method based on weakened grammar error feature representation
CN110309511B (en) Shared representation-based multitask language analysis system and method
US20180365594A1 (en) Systems and methods for generative learning
CN114580424B (en) Labeling method and device for named entity identification of legal document
CN111428470B (en) Text continuity judgment method, text continuity judgment model training method, electronic device and readable medium
CN113887229A (en) Address information identification method and device, computer equipment and storage medium
CN114492661B (en) Text data classification method and device, computer equipment and storage medium
CN110852066A (en) Multi-language entity relation extraction method and system based on confrontation training mechanism
CN111814479A (en) Enterprise short form generation and model training method and device
CN115270792A (en) Medical entity identification method and device
CN117251545A (en) Multi-intention natural language understanding method, system, equipment and storage medium
CN114707518B (en) Semantic fragment-oriented target emotion analysis method, device, equipment and medium
CN116127978A (en) Nested named entity extraction method based on medical text
CN116362242A (en) Small sample slot value extraction method, device, equipment and storage medium
CN113033155B (en) Automatic coding method for medical concepts by combining sequence generation and hierarchical word lists
CN114925175A (en) Abstract generation method and device based on artificial intelligence, computer equipment and medium
CN114911940A (en) Text emotion recognition method and device, electronic equipment and storage medium
CN115240712A (en) Multi-mode-based emotion classification method, device, equipment and storage medium
CN114372467A (en) Named entity extraction method and device, electronic equipment and storage medium
CN117371447A (en) Named entity recognition model training method, device and storage medium
CN108921911B (en) Method for automatically converting structured picture into source code
CN117932487B (en) Risk classification model training and risk classification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination