CN114781387B

CN114781387B - Medical named entity recognition method and device, electronic equipment and storage medium

Info

Publication number: CN114781387B
Application number: CN202210694213.3A
Authority: CN
Inventors: 张卓仁; 王实; 张奇
Original assignee: Beijing Huimeiyun Technology Co ltd
Current assignee: Beijing Huimeiyun Technology Co ltd
Priority date: 2022-06-20
Filing date: 2022-06-20
Publication date: 2022-09-02
Anticipated expiration: 2042-06-20
Also published as: CN114781387A

Abstract

The application provides a medical named entity recognition method, a device, electronic equipment and a storage medium, which belong to the technical field of natural language processing, and the method comprises the following steps: inputting a Chinese sequence to be predicted into a preset model to obtain a character transfer matrix and a label transfer matrix; determining a plurality of valid tags from the manually labeled named entity identification data; calculating character scores in the character transfer matrix and transfer scores corresponding to a plurality of effective labels in the label transfer matrix to obtain a forward score matrix and a backtracking record matrix; determining the value of each element in the path vector from back to front in sequence based on the value of the last element of the path vector and the value of the target element in the backtracking record matrix; and marking the Chinese sequence to be predicted, and determining a named entity result. By adopting the medical named entity identification method, the medical named entity identification device, the electronic equipment and the storage medium, the problems of long identification time and low identification efficiency in named entity identification are solved.

Description

Medical named entity recognition method and device, electronic equipment and storage medium

Technical Field

The application relates to the technical field of natural language processing, in particular to a medical named entity recognition method, a medical named entity recognition device, electronic equipment and a storage medium.

Background

With the rapid development of the internet, the information available on the network has become increasingly abundant, which makes it increasingly difficult to find useful information quickly and accurately in massive amounts of data. Text on the network is natural language. Because of the compositional structure of Chinese text, understanding a text first requires extracting the features of the words in it, that is, extracting useful structured data from unstructured text. Named entity recognition is the task of extracting proper nouns such as personal names, place names, and organization names from massive natural language texts, so research on it has important significance and value. At present, named entity recognition requires decoding a Conditional Random Field (CRF) layer, and the decoding method is to traverse all paths, calculate the score or probability of each path, and then select the path with the largest score or probability as the predicted path.

However, with this named entity recognition method, if there are N category labels and the sentence to be decoded has length L, then N^L operations are required, which causes the problems of long recognition time and low recognition efficiency.

Disclosure of Invention

In view of the above, an object of the present application is to provide a method, an apparatus, an electronic device and a storage medium for identifying a named entity in medical science, so as to solve the problems of long identification time and low identification efficiency when identifying the named entity.

In a first aspect, an embodiment of the present application provides a medical named entity identification method, including:

inputting the Chinese sequence to be predicted into a preset model to obtain a character transfer matrix and a label transfer matrix, wherein the character transfer matrix is used for representing the probability that characters in the Chinese sequence to be predicted are marked as each named entity label, and the label transfer matrix is used for representing the probability of mutual transfer among the named entity labels;

determining a plurality of valid tags from the manually labeled named entity identification data;

calculating character scores in the character transfer matrix and transfer scores corresponding to a plurality of effective labels in the label transfer matrix to obtain a forward score matrix and a backtracking record matrix;

taking a column number corresponding to the forward fraction with the largest value in the last column of the forward fraction matrix as the value of the last element of the path vector, and sequentially determining the value of each element in the path vector from back to front based on the value of the last element of the path vector and the value of the target element in the backtracking record matrix;

and marking the Chinese sequence to be predicted according to the serial number corresponding to the named entity label in the path vector, and determining the named entity result of the Chinese sequence to be predicted.

Optionally, the performing operation on the character scores in the character transfer matrix and the transfer scores corresponding to a plurality of valid tags in the tag transfer matrix to obtain a forward score matrix and a backtracking record matrix, includes: setting an initial row number as 1; taking the initial line number as a first target line number; determining values of a plurality of elements corresponding to a first target row number in the forward fractional matrix and values of a plurality of elements corresponding to the first target row number in the backtracking record matrix; taking the numerical value obtained by adding 1 to the first target line number as an updated initial line number, and determining whether the updated initial line number is smaller than the target length, wherein the target length is a numerical value which is 2 more than the number of characters of the Chinese sequence to be predicted; if the length is smaller than the target length, returning to execute the step of taking the initial line number as the first target line number.

Optionally, determining values of a plurality of elements corresponding to the first target row number in the forward fractional matrix and values of a plurality of elements corresponding to the first target row number in the backtracking record matrix includes: setting an initial column number as 1; taking the initial column number as a first target column number; determining the sum of values corresponding to the columns where the effective labels are located in the row before the first target row number in the forward score matrix and the values corresponding to the rows where the effective labels are located in each column which is less than or equal to the first target column number in the label transfer matrix as a plurality of candidate derived transfer scores; selecting the candidate derived transfer score with the largest value from the plurality of candidate derived transfer scores as a target derived transfer score; taking the sum of the target derived transfer fraction and the value of the target element in the character transfer matrix as the value of the target element in the forward fraction matrix, wherein the target element is the element corresponding to the first target row number and the first target column number; taking the serial number corresponding to the effective label corresponding to the target derivative transfer score as the value of the target element in the backtracking record matrix; taking the value obtained by adding 1 to the first target column number as an updated initial column number, and determining whether the updated initial column number is smaller than the target label number, wherein the target label number is a value which is 2 more than the number of the set named entity labels; if the number of the target labels is less than the number of the target labels, returning to execute the step of taking the initial column number as the first target column number;

Optionally, taking the column number corresponding to the forward score with the largest value in the last column of the forward score matrix as the value of the last element of the path vector, and sequentially determining the value of each element in the path vector from back to front based on the value of the last element of the path vector and the value of the target element in the backtracking record matrix, includes: taking a value that is 1 more than the number of characters of the Chinese sequence to be predicted as an initial serial number; taking the initial serial number as a second target row number; taking the initial serial number as an element serial number, and taking the value corresponding to the element serial number in the path vector as a second target column number; taking the value of the element corresponding to the second target row number and the second target column number in the backtracking record matrix as the value of the element corresponding to a target serial number in the path vector, wherein the target serial number is a value that is 1 less than the initial serial number; taking the target serial number as an updated initial serial number, and determining whether the updated initial serial number is greater than 0; if it is greater than 0, returning to the step of taking the initial serial number as the second target row number.

Optionally, determining a plurality of valid tags from the manually labeled named entity identification data comprises: counting the transfer times among different named entity labels from the named entity identification data of the manual labeling to obtain a manual labeling label transfer table; taking the row number corresponding to the value larger than 0 in the manual label transfer table as an effective label serial number; and taking the named entity tag corresponding to the effective tag serial number as an effective tag.

Optionally, the manually labeled named entity identification data does not include the named entity identification data corresponding to the Chinese sequence to be predicted.

Optionally, labeling the to-be-predicted chinese sequence according to the sequence number corresponding to the named entity tag in the path vector, and determining the named entity result of the to-be-predicted chinese sequence, including: acquiring sequence numbers of named entity labels respectively corresponding to the path vector from the second element to the last element; determining a named entity tag corresponding to the serial number of the named entity tag; and taking the determined named entity label as a named entity result of the Chinese sequence to be predicted.
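The backtracking and labeling steps described above can be illustrated with a minimal sketch. It is not the claimed procedure itself: it assumes 0-based indices, no padding rows or columns for start and end positions, and that the final scores sit in the last row of the forward score matrix (rows indexed by character position). The function names `backtrace` and `to_labels` and the toy matrices are hypothetical.

```python
def backtrace(forward, backtrack):
    """Recover the best label index for every position, from back to front.

    forward[x][y]   : best cumulative score ending at position x with label y
    backtrack[x][y] : label index at position x-1 that produced forward[x][y]
    """
    last = len(forward) - 1
    path = [0] * len(forward)
    # The last path element is the label index with the largest final score.
    path[last] = max(range(len(forward[last])), key=lambda y: forward[last][y])
    # Each earlier element is read out of the backtracking record matrix.
    for x in range(last, 0, -1):
        path[x - 1] = backtrack[x][path[x]]
    return path


def to_labels(path, labels):
    """Map the serial numbers in the path vector to named entity labels."""
    return [labels[index] for index in path]


# Toy matrices for 3 characters and 2 labels (illustrative values only).
toy_forward = [[1.0, 0.0], [1.5, 4.0], [5.25, 4.625]]
toy_backtrack = [[None, None], [0, 0], [1, 1]]
toy_path = backtrace(toy_forward, toy_backtrack)
toy_result = to_labels(toy_path, ["B-BDY", "E-BDY"])
print(toy_path, toy_result)
```

The loop touches each position once, so recovering the path costs time proportional to the sequence length once the two matrices exist.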

In a second aspect, the present application further provides a medical named entity recognition apparatus, including:

the score prediction module is used for inputting the Chinese sequence to be predicted into a preset model to obtain a character transfer matrix and a label transfer matrix, wherein the character transfer matrix is used for representing the probability that characters in the Chinese sequence to be predicted are marked as each named entity label, and the label transfer matrix is used for representing the probability of mutual transfer among the named entity labels;

the tag determination module is used for determining a plurality of effective tags from the named entity identification data which are marked manually;

the operation module is used for operating the character scores in the character transfer matrix and the transfer scores corresponding to a plurality of effective labels in the label transfer matrix to obtain a forward score matrix and a backtracking record matrix;

the backtracking module is used for taking a column number corresponding to a forward score with the largest value in the last column of the forward score matrix as the value of the last element of the path vector, and sequentially determining the value of each element in the path vector from back to front based on the value of the last element of the path vector and the value of a target element in the backtracking record matrix;

and the marking module is used for marking the Chinese sequence to be predicted according to the serial number corresponding to the named entity label in the path vector and determining the named entity result of the Chinese sequence to be predicted.

In a third aspect, an embodiment of the present application further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is operated, the machine-readable instructions when executed by the processor performing the steps of the medical named entity recognition method as described above.

In a fourth aspect, the present application further provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, performs the steps of the medical named entity recognition method as described above.

The embodiment of the application brings the following beneficial effects:

According to the medical named entity recognition method, apparatus, electronic device, and storage medium provided by the embodiments of the application, a preset model performs score prediction on the Chinese sequence to be predicted to obtain a character transfer matrix and a label transfer matrix; the transfer scores corresponding to a plurality of valid labels and the character scores are then operated on to obtain a forward score matrix and a backtracking record matrix; and the named entity result of the Chinese sequence to be predicted is determined from the forward score matrix and the backtracking record matrix, without traversing and calculating the score or probability of each path.

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.

Fig. 1 shows a flowchart of a medical named entity recognition method provided by an embodiment of the present application;

Fig. 2 shows a schematic structural diagram of a medical named entity recognition apparatus provided by an embodiment of the present application;

Fig. 3 shows a schematic structural diagram of an electronic device provided by an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. Every other embodiment that can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present application falls within the protection scope of the present application.

It is worth noting that, before the present application, with the rapid development of the internet, the information available on the network has become increasingly abundant, which makes it increasingly difficult to find useful information quickly and accurately in massive amounts of data. Text on the network is natural language. Because of the compositional structure of Chinese text, understanding a text first requires extracting the features of the words in it, that is, extracting useful structured data from unstructured text. Named entity recognition is the task of extracting proper nouns such as personal names, place names, and organization names from massive natural language texts, so research on it has important significance and value. At present, named entity recognition requires decoding a CRF layer, and the decoding method is to traverse all paths, calculate the score or probability of each path, and then select the path with the largest score or probability as the predicted path. However, with this named entity recognition method, if there are N category labels and the sentence to be decoded has length L, then N^L operations are required, which causes the problems of long recognition time and low recognition efficiency.

Based on this, the embodiment of the application provides a medical named entity identification method, so that the identification efficiency is improved, and the identification time is shortened.

Referring to fig. 1, fig. 1 is a flowchart of a medical named entity recognition method according to an embodiment of the present disclosure. As shown in fig. 1, the medical named entity recognition method provided in the embodiment of the present application includes:

step S101, inputting a Chinese sequence to be predicted into a preset model to obtain a character transfer matrix and a label transfer matrix.

In this step, the Chinese sequence to be predicted may refer to a Chinese sequence to be labeled with a named entity tag.

As an example, the Chinese sequence to be predicted may or may not include punctuation marks.

The preset model may refer to a model which is trained and can output a character transfer matrix and a label transfer matrix, and the preset model is used for determining the character transfer matrix and the label transfer matrix.

As an example, the preset model may be a neural network model.

The character transfer matrix is used for representing the probability that characters in the Chinese sequence to be predicted are marked as each named entity label, the number of lines of the character transfer matrix is determined by the number of characters of the Chinese sequence to be predicted, and the number of columns of the character transfer matrix is determined by the number of the named entity labels.

The label transfer matrix is used for representing the probability of mutual transfer between the named entity labels, and the number of rows and the number of columns of the label transfer matrix are both determined by the number of the named entity labels.

In the embodiment of the application, take the Chinese sequence to be predicted to be "the liver edge is smooth, the size ratio of each leaf is normal". Each Chinese character is one character, and the punctuation mark also counts as 1 character, so the number of characters of the Chinese sequence to be predicted is 15. Before the number of named entity labels is determined, the named entity categories are determined. For example, there are 5 named entity categories: orientation (POS), anatomical region (BDY), symptom (SYM), observed object (WAT), and attribute (ATT). The finest granularity differs between labeling systems. In the "BIOES" labeling system, for example, the finest-granularity labels of a time category are the 4 named entity labels B-TIM, I-TIM, E-TIM, and S-TIM, where the B, I, E, or S before the symbol "-" represents the specific position of each character in a word; for example, B represents Begin and E represents End. Taking the two-character word for "liver" as an example, its first character is labeled B-BDY and its ending character is labeled E-BDY. Assuming that the finest granularity of each named entity category is 4, the 5 named entity categories yield 4 × 5 + 1 = 21 named entity labels, where the 1 added to the 20 category labels is the named entity label 0, which corresponds to punctuation marks and special characters. It can be seen that the character transfer matrix is a 15 × 21 matrix and the label transfer matrix is a 21 × 21 matrix.

It should be noted that the order of the column indexes in the character transfer matrix is consistent with the order of the row indexes and the column indexes in the label transfer matrix. For example, if the columns of the character transfer matrix are indexed by B-BDY, E-BDY, B-POS, and E-POS in sequence, then the first to fourth rows of the label transfer matrix are indexed by B-BDY, E-BDY, B-POS, and E-POS in sequence, and the first to fourth columns of the label transfer matrix are likewise indexed by B-BDY, E-BDY, B-POS, and E-POS in sequence.
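The label count and matrix sizes in the example above can be checked with a short sketch. The helper names `build_labels` and `matrix_shapes` are hypothetical, and the particular ordering of the labels is an assumption; the text only requires that both matrices share one ordering.

```python
# Category codes and position prefixes taken from the example in the text.
CATEGORIES = ["POS", "BDY", "SYM", "WAT", "ATT"]
POSITIONS = ["B", "I", "E", "S"]


def build_labels(categories, positions, outside="0"):
    """Return one ordered label list to be shared by both matrices.

    The label for punctuation and special characters is written "0",
    as in the text; placing it first is an arbitrary choice.
    """
    labels = [outside]
    for category in categories:
        for position in positions:
            labels.append(f"{position}-{category}")
    return labels


def matrix_shapes(num_chars, labels):
    """Shapes of the character transfer matrix and the label transfer matrix."""
    num_labels = len(labels)
    return (num_chars, num_labels), (num_labels, num_labels)


labels = build_labels(CATEGORIES, POSITIONS)
char_shape, trans_shape = matrix_shapes(15, labels)
print(len(labels), char_shape, trans_shape)
```

Keeping a single `labels` list and indexing both matrices through it is one simple way to guarantee the consistent ordering the paragraph calls for.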

Step S102, determining a plurality of effective labels from the named entity identification data marked manually.

In this step, the manually labeled named entity identification data may refer to the manually labeled historical named entity identification data.

The named entity identification data of the manual labeling comprises named entity labels corresponding to different Chinese sequences, and the named entity identification data of the manual labeling can be used as a labeling reference to determine a plurality of effective labels.

The valid labels may refer to the set of named entity labels that have a transfer relation in the Chinese sequence to be predicted. The valid labels determine the number of predicted paths, that is, they narrow the search range of the predicted paths corresponding to the Chinese sequence to be predicted.

In the embodiment of the application, historical named entity identification data marked manually is obtained first, a plurality of effective labels corresponding to the Chinese sequence to be predicted can be determined from the historical data, and the named entity result of the Chinese sequence to be predicted can be determined by utilizing the effective labels, the character transfer matrix and the label transfer matrix.

In an alternative embodiment, performing step S102 includes: counting the transfer times among different named entity labels from the named entity identification data labeled manually to obtain a manual labeling label transfer table; taking the row number corresponding to the value larger than 0 in the manual label transfer table as an effective label serial number; and taking the named entity label corresponding to the effective label serial number as an effective label.

Here, taking the example of transferring the named entity tag a to the named entity tag B, the transfer frequency may refer to the frequency that the next tag of the named entity tag a appears as the named entity tag B in a certain chinese sequence, the transfer frequency is used to represent the probability that the next tag adjacent to the named entity tag a is the named entity tag B, the more the transfer frequency, the higher the probability that the next tag is the named entity tag B, and the less the transfer frequency, the lower the probability that the next tag is the named entity tag B.

Taking the Chinese sequence "smooth liver edge" as an example, the manually labeled named entity labels are B-BDY, E-BDY, B-POS, E-POS, B-SYM, and E-SYM in sequence. That is, the two characters of the word for "liver" are labeled B-BDY and E-BDY, the two characters of the word for "edge" are labeled B-POS and E-POS, and the two characters of the word for "smooth" are labeled B-SYM and E-SYM, so that the manual label transfer table shown in Table 1 can be obtained.

Table 1 is the manual label transfer table corresponding to "smooth liver edge".

As shown in Table 1, in the Chinese sequence "smooth liver edge" the first character of the word for "liver" is followed by its second character, that is, the named entity label B-BDY is followed by the named entity label E-BDY, so the number of transfers from B-BDY to E-BDY is 1. Similarly, the number of transfers from E-BDY to B-POS is 1, the number of transfers from B-POS to E-POS is 1, the number of transfers from E-POS to B-SYM is 1, and the number of transfers from B-SYM to E-SYM is 1.

It can be understood that, in Table 1, all the transfer counts corresponding to E-SYM are 0, which indicates that the Chinese sequence "smooth liver edge" contains no path that transfers from E-SYM to another named entity label. Therefore, the predicted paths corresponding to this named entity label do not need to be traversed.
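The counting described for Table 1 can be sketched as follows. This is only an illustration under assumptions: the function names `count_transitions` and `valid_labels` are hypothetical, and the transfer table is stored as a dictionary keyed by (current label, next label) rather than as a two-dimensional table.

```python
from collections import Counter


def count_transitions(labeled_sequences):
    """Count how often one label is immediately followed by another."""
    counts = Counter()
    for tags in labeled_sequences:
        for current_tag, next_tag in zip(tags, tags[1:]):
            counts[(current_tag, next_tag)] += 1
    return counts


def valid_labels(counts, all_labels):
    """Keep the labels whose row in the transfer table has a value above 0."""
    sources = {source for (source, _target), n in counts.items() if n > 0}
    return [label for label in all_labels if label in sources]


# Manual labels of the six-character example sequence from the text.
tags = ["B-BDY", "E-BDY", "B-POS", "E-POS", "B-SYM", "E-SYM"]
counts = count_transitions([tags])
valid = valid_labels(counts, tags)
print(valid)
```

On this single example, E-SYM never appears as the source of a transfer, so it is left out of the valid labels, which matches the observation about Table 1.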

In an optional embodiment, the manually labeled named entity identification data does not include the named entity identification data corresponding to the Chinese sequence to be predicted.

Here, since the named entity identification data of the manual annotation determines the manual annotation tag transfer table, the manual annotation tag transfer table determines the valid tag, and the valid tag determines the kind of the named entity tag to be annotated for the chinese sequence to be predicted. Therefore, if the named entity identification data labeled manually includes the named entity identification data corresponding to the chinese sequence to be predicted, the accuracy of the preset model for predicting the chinese sequence to be predicted will be affected.

Step S103, the character scores in the character transfer matrix and the transfer scores corresponding to a plurality of effective labels in the label transfer matrix are calculated to obtain a forward score matrix and a backtracking record matrix.

In this step, the character score may refer to a value in a character transfer matrix, where the character score is used to represent the probability that a character in the chinese sequence to be predicted is labeled as a corresponding named entity tag, a larger character score indicates a higher probability that the character in the chinese sequence to be predicted is labeled as the corresponding named entity tag, and a smaller character score indicates a lower probability that the character in the chinese sequence to be predicted is labeled as the corresponding named entity tag.

The transition score can indicate a value in the label transition matrix, the transition score is used for representing the probability of the label of the current named entity transferring to the corresponding named entity label, the higher the transition score is, the higher the probability of the label of the current named entity transferring to the corresponding named entity label is, and the smaller the transition score is, the lower the probability of the label of the current named entity transferring to the corresponding named entity label is.

The forward score matrix may refer to a matrix that records forward scores for recording the probability that a character in the chinese sequence to be predicted is labeled as a corresponding named entity tag.

The forward score matrix is a two-dimensional matrix of size (L + 1) × (N + 1), where L represents the number of characters of the Chinese sequence to be predicted, and N represents the number of named entity labels.

The backtracking record matrix may refer to a matrix for recording positions of the named entity tags, and the backtracking record matrix is used for recording positions of the named entity tags corresponding to the path with the maximum score in the forward score matrix.

The backtracking record matrix is likewise a two-dimensional matrix of size (L + 1) × (N + 1).

The forward score can refer to a value in a forward score matrix, and the forward score is used for representing the probability that characters in the Chinese sequence to be predicted are marked as the corresponding named entity labels.

As an example, the forward score in row x, column y of the forward score matrix represents the maximum cumulative score over all paths that pass through the rows before row x and reach column y of row x.

The backtracking record value may refer to a value in the backtracking record matrix. The backtracking record value represents the position of the named entity label on the maximum-score path in the row preceding the current row of the forward score matrix.

As an example, the backtracking record value in row x, column y of the backtracking record matrix indicates which forward score in row x-1 of the forward score matrix was used to calculate the forward score in row x, column y of the forward score matrix. That is, it records the column of the forward score in row x-1 to which the transition score and the character score were added to obtain the forward score in row x, column y.
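The relationship between the forward score matrix and the backtracking record matrix can be illustrated with a minimal sketch of the recurrence. It is a simplification, not the embodiment's exact indexing: it assumes 0-based rows and columns, no extra start or end row or column, and that the first row is taken directly from the character scores. The function name `forward_pass` and the toy numbers are hypothetical.

```python
def forward_pass(char_scores, trans_scores, valid):
    """Fill the forward score matrix and the backtracking record matrix.

    char_scores[x][y]  : score of labeling character x with label y
    trans_scores[k][y] : score of moving from label k to label y
    valid              : indices of the valid labels allowed as predecessors
    """
    num_chars = len(char_scores)
    num_labels = len(trans_scores)
    forward = [[0.0] * num_labels for _ in range(num_chars)]
    backtrack = [[None] * num_labels for _ in range(num_chars)]
    forward[0] = list(char_scores[0])  # first backtracking row stays empty
    for x in range(1, num_chars):
        for y in range(num_labels):
            # Only valid labels are considered as the previous label.
            best_k = max(valid, key=lambda k: forward[x - 1][k] + trans_scores[k][y])
            forward[x][y] = (forward[x - 1][best_k]
                             + trans_scores[best_k][y]
                             + char_scores[x][y])
            backtrack[x][y] = best_k
    return forward, backtrack


# Toy input: 3 characters, 2 labels (illustrative values only).
toy_char = [[1.0, 0.0], [0.0, 2.0], [1.0, 0.5]]
toy_trans = [[0.5, 1.0], [0.25, 0.125]]
fwd_all, back_all = forward_pass(toy_char, toy_trans, valid=[0, 1])
fwd_one, back_one = forward_pass(toy_char, toy_trans, valid=[0])
print(fwd_all[-1], back_all[-1])
```

With V valid labels, each cell compares V candidates instead of all N labels, so the fill costs on the order of L × N × V additions rather than enumerating every path.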

In an alternative embodiment, performing step S103 includes: setting an initial row number as 1; taking the initial line number as a first target line number; determining values of a plurality of elements corresponding to a first target row number in a forward fractional matrix and values of a plurality of elements corresponding to the first target row number in a backtracking record matrix; taking a numerical value obtained by adding 1 to the first target line number as an updated initial line number, and determining whether the updated initial line number is smaller than a target length, wherein the target length is a numerical value which is 2 more than the number of characters of the Chinese sequence to be predicted; if the length is smaller than the target length, returning to execute the step of taking the initial line number as the first target line number.

Here, the first target row number is the number that identifies a target row. The same row number refers to different rows in different matrices: in the forward score matrix, the first row has row number 0, the second row has row number 1, and so on.

In the character transfer matrix, the label transfer matrix, and the backtracking record matrix, the first row has row number 1, the second row has row number 2, and so on. The values in the first row of the backtracking record matrix are null.

The first target column number may refer to a number corresponding to a target column, and the first target column number is used to determine corresponding column sequences in different matrices under the condition of the same value.

Taking the above example, where L represents the number of characters of the Chinese sequence to be predicted, the target length is L + 2.

Specifically, the values of all elements in the first row in the forward fractional matrix are determined, then the values of all elements in the second row are determined, and so on, and the values of all elements in the forward fractional matrix are determined. And when the value of one element in the forward fractional matrix is determined, the value of the element at the corresponding position in the backtracking record matrix is also determined.

In an optional embodiment, determining the values of the plurality of elements corresponding to the first target row number in the forward score matrix and the values of the plurality of elements corresponding to the first target row number in the backtracking record matrix includes: setting an initial column number to 1; taking the initial column number as a first target column number; determining, as a plurality of candidate derived transfer scores, the sums of the values corresponding to the columns where the plurality of effective labels are located in the row preceding the first target row number in the forward score matrix and the values corresponding to the rows where the plurality of effective labels are located in each column less than or equal to the first target column number in the label transfer matrix; selecting the candidate derived transfer score with the largest value from the plurality of candidate derived transfer scores as a target derived transfer score; taking the sum of the target derived transfer score and the value of the target element in the character transfer matrix as the value of the target element in the forward score matrix, wherein the target element is the element corresponding to the first target row number and the first target column number; taking the serial number of the effective label corresponding to the target derived transfer score as the value of the target element in the backtracking record matrix; taking the value obtained by adding 1 to the first target column number as an updated initial column number, and determining whether the updated initial column number is smaller than a target label number, wherein the target label number is 2 more than the number of set named entity labels; and if the updated initial column number is smaller than the target label number, returning to the step of taking the initial column number as the first target column number.

Here, the row numbers of the forward score matrix are counted from 0; thus, when the initial row number is 1, the first target row number is also 1, and it denotes the second row of the forward score matrix.

It should be noted that the value of each element in the first row of the forward score matrix is a set value, for example -1000 for all elements, and the calculation of the values of the elements in the forward score matrix starts from the element in the second row and the first column.

The following describes the calculation of the values of the elements in the forward score matrix with reference to Tables 2, 3 and 4.

Here, assuming that there are 4 named entity labels in total and that the Chinese sequence to be predicted has 3 characters, the forward score matrix can be represented as in Table 2.

Table 2 is a forward score matrix.

As shown in Table 2, the row number of the first row of the forward score matrix is 0, and row numbers from 1 onward represent the serial numbers of the characters in the Chinese sequence to be predicted. The values of the elements in the first row and in the last column of the forward score matrix are -1000, wherein the values of the last-column elements from the second row onward are directly assigned during the calculation.
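The layout just described can be sketched in a few lines of Python. This is only an illustrative sketch: the names `forward`, `backtrack` and `INIT` are hypothetical, rows and columns are 0-based as in Python, and cells that have not yet been computed are marked with `None`.

```python
# Illustrative layout for L = 3 characters and N = 4 named entity labels.
L, N = 3, 4
INIT = -1000  # set value used for the first row and the last column

# Forward score matrix: rows 0..L (row 0 is the initial row),
# columns 0..N (the last column is the extra, directly assigned one).
forward = [[None] * (N + 1) for _ in range(L + 1)]
forward[0] = [INIT] * (N + 1)
for row in forward[1:]:
    row[N] = INIT

# Backtracking record matrix of the same shape; its first row stays null.
backtrack = [[None] * (N + 1) for _ in range(L + 1)]
```

The remaining `None` cells are the ones filled in row by row by the calculation described next.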

Table 3 is a label transfer matrix.

As shown in table 3, each row index and each column index of the label transfer matrix are names of named entity labels, and the sequence of the named entity labels in the rows and columns is the same.

Assuming that the plurality of effective labels are B-BDY and E-BDY, when an element in the first column of the forward score matrix is calculated, the value corresponding to the effective label B-BDY in the previous row of the forward score matrix is added to the value 20 in the row where B-BDY is located in the first column of the label transfer matrix to determine one candidate derived transfer score, and the value corresponding to the effective label E-BDY in the previous row of the forward score matrix is added to the value 18 in the row where E-BDY is located in the first column of the label transfer matrix to determine another candidate derived transfer score.

When an element in the second column of the forward score matrix is calculated, the value corresponding to the effective label B-BDY in the previous row of the forward score matrix is added to the value 20 in the row where B-BDY is located in the first column of the label transfer matrix to determine a first candidate derived transfer score; the value corresponding to the effective label E-BDY in the previous row of the forward score matrix is added to the value 18 in the row where E-BDY is located in the first column of the label transfer matrix to determine a second candidate derived transfer score; the value corresponding to the effective label B-BDY in the previous row of the forward score matrix is added to the value 65 in the row where B-BDY is located in the second column of the label transfer matrix to determine a third candidate derived transfer score; and the value corresponding to the effective label E-BDY in the previous row of the forward score matrix is added to the value 90 in the row where E-BDY is located in the second column of the label transfer matrix to determine a fourth candidate derived transfer score.

Table 4 is a character transfer matrix.

As shown in Table 4, the initial row number of the character transfer matrix is 1, and the row number of the character transfer matrix represents the serial number of the character in the Chinese sequence to be predicted. The order of the labels (i.e., the named entity labels) is the same as the order of the row indices and column indices in the label transfer matrix. For the element being calculated, the candidate derived transfer score with the largest value is selected from the four candidate derived transfer scores as the target derived transfer score, and the sum of the target derived transfer score and the value of the target element in the character transfer matrix is taken as the value of that element of the forward score matrix; that is, the result of adding the target derived transfer score to the value 50 in the second row and the second column of the character transfer matrix is taken as the value of that element of the forward score matrix.

After the value of that element of the forward score matrix is determined, the value of the element at the corresponding position in the backtracking record matrix can be determined. Continuing the above example, assuming that the candidate derived transfer score corresponding to the target derived transfer score is the third candidate derived transfer score, the serial number 2 corresponding to the effective label B-BDY is taken as the value in the second row and the second column of the backtracking record matrix. The serial number refers to the serial number of a named entity label; when the named entity labels are determined, a serial number is allocated to each named entity label as its unique identifier.

By following the above manner of determining the value of an element of the forward score matrix and the value of the element at the corresponding position in the backtracking record matrix, the value of every element in the forward score matrix and in the backtracking record matrix can be determined.
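The worked example above can be reproduced with a short Python sketch. The transfer values 20, 18, 65 and 90, the character score 50 and the serial number 2 for B-BDY come from the text. The previous-row forward scores (100 and 60), the serial number assigned to E-BDY and the function name `derive_cell` are hypothetical and chosen only so that the third candidate wins, as the example assumes.

```python
def derive_cell(prev_row, transition, valid_labels, target_col, char_score):
    """Return (forward value, winning label) for one forward-matrix element."""
    best_score, best_label = None, None
    for col in range(1, target_col + 1):        # every column <= target column
        for label in valid_labels:              # effective labels only
            candidate = prev_row[label] + transition[(label, col)]
            if best_score is None or candidate > best_score:
                best_score, best_label = candidate, label
    return best_score + char_score, best_label

# Label transfer values quoted in the text, keyed by (row label, column number).
transition = {("B-BDY", 1): 20, ("E-BDY", 1): 18,
              ("B-BDY", 2): 65, ("E-BDY", 2): 90}
prev_row = {"B-BDY": 100, "E-BDY": 60}   # hypothetical previous-row scores
serial = {"B-BDY": 2, "E-BDY": 4}        # serial of E-BDY is hypothetical

value, label = derive_cell(prev_row, transition, ["B-BDY", "E-BDY"], 2, 50)
back_pointer = serial[label]
```

With these inputs the four candidates are 120, 78, 165 and 150, so the third candidate (B-BDY, second column) is selected, the forward score element becomes 165 + 50 = 215, and the backtracking record element is the serial number 2.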

In a specific implementation, the values of the elements in the forward score matrix and the backtracking record matrix can be determined through the following steps A to I, which are not shown in the figures.

Step A, setting a variable r with an initial value of 2; setting the forward score matrix as an (L + 1) x (N + 1) two-dimensional matrix, with the values of all elements in its first row set to -1000; and setting the backtracking record matrix as an (L + 1) x (N + 1) two-dimensional matrix.

Step B, setting a variable c with an initial value of 1.

Step C, setting variables max_v and max_i, wherein the initial value of max_v is -1000000 and the initial value of max_i is 1; setting a variable k with an initial value of 1; determining the row numbers corresponding to all elements with values greater than 0 in the manually labeled label transfer table, and forming a row vector from these row numbers in ascending order, wherein the row vector is named valid_last_tags, and valid_last_tags[1] denotes the first element of the row vector valid_last_tags.

Step D, determining the value of valid_last_tags[k] and assigning it to a variable i; calculating the sum of the value in the (r-1)-th row and the i-th column of the forward score matrix and the value in the i-th row and the c-th column of the label transfer matrix, and assigning the sum to a variable tmp; determining whether tmp is greater than max_v; and if so, assigning the value of tmp to max_v and the value of i to max_i.

Step E, updating the value of k to k + 1, and determining whether the updated value of k is greater than N; if so, executing step F; otherwise, returning to step D.

Step F, determining whether the value of c is equal to N + 1; if so, setting the value in the r-th row and the c-th column of the forward score matrix to -1000; otherwise, taking the sum of max_v and the value in the (r-1)-th row and the c-th column of the character transfer matrix as the value in the r-th row and the c-th column of the forward score matrix.

Step G, taking the value of max_i as the value in the r-th row and the c-th column of the backtracking record matrix.

Step H, updating the value of c to c + 1, and determining whether the updated value of c is equal to N + 2; if so, executing step I; otherwise, returning to step C.

Step I, updating the value of r to r + 1, and determining whether the updated value of r is equal to L + 2; if so, ending the process; otherwise, returning to step B.
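Steps A to I can be sketched roughly as the nested loops below. This is a minimal sketch under several assumptions rather than the implementation of the application: indices are 0-based, the extra (N + 1)-th column that is always assigned -1000 is omitted for brevity, the transfer score is read from row i (previous label) and column c (current label) of the label transfer matrix, and the inner loop runs over the entries of `valid_last_tags` directly. The function and parameter names are hypothetical.

```python
INIT = -1000         # set value of the initial row (Step A)
NEG_INF = -1000000   # initial value of max_v (Step C)

def forward_pass(char_scores, trans_scores, valid_last_tags):
    """Fill the forward score matrix and the backtracking record matrix.

    char_scores[t][c]  : score of character t taking label c (L x N)
    trans_scores[i][c] : score of moving from label i to label c (N x N)
    valid_last_tags    : 0-based indices of the effective labels
    """
    L, N = len(char_scores), len(trans_scores)
    forward = [[INIT] * N] + [[None] * N for _ in range(L)]
    backtrack = [[None] * N for _ in range(L + 1)]
    for r in range(1, L + 1):              # Steps B and I: one row per character
        for c in range(N):                 # Steps C and H: one column per label
            max_v, max_i = NEG_INF, 0
            for i in valid_last_tags:      # Steps D and E: effective labels only
                tmp = forward[r - 1][i] + trans_scores[i][c]
                if tmp > max_v:
                    max_v, max_i = tmp, i
            forward[r][c] = max_v + char_scores[r - 1][c]   # Step F
            backtrack[r][c] = max_i                         # Step G
    return forward, backtrack
```

Restricting the inner loop to the effective labels is what keeps the work per element proportional to the number of effective labels instead of to all N labels.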

Step S104, taking the column number corresponding to the forward score with the maximum value in the last column of the forward score matrix as the value of the last element of the path vector, and sequentially determining the value of each element in the path vector from back to front based on the value of the last element of the path vector and the value of the target element in the backtracking record matrix.

In this step, the path vector refers to a vector recording the path with the largest score, and is used to record the serial numbers of the named entity labels corresponding to the Chinese sequence to be predicted.

The path vector is a one-dimensional vector of length L + 1.

In an optional embodiment, taking the column number corresponding to the forward score with the largest value in the last column of the forward score matrix as the value of the last element of the path vector, and sequentially determining the value of each element in the path vector from back to front based on the value of the last element of the path vector and the value of the target element in the backtracking record matrix, includes: taking a value that is 1 more than the number of characters of the Chinese sequence to be predicted as an initial serial number; taking the initial serial number as a second target row number; taking the initial serial number as an element serial number, and taking the value corresponding to the element serial number in the path vector as a second target column number; taking the value of the element corresponding to the second target row number and the second target column number in the backtracking record matrix as the value of the element corresponding to a target serial number in the path vector, wherein the target serial number is 1 less than the initial serial number; taking the target serial number as an updated initial serial number, and determining whether the updated initial serial number is greater than 0; and if the updated initial serial number is greater than 0, returning to the step of taking the initial serial number as the second target row number.

Here, the second target row number may refer to the number of the target row, and the second target column number may refer to the number of the target column.

Specifically, the determination process of the value of the element in the path vector may be completed through the following steps M to S, which are not shown in the figure.

Step M, creating a path vector path, the length of which is set to L + 1.

Step N, determining the column number corresponding to the forward score with the largest value in the last row of the forward score matrix, and assigning the column number to last_index.

Step O, taking the value of last_index as the value of the last element of the path vector.

Step P, setting a variable j with a value of L + 1, wherein j is a real number.

Step Q, changing the value of path[j-1] to the value in the j-th row and the path[j]-th column of the backtracking record matrix.

Step R, decreasing the value of j by 1.

Step S, determining whether the value of j is equal to 0; if so, ending the process; otherwise, returning to step Q.
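The backtracking in steps M to S can be illustrated with the following sketch. It assumes 0-based Python indexing, so the path vector has indices 0 to L and the labels of the characters occupy its second to last elements; the function name `backtrace` and the small matrices in the usage lines are hypothetical.

```python
def backtrace(forward, backtrack):
    """Recover the highest-scoring label path from the two matrices."""
    L = len(forward) - 1                    # number of characters
    path = [None] * (L + 1)                 # Step M: path vector of length L + 1
    last_row = forward[L]
    # Step N: column number of the largest forward score in the last row.
    last_index = max(range(len(last_row)), key=last_row.__getitem__)
    path[L] = last_index                    # Step O
    for j in range(L, 0, -1):               # Steps P, R and S: count j down
        path[j - 1] = backtrack[j][path[j]] # Step Q
    return path

# Hypothetical 2-character, 2-label example.
forward = [[-1000, -1000], [-997, -992], [-986, -990]]
backtrack = [[None, None], [1, 0], [1, 1]]
path = backtrace(forward, backtrack)   # labels of the characters are path[1:]
```

Each element is obtained with a single table lookup, so the backward pass costs one step per character regardless of the number of labels.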

In an optional embodiment, labeling the Chinese sequence to be predicted according to the serial numbers corresponding to the named entity labels in the path vector, and determining the named entity result of the Chinese sequence to be predicted, includes: acquiring the serial numbers of the named entity labels respectively corresponding to the second element to the last element of the path vector; determining the named entity label corresponding to each serial number; and taking the determined named entity labels as the named entity result of the Chinese sequence to be predicted.

Here, assume that the Chinese sequence to be predicted is "mixed hemorrhoids with bleeding", that the values of the second element to the last element of the path vector path are 2, 3, 4, 10, 2 and 4 in sequence, and that the named entity label corresponding to serial number 2 is B-SYM, the one corresponding to serial number 3 is I-SYM, the one corresponding to serial number 4 is E-SYM, and the one corresponding to serial number 10 is S-CNJ. The named entity result for the Chinese sequence to be predicted, "mixed hemorrhoids with bleeding", is then as shown in Table 5.

Table 5 shows the named entity result for "mixed hemorrhoids with bleeding".

As shown in Table 5, according to the "BIOES" labeling scheme, the following named entities are obtained in "mixed hemorrhoids with bleeding": "mixed hemorrhoids" is SYM (symptom), "with" is CNJ (additive), and "bleeding" is SYM (symptom).
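The mapping from the path vector to named entities can be sketched as follows. The serial numbers and labels are those of the example above. The Chinese string "混合痔伴出血" is an assumed back-translation of "mixed hemorrhoids with bleeding" (six characters, matching the six label values), and the first element of `path` is a hypothetical placeholder, since only the second to last elements carry labels.

```python
SERIAL_TO_LABEL = {2: "B-SYM", 3: "I-SYM", 4: "E-SYM", 10: "S-CNJ"}

def decode_entities(text, path):
    """Group BIOES labels into (entity text, category) pairs."""
    labels = [SERIAL_TO_LABEL[s] for s in path[1:]]   # second to last element
    entities, start = [], None
    for pos, label in enumerate(labels):
        prefix, _, category = label.partition("-")
        if prefix == "S":                  # single-character entity
            entities.append((text[pos], category))
        elif prefix == "B":                # entity begins here
            start = pos
        elif prefix == "E" and start is not None:   # entity ends here
            entities.append((text[start:pos + 1], category))
            start = None
    return entities

text = "混合痔伴出血"                      # assumed original Chinese sequence
path = [0, 2, 3, 4, 10, 2, 4]              # first element is a placeholder
entities = decode_entities(text, path)
```

Under these assumptions `entities` holds three pairs, corresponding to the symptom, the CNJ word and the second symptom listed for Table 5.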

Compared with the traversal method in the prior art, the technical scheme provided by the application can improve the decoding speed and the decoding efficiency.
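To make the contrast with traversal concrete, the sketch below decodes the same small input twice: once by scoring every one of the N**L label paths, and once with a forward score matrix and backtracking. It is a simplified illustration, not the procedure of the application: it treats all labels as effective, starts from the first character's scores instead of an initial row of -1000, and all names are hypothetical.

```python
from itertools import product

def path_score(path, char_scores, trans_scores):
    """Total score of one complete label path."""
    score = char_scores[0][path[0]]
    for t in range(1, len(path)):
        score += trans_scores[path[t - 1]][path[t]] + char_scores[t][path[t]]
    return score

def decode_by_traversal(char_scores, trans_scores):
    """Score all N**L paths and keep the best one."""
    L, N = len(char_scores), len(trans_scores)
    return max(product(range(N), repeat=L),
               key=lambda p: path_score(p, char_scores, trans_scores))

def decode_by_forward_scores(char_scores, trans_scores):
    """Fill forward scores row by row, then backtrack once."""
    L, N = len(char_scores), len(trans_scores)
    forward = [list(char_scores[0])]
    backtrack = [[None] * N]
    for t in range(1, L):
        row, back = [], []
        for c in range(N):
            best = max(range(N),
                       key=lambda i: forward[t - 1][i] + trans_scores[i][c])
            row.append(forward[t - 1][best] + trans_scores[best][c]
                       + char_scores[t][c])
            back.append(best)
        forward.append(row)
        backtrack.append(back)
    path = [max(range(N), key=lambda c: forward[-1][c])]
    for t in range(L - 1, 0, -1):
        path.append(backtrack[t][path[-1]])
    return tuple(reversed(path))

chars = [[1, 5], [4, 2]]
trans = [[0, 3], [2, 0]]
best_a = decode_by_traversal(chars, trans)
best_b = decode_by_forward_scores(chars, trans)
```

Both functions return the same best path for this input, but the traversal version evaluates N**L paths while the forward-score version performs on the order of L x N x N additions.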

Suppose the number of characters of the Chinese sequence to be predicted on which named entity recognition is performed is L, and that there are X named entity categories and N named entity labels in total. Then, according to actual test results, the time complexity of the technical solution provided in the present application in the decoding stage is [formula missing in source].

By contrast, the time complexity of the traversal decoding method in the decoding stage is approximately [formula missing in source]. Because L and N do not tend to infinity in actual use, the 1/3 described above need not be omitted.

Then, when [condition missing in source], using this method can greatly improve decoding efficiency and reduce the running time of the computer. If, in actual use, L averages 500 and N is 200, calculation shows that decoding with the technical solution provided in the present application is [value missing in source] times faster than the traversal decoding method.

Compared with named entity recognition methods in the prior art, the method of the present application uses the preset model to perform score prediction on the Chinese sequence to be predicted to obtain the character transfer matrix and the label transfer matrix, operates on the character scores and the transfer scores corresponding to the effective labels to obtain the forward score matrix and the backtracking record matrix, and determines the named entity result of the Chinese sequence to be predicted from the forward score matrix and the backtracking record matrix. Because the score or probability of each path is not calculated by traversal, the problems of long recognition time and low recognition efficiency in named entity recognition are solved.

Based on the same inventive concept, a medical named entity recognition device corresponding to the medical named entity recognition method is also provided in the embodiments of the present application, and as the principle of solving the problem of the device in the embodiments of the present application is similar to that of the medical named entity recognition method in the embodiments of the present application, the implementation of the device can refer to the implementation of the method, and the repeated parts are not described again.

Referring to fig. 2, fig. 2 is a schematic structural diagram of a medical named entity recognition apparatus according to an embodiment of the present disclosure. As shown in fig. 2, the medical named entity recognition apparatus 200 includes:

the score prediction module 201 is configured to input the Chinese sequence to be predicted to a preset model to obtain a character transfer matrix and a label transfer matrix, where the character transfer matrix is used to represent the possibility that a character in the Chinese sequence to be predicted is labeled as each named entity label, and the label transfer matrix is used to represent the possibility that the named entity labels are transferred to each other;

a tag determination module 202, configured to determine a plurality of valid tags from the manually labeled named entity identification data;

the operation module 203 is configured to operate the character scores in the character transfer matrix and the transfer scores corresponding to a plurality of effective tags in the tag transfer matrix to obtain a forward score matrix and a backtracking record matrix;

a backtracking module 204, configured to take a column number corresponding to a forward score with a largest value in a last column of the forward score matrix as a value of a last element of the path vector, and determine a value of each element in the path vector sequentially from back to front based on the value of the last element of the path vector and a value of a target element in the backtracking record matrix;

and the labeling module 205 is configured to label the to-be-predicted Chinese sequence according to the sequence number corresponding to the named entity tag in the path vector, and determine a named entity result of the to-be-predicted Chinese sequence.
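As an illustration of what the tag determination module 202 computes, the sketch below counts label-to-label transfers in manually labeled sequences and keeps the labels whose row in the resulting table contains a value greater than 0. This reading of "row number corresponding to a value greater than 0" is an assumption, and the function name, label list and sample sequence are hypothetical.

```python
def valid_labels_from_annotations(label_sequences, label_names):
    """Build the label transfer count table and list the effective labels."""
    index = {name: k for k, name in enumerate(label_names)}
    n = len(label_names)
    table = [[0] * n for _ in range(n)]
    for seq in label_sequences:
        # Count each adjacent (previous label, current label) pair.
        for prev, cur in zip(seq, seq[1:]):
            table[index[prev]][index[cur]] += 1
    # A label is kept if it was ever observed transferring to another label.
    valid = [k for k, row in enumerate(table) if any(v > 0 for v in row)]
    return table, valid

names = ["O", "B-SYM", "I-SYM", "E-SYM"]          # hypothetical label set
data = [["B-SYM", "I-SYM", "E-SYM", "O"]]         # hypothetical annotation
table, valid = valid_labels_from_annotations(data, names)
```

In this sample the label "O" only appears at the end of the sequence, so its row stays all zero and it is not among the effective labels.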

Referring to fig. 3, fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 3, the electronic device 300 includes a processor 310, a memory 320, and a bus 330.

The memory 320 stores machine-readable instructions executable by the processor 310, when the electronic device 300 runs, the processor 310 communicates with the memory 320 through the bus 330, and when the machine-readable instructions are executed by the processor 310, the steps of the medical named entity recognition method in the embodiment of the method shown in fig. 1 may be performed.

An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the medical named entity identification method in the method embodiment shown in fig. 1 may be executed.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present application, and are used for illustrating the technical solutions of the present application, but not limiting the same, and the scope of the present application is not limited thereto, and although the present application is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope disclosed in the present application; such modifications, changes or substitutions do not depart from the spirit and scope of the exemplary embodiments of the present application, and are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A medical named entity recognition method, comprising:

inputting a medical Chinese sequence to be predicted into a preset model to obtain a character transfer matrix and a label transfer matrix, wherein the character transfer matrix is used for representing the probability that characters in the medical Chinese sequence to be predicted are marked as each named entity label, the label transfer matrix is used for representing the probability of mutual transfer among the named entity labels, the named entity labels are labels corresponding to named entity classes, and the named entity classes comprise: orientation, anatomical region, symptom, observation object, attribute;

determining a plurality of effective labels from the named entity identification data labeled manually, wherein the effective labels are a set of named entity labels with transfer relations in the medical Chinese sequence to be predicted;

calculating the character scores in the character transfer matrix and the transfer scores corresponding to a plurality of effective labels in the label transfer matrix to obtain a forward score matrix and a backtracking record matrix;

taking the column number corresponding to the forward fraction with the largest value in the last column of the forward fraction matrix as the value of the last element of the path vector, and sequentially determining the value of each element in the path vector from back to front based on the value of the last element of the path vector and the value of the target element in the backtracking record matrix;

and labeling the medical Chinese sequence to be predicted according to the sequence number corresponding to the named entity label in the path vector, and determining the named entity result of the medical Chinese sequence to be predicted.

2. The method according to claim 1, wherein the operating the character scores in the character transfer matrix and the transfer scores corresponding to a plurality of valid tags in the tag transfer matrix to obtain a forward score matrix and a backtracking record matrix comprises:

setting an initial row number as 1;

taking the initial row number as a first target row number;

determining values of a plurality of elements corresponding to a first target row number in the forward fractional matrix and values of a plurality of elements corresponding to the first target row number in the backtracking record matrix;

taking a numerical value obtained by adding 1 to the first target row number as an updated initial row number, and determining whether the updated initial row number is smaller than a target length, wherein the target length is a numerical value which is 2 more than the number of characters of the medical Chinese sequence to be predicted;

if the updated initial row number is smaller than the target length, returning to execute the step of taking the initial row number as the first target row number.

3. The method of claim 2, wherein the determining values of a plurality of elements corresponding to a first target row number in the forward score matrix and values of a plurality of elements corresponding to a first target row number in the backtracking record matrix comprises:

setting an initial column number as 1;

taking the initial column number as a first target column number;

determining the sum of values corresponding to the columns where the plurality of effective labels are located in the row before the first target row number in the forward score matrix and the values corresponding to the rows where the plurality of effective labels are located in each column which is less than or equal to the first target column number in the label transfer matrix as a plurality of candidate derived transfer scores;

selecting a candidate derived transfer score with the largest value from the plurality of candidate derived transfer scores as a target derived transfer score;

taking the sum of the target derived transfer score and the value of a target element in the character transfer matrix as the value of the target element in a forward score matrix, wherein the target element is an element corresponding to the first target row number and the first target column number;

taking the serial number corresponding to the effective label corresponding to the target derivative transfer score as a value of a target element in a backtracking record matrix;

taking a numerical value obtained by adding 1 to the first target column number as an updated initial column number, and determining whether the updated initial column number is smaller than a target label number, wherein the target label number is a numerical value which is 2 more than the number of the set named entity labels;

and if the updated initial column number is smaller than the target label number, returning to execute the step of taking the initial column number as the first target column number.

4. The method according to claim 1, wherein the step of determining the value of each element in the path vector in sequence from back to front based on the value of the last element of the path vector and the value of the target element in the backtracking record matrix by taking a column number corresponding to the forward score with the largest value in the last column of the forward score matrix as the value of the last element of the path vector comprises:

taking a numerical value which is 1 more than the number of the characters of the medical Chinese sequence to be predicted as an initial serial number;

taking the initial serial number as a second target row number;

taking the initial serial number as an element serial number, and taking a value corresponding to the element serial number in the path vector as a second target column number;

taking the values of the elements corresponding to the second target row number and the second target column number in the backtracking record matrix as the values of the elements corresponding to the target sequence number in the path vector, wherein the target sequence number is a numerical value which is 1 less than the initial sequence number;

taking the target sequence number as an updated initial sequence number, and determining whether the updated initial sequence number is greater than 0;

and if the updated initial sequence number is greater than 0, returning to execute the step of taking the initial sequence number as a second target row number.

5. The method of claim 1, wherein determining a plurality of valid tags from the manually labeled named entity identification data comprises:

counting the transfer times among different named entity labels from the named entity identification data of the manual labeling to obtain a manual labeling label transfer table;

taking the row number corresponding to the value larger than 0 in the manual label transfer table as an effective label serial number;

and taking the named entity tag corresponding to the effective tag serial number as an effective tag.

6. The method of claim 5, wherein the manually labeled named entity recognition data does not include named entity recognition data corresponding to the medical Chinese sequence to be predicted.

7. The method according to claim 1, wherein the labeling the medical Chinese sequence to be predicted according to the sequence number corresponding to the named entity label in the path vector to determine the named entity result of the medical Chinese sequence to be predicted comprises:

acquiring the sequence numbers of the named entity labels held in the path vector from its second element to its last element;

determining the named entity label corresponding to each acquired sequence number;

and taking the determined named entity label as a named entity result of the medical Chinese sequence to be predicted.
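
The mapping from path vector to labels described above might look like the following minimal sketch. The function name `labels_from_path` and the label names are hypothetical, and the path vector is assumed to be a 0-indexed list whose first element is skipped, as the claim recites.

```python
def labels_from_path(path, label_names):
    """Map label sequence numbers in path[1:] to their named entity labels."""
    # The first element of the path vector is skipped; the second through the
    # last elements are looked up in the label list.
    return [label_names[label_id] for label_id in path[1:]]


# Hypothetical label list and path vector.
names = ["O", "B-symptom", "I-symptom"]
print(labels_from_path([0, 1, 2, 0], names))  # ['B-symptom', 'I-symptom', 'O']
```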

8. A medical named entity recognition apparatus, comprising:

the score prediction module is used for inputting the medical Chinese sequence to be predicted into a preset model to obtain a character transfer matrix and a label transfer matrix, wherein the character transfer matrix is used for representing the probability that characters in the medical Chinese sequence to be predicted are marked as each named entity label, the label transfer matrix is used for representing the probability of mutual transfer among the named entity labels, the named entity labels are labels corresponding to named entity categories, and the named entity categories comprise: orientation, anatomical region, symptom, observation object, attribute;

the label determining module is used for determining a plurality of valid tags from manually labeled named entity identification data, wherein the valid tags are a set of named entity labels having transfer relations in the medical Chinese sequence to be predicted;

the backtracking module is used for taking a column number corresponding to the forward score with the largest value in the last column of the forward score matrix as the value of the last element of the path vector, and sequentially determining the value of each element in the path vector from back to front based on the value of the last element of the path vector and the value of the target element in the backtracking record matrix;

and the labeling module is used for labeling the medical Chinese sequence to be predicted according to the sequence number corresponding to the named entity label in the path vector and determining the named entity result of the medical Chinese sequence to be predicted.

9. An electronic device, comprising: a processor, a storage medium and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor and the storage medium communicating via the bus when the electronic device is operated, the processor executing the machine-readable instructions to perform the steps of the medical named entity recognition method according to any one of claims 1 to 7.

10. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, performs the steps of the medical named entity recognition method according to any one of claims 1 to 7.