CN116341557A - Diabetes medical text named entity recognition method - Google Patents

Diabetes medical text named entity recognition method

Info

Publication number
CN116341557A
Authority
CN
China
Prior art keywords
text
entity
information
local
diabetes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202310616459.3A
Other languages
Chinese (zh)
Inventor
石琳 (Shi Lin)
邹先明 (Zou Xianming)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China University of Science and Technology
Original Assignee
North China University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China University of Science and Technology
Priority to CN202310616459.3A
Publication of CN116341557A
Legal status: Withdrawn


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/169 Annotation, e.g. comment data or footnotes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H70/00 ICT specially adapted for the handling or processing of medical references
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Epidemiology (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for recognizing named entities in diabetes medical texts and relates to the field of electric digital data processing. After a corpus of diabetes-related medical texts is obtained, the corpus is preprocessed and its entities are annotated. Word-level text vectors are extracted with RoBERTa-wwm and fed into a local context awareness module, which fuses multi-window attention with multi-scale residual convolution to capture information containing locally important features; the multi-window attention effectively captures the important local semantic components under windows of different sizes. A self-attention mechanism is added to overcome the limited ability of a bidirectional gated recurrent unit to capture long-distance dependencies and to obtain global semantic information. Decoding is finally performed with a conditional random field. The invention establishes a standard dataset for diabetes medical text named entity recognition, and the proposed method effectively resolves the poor recognition caused by ambiguity and entity co-reference in diabetes medical texts.

Description

Diabetes medical text named entity recognition method
Technical Field
The invention relates to the field of electric digital data processing, in particular to a method for recognizing named entities in diabetes medical texts.
Background
Diabetes is a common metabolic disease that severely endangers human health. With the popularity of electronic health record systems, the volume of medical text data has grown explosively. These texts contain a large amount of diabetes-related medical information, such as symptoms, diagnoses, and treatments. Diabetes medical text information processing is therefore one of the hot spots of current research.
Named entity recognition (NER) is an important task in the field of diabetes medical text information processing. Its purpose is to automatically identify diabetes-related entities, such as disease names, drug names, and surgery names, from text.
In recent years, NER research has followed four main approaches: rule-based methods, statistical machine learning methods, deep learning methods, and NER methods based on pre-trained models.
The earliest NER systems were rule- and dictionary-based: domain experts manually constructed rule templates and text was processed by pattern matching. Such methods require substantial manual effort and cannot be ported across domains. They were followed by statistical machine learning methods, including Hidden Markov Models (HMMs), Maximum Entropy Models (MEMs), Support Vector Machines (SVMs), and Conditional Random Fields (CRFs). Deep learning has since made significant progress in natural language processing, and particularly in the NER task. Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are currently the most common neural network models in the NER field. In addition, pre-trained models are widely used for NER: trained on large-scale corpora, they learn a language model and acquire rich representational capability. In diabetes medical NER, much of the specialized vocabulary exhibits ambiguity and entity co-reference (multiple expressions referring to the same entity), yet most NER methods feed the generated character vectors directly into a Bidirectional Long Short-Term Memory network (BiLSTM) or a Bidirectional Gated Recurrent Unit (BiGRU) to obtain global features, without considering locally optimal features. Recognition accuracy on specialized medical vocabulary is therefore low, the methods underperform, and the development of Chinese diabetes medical entity recognition is hindered.
Disclosure of Invention
In view of these problems, the invention aims to overcome the defects of the prior art and provide a diabetes medical text named entity recognition method that improves the accuracy of recognizing named entities in diabetes medical texts and increases the effectiveness of named entity recognition in the diabetes medical field.
To that end, the invention provides a diabetes medical text named entity recognition method comprising:
Step 1: acquire corpora of diabetes-related medical texts from the network and preprocess them to construct a corpus. Preprocessing comprises segmenting the unlabeled text into words and removing stop words.
Step 2: annotate the data in the corpus with the BIOES tagging scheme to generate a dataset.
Step 3: obtain word-level semantic vectors with the RoBERTa-wwm pre-trained model.
Step 4: feed the word-level semantic vectors output in step 3 into a local context awareness module to capture local feature information of the text.
Step 5: feed the local feature information output in step 4 into a BiGRU combined with a self-attention layer to capture global feature information.
Step 6: from the global feature information of step 5, learn the dependencies between adjacent labels with a conditional random field to obtain the optimal label sequence, completing recognition of diabetes medical text entities.
According to this diabetes medical text named entity recognition method, after diabetes-related medical text corpora are obtained from the network, they are processed into a dataset whose entities are annotated with the BIOES scheme. Word vectors of the text data are extracted with the RoBERTa-wwm pre-trained model; the local context awareness module effectively extracts local information; a BiGRU layer with an added self-attention mechanism captures the global feature information of the text; finally, the global feature information is fed into a conditional random field for decoding, which outputs the most probable tag sequence and yields the tag class of each character. This improves the accuracy of recognizing named entities in diabetes medical texts and the effectiveness of named entity recognition in the diabetes medical field.
Preferably, in step 2, the medical entities are divided into six classes: examination index, drug name, adverse reaction, body part, surgery, and disease. Entities in the dataset are annotated with the BIOES scheme: B marks the beginning of an entity, I its interior, E its end, S a single character that is itself an entity, and O a non-entity character belonging to no type.
Preferably, in step 3, the RoBERTa-wwm pre-trained model learns language representations and word-level semantic representations to the greatest possible extent.
Preferably, the RoBERTa-wwm pre-trained model combines RoBERTa with Chinese whole word masking. RoBERTa-wwm replaces BERT's static mask with a dynamic mask that randomly selects different words to mask on each pass, increasing the randomness of the model input and letting the model learn more diverse language representations. It also replaces BERT's single-character mask with a whole word mask, which masks entire words instead of single characters and improves the model's understanding of vocabulary.
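The contrast can be made concrete with a short Python sketch. This is a simplified illustration rather than the actual pre-training code: jieba is assumed as the word segmenter, and the 80/10/10 replacement rule applied in real RoBERTa-wwm pre-training is omitted:

```python
import random
import jieba  # assumed segmenter for determining whole-word boundaries

MASK = "[MASK]"

def whole_word_mask(text: str, mask_prob: float = 0.15) -> list[str]:
    """Simplified whole-word masking: a selected word is masked in full,
    never character by character. Real RoBERTa-wwm additionally applies
    the 80/10/10 [MASK]/random/keep rule on subword ids."""
    out = []
    for word in jieba.lcut(text):
        if random.random() < mask_prob:
            out.extend([MASK] * len(word))  # mask every character of the word
        else:
            out.extend(list(word))
    return out

# "血糖" (blood glucose) is either fully masked or fully kept, never half-masked.
print(whole_word_mask("患者血糖升高", mask_prob=0.5))
```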
Preferably, in step 4, the local context awareness module is formed by combining a multi-window attention mechanism with a multi-scale residual convolutional neural network. The multi-window attention mechanism effectively captures the important local semantic components under windows of different sizes. The resulting semantic vectors are input into the multi-scale residual convolutional network, where several convolution kernels of different sizes improve the CNN's perception of local features and fully extract local feature information at different scales; a residual structure fuses the semantic information across scales. The fused semantic information is then fed into the BiGRU combined with the self-attention layer for training.
Preferably, in step 5, the BiGRU combined with the self-attention layer captures global feature information, and the self-attention mechanism screens the key information in the input text, alleviating the long-distance dependency problem of the BiGRU layer. The BiGRU layer includes an update gate and a reset gate. The update gate $z_t$ controls the gating state to decide how much of the previous hidden state $h_{t-1}$ is carried into the current state $h_t$, and selectively receives information from the candidate state $\tilde{h}_t$; the reset gate $r_t$ governs how the candidate state $\tilde{h}_t$ is fused with the previous state $h_{t-1}$. The states of a GRU unit are computed as:

$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$$

$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$$

$$\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t])$$

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

where $x_t$ is the input at the current time step $t$, the $W$ are weight matrices, $\sigma$ is the sigmoid function, and $\tanh$ is the hyperbolic tangent function.
Preferably, in step 6, the conditional random field uses the learned relationships between adjacent labels to compute the probability of each tag sequence and outputs the most probable one, thereby determining the tag class of each character.
Preferably, given an input sequence x, the probability score of an output tag sequence y is:

$$s(x, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

where A is the transition matrix, $A_{y_{i-1}, y_i}$ is the score of transitioning from label $y_{i-1}$ to label $y_i$, and $P_{i, y_i}$ is the score of the i-th character of input sequence x taking label $y_i$. The conditional probability of a sequence y is:

$$P(y \mid x) = \frac{\exp(s(x, y))}{\sum_{\tilde{y} \in Y_x} \exp(s(x, \tilde{y}))}$$

During training of the conditional random field, maximum likelihood is used to maximize the probability of the correct label sequence y':

$$\log P(y' \mid x) = s(x, y') - \log \sum_{\tilde{y} \in Y_x} \exp(s(x, \tilde{y}))$$

Finally, the Viterbi algorithm obtains the highest-scoring label sequence, the globally optimal output of the conditional random field:

$$y^{*} = \arg\max_{\tilde{y} \in Y_x} s(x, \tilde{y})$$
The diabetes medical named entity recognition method provided by the invention has the following advantages:
The invention provides a diabetes medical text named entity recognition method that combines local and global feature extraction. Pre-acquired diabetes medical texts are segmented into words, stop words are removed, and a dataset is built; the diabetes medical entities in the dataset are annotated in the BIOES scheme; the text data are fed into the pre-trained model RoBERTa-wwm to obtain text vectors with word-level semantic representations; and a local context awareness module extracts multi-scale local features of the diabetes medical text. Applying a multi-window attention mechanism to the word-level text vectors effectively captures the important semantic components of local features under windows of different sizes. The convolution layer improves the CNN's local feature perception by using several convolution kernels of different sizes, computed efficiently in parallel, to fully extract local feature information at different scales; a residual structure then fuses the semantic information across scales. The fused local semantic information is fed into a BiGRU combined with a self-attention layer; the added self-attention overcomes the BiGRU layer's limited capture of long-distance dependencies and yields global semantic information. Finally, a Conditional Random Field (CRF) captures the correlations between adjacent labels in the sequence to obtain the optimal label sequence. By obtaining word-level semantic vectors from the RoBERTa-wwm pre-trained model and combining the local context awareness module with a self-attentive BiGRU, the method effectively accounts for both local and global features. Compared with the various improvements on traditional mainstream methods, it resolves the poor recognition caused by ambiguity and entity co-reference in diabetes medical texts.
Drawings
FIG. 1 is a flowchart of a method for identifying a named entity of a diabetes medical text provided by an embodiment of the invention;
FIG. 2 is a schematic diagram of a multi-scale residual convolutional network according to an embodiment of the invention;
FIG. 3 is a model framework diagram of the diabetes medical text named entity recognition method according to an embodiment of the invention.
Detailed Description
The invention will now be described in more detail with reference to the drawings and specific embodiments. The invention provides a diabetes medical text named entity recognition method for the diabetes medical field, addressing the poor recognition that ambiguity and entity co-reference among diabetes medical entities cause for existing techniques.
Referring to fig. 1, a flowchart of a method for identifying a named entity of a diabetes medical text is provided in an embodiment of the present invention, including the following steps:
step S1, acquiring corpus related to diabetes medical texts in a network, segmenting unlabeled texts, removing stop words and the like, and obtaining a data set.
Because research on named entities in the current diabetes medical field is scarce and annotated datasets are rare, a large number of diabetes-related medical text corpora are obtained from medical websites, medical encyclopedia sites, and the like, and a Chinese diabetes corpus is constructed. The acquired corpora are then cleaned: space characters, line feeds, carriage returns, and tabs are removed, and duplicate content is removed to improve data quality. Sentences in the text are segmented into words: Chinese text is processed with the jieba word segmentation tool so that continuous strings are split into meaningful words. Finally, stop words with no actual meaning, such as "have" and "are", are removed.
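A minimal Python sketch of this cleaning and segmentation step follows; the stop-word set here is illustrative, and in practice a published Chinese stop-word list would be loaded from file:

```python
import re
import jieba

# illustrative stop-word set; a full Chinese stop-word list would be loaded in practice
STOPWORDS = {"的", "了", "是", "在", "有"}

def preprocess(raw: str) -> list[str]:
    """Clean one raw corpus line and segment it, as described in step S1."""
    # data cleaning: strip spaces, line feeds, carriage returns, and tabs
    text = re.sub(r"[ \t\r\n\u3000]+", "", raw)
    # word segmentation with the jieba tool
    words = jieba.lcut(text)
    # remove stop words that carry no actual meaning
    return [w for w in words if w not in STOPWORDS]

print(preprocess("患者 有 多饮、多尿 的 症状\r\n"))
```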
Step S2: annotate the data in the dataset with the BIOES scheme: B marks the beginning of an entity, I its interior, E its end, S a single character that is itself an entity, and O a non-entity character belonging to no type. Annotation divides the medical entities into six classes: examination index, drug name, adverse reaction, body part, surgery, and disease.
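As an illustration of character-level BIOES tagging, assuming "Drug" and "Part" as label strings for the drug-name and body-part classes (the patent does not fix the label names):

```python
# BIOES tagging of the drug-name entity "二甲双胍" (metformin) in a sentence;
# "Drug" is an assumed label string for the drug-name class.
sentence = list("口服二甲双胍治疗")
tags     = ["O", "O", "B-Drug", "I-Drug", "I-Drug", "E-Drug", "O", "O"]

# A single-character entity such as "肝" (liver, a body part) would be S-Part.
for char, tag in zip(sentence, tags):
    print(f"{char}\t{tag}")
```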
Step S3: extract word vectors of the text data with the RoBERTa-wwm pre-trained model. RoBERTa-wwm is structurally similar to BERT and uses the Transformer model, but modifies and adjusts BERT in several ways: it removes BERT's next sentence prediction task, uses a longer maximum sequence length, and dynamically adjusts the batch size. RoBERTa-wwm also uses whole word masking, masking entire words during training rather than randomly selecting sub-words as in the original BERT model. This lets RoBERTa-wwm better handle character-based languages such as Chinese and improves its Chinese representations.
Here, the RoBERTa-wwm model is a pre-trained model that learns feature representations by self-supervision on massive corpora; its RoBERTa features can be embedded as high-quality word representations for natural language processing tasks. Pre-training uses the masked language modeling (MLM) task: 15% of the sequence tokens are randomly masked, and features are extracted by generating vectors embedded in the RoBERTa pre-trained model. Unlike BERT, RoBERTa-wwm uses a larger corpus and longer training time to improve model performance. Relative to BERT's dictionary, RoBERTa-wwm adds special identifiers: [CLS] marks the start of a sentence, [SEP] is a separator, [UNK] an unknown token, and [MASK] a mask token. Under the random masking strategy, of the selected 15% of the sequence, 80% is replaced by [MASK], 10% is replaced by another word from the text sequence, and 10% is left unchanged. The embeddings of the RoBERTa-wwm model comprise token embeddings, segment embeddings, and position embeddings. Token embeddings are produced by tokenizing the text, with [CLS] and [SEP] marking the beginning and end of a text sequence. Segment embeddings distinguish tokens in different sentences and help the model understand relationships between sentences; a segment embedding is a learnable vector indicating whether a token belongs to the preceding or the following sentence. Position embeddings handle the position information of tokens in the sequence; they are similar to those in the Transformer model but differ in implementation, since RoBERTa-wwm's position embeddings are learned rather than using a fixed positional encoding scheme.
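A sketch of the word-vector extraction in step S3, assuming the publicly released hfl/chinese-roberta-wwm-ext checkpoint (the patent names no specific checkpoint; per that model's documentation it is loaded with the BERT classes):

```python
import torch
from transformers import BertModel, BertTokenizer

# assumed checkpoint: a publicly released Chinese RoBERTa-wwm model
name = "hfl/chinese-roberta-wwm-ext"
tokenizer = BertTokenizer.from_pretrained(name)
model = BertModel.from_pretrained(name)

inputs = tokenizer("患者空腹血糖升高", return_tensors="pt")  # adds [CLS]/[SEP]
with torch.no_grad():
    outputs = model(**inputs)

# one contextual vector per token: (batch, 8 chars + [CLS] + [SEP], hidden=768)
print(outputs.last_hidden_state.shape)  # torch.Size([1, 10, 768])
```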
Step S4: feed the word-level semantic information output in step S3 into the local context awareness module to capture local feature information of the text. The local context awareness module encodes the character sequence output by the RoBERTa model while implicitly grouping related characters, capturing correlations in the local context and extracting multi-scale local features of the diabetes text. Each character's input representation is $w = w_{ch}$, with the character embedding $w_{ch} \in \mathbb{R}^{e_{ch}}$.
Here, the convolution window size of the CNN is set to k, and each character embedding includes a position embedding of the same size as the window. The position embedding index ranges from 0 to k-1; its initial value is 1 if the current index corresponds to the character's position in the window, and 0 otherwise. In this way the CNN encodes each character's positional information in context into its embedding vector, capturing the order dependency of characters in the sequence. The embedding dimension is $e_g = e_{ch} + e_{pos}$, where $e_{ch}$ is the dimension of the CNN character vector and $e_{pos}$ that of the position information vector. To capture the semantic relationship between the center character and surrounding characters, CNNs with different convolution window sizes are combined with a multi-window attention mechanism. This effectively focuses on the local context of each character and strengthens the semantic relation between the center character and its neighbors.
In the multi-window attention layer, with window size k centered on character j, the center character and its surrounding characters are the inputs $w_{j-(k-1)/2}, \ldots, w_j, \ldots, w_{j+(k-1)/2}$, which ultimately produce k hidden vectors $h_{j-(k-1)/2}, \ldots, h_j, \ldots, h_{j+(k-1)/2}$ of length $e_{ch}$, computed as:

$$h_m = a_m w_m, \quad m \in \{j-(k-1)/2, \ldots, j+(k-1)/2\}$$

where $a_m$ is the attention weight, computed as:

$$a_m = \frac{\exp(\mathrm{score}(w_j, w_m))}{\sum_{n} \exp(\mathrm{score}(w_j, w_n))}$$

with the score function:

$$\mathrm{score}(w_j, w_m) = v^{\top} \tanh(W_1 w_j + W_2 w_m)$$

where $v \in \mathbb{R}^{e_{ch}}$ and $W_1, W_2 \in \mathbb{R}^{e_{ch} \times e_g}$, $e_g$ being the embedding dimension. The resulting vector sequence $h_{j-(k-1)/2 : j+(k-1)/2}$ is passed through convolution operations with different kernel sizes, and the extracted local semantic features are expressed as:

$$c = \mathrm{CNN}_k\big(h_{j-(k-1)/2 : j+(k-1)/2}\big)$$
Here, the multi-scale residual convolutional network structure fuses multi-scale local context information to obtain more effective feature information, and ensures that no network degradation occurs as network depth increases; see fig. 2. Except for the first CNN layer, the input of each CNN layer is the fused feature vector obtained by residual-connecting the input and output of the previous layer; finally the feature vectors output by all CNN layers are spliced to give the output of the local context awareness module, where "+" denotes the splicing operation: $C = c_1 + c_2 + \ldots + c_i$.
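A minimal PyTorch sketch of this multi-scale residual convolution, with illustrative kernel sizes, dimensions, and a ReLU nonlinearity, none of which the patent fixes:

```python
import torch
import torch.nn as nn

class MultiScaleResidualCNN(nn.Module):
    """Stacked 1-D convolutions with different kernel sizes. Each layer
    after the first receives the residual fusion of the previous layer's
    input and output; all per-scale outputs are spliced at the end."""

    def __init__(self, dim: int, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, k, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x.transpose(1, 2)          # (batch, seq, dim) -> (batch, dim, seq)
        scale_outputs = []
        for conv in self.convs:
            c = torch.relu(conv(h))
            scale_outputs.append(c)
            h = h + c                  # residual fusion feeds the next layer
        # splice: C = c1 + c2 + ... + ci (concatenation along the feature axis)
        out = torch.cat(scale_outputs, dim=1)
        return out.transpose(1, 2)     # (batch, seq, dim * n_scales)

x = torch.randn(2, 20, 128)                 # e.g. 20 characters, embedding 128
print(MultiScaleResidualCNN(128)(x).shape)  # torch.Size([2, 20, 384])
```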
Step S5: feed the semantic information output by the local context awareness module in step S4 into the BiGRU combined with the self-attention layer to capture global feature information.
The BiGRU layer adopts a bidirectional GRU neural network consisting of a forward hidden layer and a backward hidden layer. It simultaneously obtains two different vector representations of the input at the current moment and combines them into the current input $d = [d_1, d_2, \ldots, d_n]$, performing deep text feature extraction and better understanding contextual dependencies. The GRU contains two gating units, an update gate and a reset gate, where the update gate replaces the input and forget gates of the LSTM. The update gate $z_t$ controls the gating state to decide how much of the previous hidden state $h_{t-1}$ is carried into the current state $h_t$, and selectively receives information from the candidate state $\tilde{h}_t$; the reset gate $r_t$ governs how the candidate state $\tilde{h}_t$ is fused with the previous state $h_{t-1}$. The states of a GRU unit are computed as:

$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$$

$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$$

$$\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t])$$

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
Further, a self-attention mechanism is employed to address the BiGRU layer's limited capture of long-distance dependencies. Like the self-attention in the Transformer, it attends only to relationships between characters within the input sequence, finding the relations between different characters and selecting the most representative and critical words and phrases. It is computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where Q is the query matrix, K the key matrix, V the value matrix, and $d_k$ the dimension of Q and K. The attention mechanism computes similarity scores between the query matrix Q and all key matrices K and scales them by dividing by the factor $\sqrt{d_k}$ so that they do not grow too large in high dimensions, avoiding numerical problems in the softmax computation. The scores are then converted into normalized weights with the softmax function and applied to the value matrix V to obtain weighted vector representations, which express the importance of different parts of the input sequence.
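A minimal PyTorch sketch of the step S5 block, using single-head scaled dot-product self-attention over the BiGRU outputs (the dimensions are illustrative; the patent specifies none):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiGRUSelfAttention(nn.Module):
    """Bidirectional GRU followed by scaled dot-product self-attention."""

    def __init__(self, in_dim: int, hidden: int):
        super().__init__()
        self.bigru = nn.GRU(in_dim, hidden, batch_first=True, bidirectional=True)
        d = 2 * hidden  # forward and backward hidden states concatenated
        self.q, self.k, self.v = (nn.Linear(d, d) for _ in range(3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.bigru(x)                      # (batch, seq, 2*hidden)
        q, k, v = self.q(h), self.k(h), self.v(h)
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
        scores = q @ k.transpose(1, 2) / q.size(-1) ** 0.5
        return F.softmax(scores, dim=-1) @ v      # (batch, seq, 2*hidden)

x = torch.randn(2, 20, 384)                  # local features from step S4
print(BiGRUSelfAttention(384, 128)(x).shape) # torch.Size([2, 20, 256])
```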
Step S6: send the global semantic information output in step S5 into a conditional random field to obtain the optimal tag sequence. Given an input sequence x, the probability score of an output tag sequence y under the conditional random field is:

$$s(x, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

where A is the transition matrix, $A_{y_{i-1}, y_i}$ is the score of transitioning from $y_{i-1}$ to $y_i$, and $P_{i, y_i}$ is the score of the i-th character of input sequence x taking label $y_i$. The conditional probability of a sequence y is:

$$P(y \mid x) = \frac{\exp(s(x, y))}{\sum_{\tilde{y} \in Y_x} \exp(s(x, \tilde{y}))}$$

During training of the conditional random field, maximum likelihood is used to maximize the probability of the correct label sequence y':

$$\log P(y' \mid x) = s(x, y') - \log \sum_{\tilde{y} \in Y_x} \exp(s(x, \tilde{y}))$$

Finally, the Viterbi algorithm obtains the highest-scoring label sequence, the globally optimal output of the conditional random field:

$$y^{*} = \arg\max_{\tilde{y} \in Y_x} s(x, \tilde{y})$$
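CRF training and decoding are commonly delegated to a library such as pytorch-crf; the standalone NumPy sketch below shows only the Viterbi decoding step, with random scores standing in for a trained model's emission matrix P and transition matrix A:

```python
import numpy as np

def viterbi_decode(emissions: np.ndarray, transitions: np.ndarray) -> list[int]:
    """Find the highest-scoring tag sequence y* = argmax_y s(x, y).

    emissions:   (seq_len, n_tags) per-character tag scores P[i, y_i]
    transitions: (n_tags, n_tags)  scores A[y_{i-1}, y_i]
    """
    seq_len, n_tags = emissions.shape
    score = emissions[0].copy()       # best score ending in each tag so far
    backpointers = []
    for i in range(1, seq_len):
        # score of extending every previous tag to every current tag
        total = score[:, None] + transitions + emissions[i][None, :]
        backpointers.append(total.argmax(axis=0))
        score = total.max(axis=0)
    # follow the back-pointers from the best final tag
    best = [int(score.argmax())]
    for bp in reversed(backpointers):
        best.append(int(bp[best[-1]]))
    return best[::-1]

emissions = np.random.randn(8, 25)    # 8 characters, 25 BIOES tags (dummy)
transitions = np.random.randn(25, 25)
print(viterbi_decode(emissions, transitions))
```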
Experimental verification:
In this embodiment, the overall framework of the proposed model is shown in fig. 3. The model first extracts word vectors of the text data with the RoBERTa-wwm pre-trained model; the local context awareness module then effectively extracts local information from the word vectors output by RoBERTa-wwm; a BiGRU combined with a self-attention layer captures the global feature information of the text; finally, the global feature information is fed into a conditional random field for decoding, which outputs the most probable tag sequence and yields the tag class of each character. The model was trained on the annotated data of this embodiment alongside existing models; Table 1 compares their performance:
TABLE 1. Performance comparison of different models

Model                    P (%)    R (%)    F1 (%)
BiLSTM-CRF               73.63    74.65    74.14
BERT-CRF                 79.19    81.94    80.55
BERT-BiGRU-CRF           80.67    84.03    82.31
BERT-BiLSTM-CRF          81.33    84.72    82.99
BERT-BiGRU-IDCNN-CRF     88.43    82.29    85.25
BERT-BiLSTM-IDCNN-CRF    87.22    80.56    83.75
Ours                     91.54    86.46    88.93
The experiments use the three standard evaluation metrics for named entity recognition: precision (P), recall (R), and F1 score (F1), computed as:

P = Tp / (Tp + Fp)

R = Tp / (Tp + Fn)

F1 = 2PR / (P + R)

where Tp (true positives) is the number of positive samples predicted as positive, Fp (false positives) the number of negative samples predicted as positive, and Fn (false negatives) the number of positive samples predicted as negative. The experimental results show that the model proposed in this embodiment fully considers both local and global feature extraction: the local feature awareness module extracts local information from sentences, and a self-attention mechanism added on top of the BiGRU layer corrects the BiGRU's unreasonable weight distribution over characters. As Table 1 shows, the proposed model outperforms the comparison models on every metric, indicating a good recognition effect in the diabetes medical field and resolving the poor recognition caused by ambiguity and entity co-reference in diabetes medical texts.
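A minimal computation of the three metrics; the counts below are invented solely to exercise the formulas and do not correspond to Table 1:

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from entity-level counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

p, r, f1 = prf1(tp=90, fp=10, fn=15)  # dummy counts for illustration
print(f"P={p:.2%}  R={r:.2%}  F1={f1:.2%}")  # P=90.00%  R=85.71%  F1=87.80%
```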
In the description of the above embodiments, the various technical features may be combined in any non-contradictory manner. For brevity, not all combinations of features are described for each embodiment; as long as a combination of technical features is not contradictory, it should be considered within the scope of this specification.
The above examples merely illustrate the invention more clearly and do not limit its embodiments. Those skilled in the art can make changes or modifications in form based on the above description; this specification does not exhaustively enumerate all embodiments. Such modifications, substitutions, and improvements are intended to fall within the spirit and scope of the invention as defined by the claims.

Claims (8)

1. A diabetes medical text named entity recognition method, characterized by comprising the following steps:
step 1, acquiring corpora of diabetes-related medical texts from the network and preprocessing them to construct a corpus, the preprocessing comprising segmenting the unlabeled text into words and removing stop words to obtain the corpus;
step 2, annotating the data in the corpus with the BIOES tagging scheme to generate a dataset;
step 3, obtaining word-level semantic vectors with a RoBERTa-wwm pre-trained model;
step 4, inputting the word-level semantic vectors obtained by RoBERTa-wwm in step 3 into a local context awareness module to capture local feature information of the text;
step 5, feeding the local feature information obtained by the local context awareness module in step 4 into a BiGRU combined with a self-attention layer to capture global feature information; and
step 6, for the global feature information output by the BiGRU combined with the self-attention layer, learning the dependencies between adjacent labels with a conditional random field to obtain the optimal label sequence, thereby completing recognition of diabetes medical text entities.
2. The diabetes medical text named entity recognition method according to claim 1, wherein step 1 comprises cleaning the acquired corpora, including removing space characters, line feeds, carriage returns, and tabs, and removing duplicate content; segmenting sentences in the text into words; and removing stop words.
3. The diabetes medical text named entity recognition method according to claim 1, wherein annotating, in step 2, the corpus data constructed in step 1 comprises dividing the medical entities into six classes: examination index, drug name, adverse reaction, body part, surgery, and disease; and annotating the entities in the dataset with the BIOES scheme, wherein B marks the beginning of an entity, I marks its interior, E marks its end, S marks a single character that is itself an entity, and O marks a non-entity character belonging to no type.
4. The diabetes medical text named entity recognition method according to claim 1, wherein the RoBERTa-wwm pre-trained model in step 3 is used to learn multiple language representations and word-level semantic representations.
5. The diabetes medical text named entity recognition method according to claim 1, wherein the local context awareness module in step 4 is formed by combining a multi-window attention mechanism with a multi-scale residual convolutional neural network, the multi-window attention mechanism effectively capturing the important semantic components of local features under windows of different sizes; the obtained important semantic components of the local features are input into the multi-scale residual convolutional network, which extracts local feature information at different scales through several convolution kernels of different sizes and fuses the local semantic information across scales with a residual structure; specifically, except for the first CNN layer, the input of each CNN layer is the fused feature vector obtained by residual-connecting the input and output of the previous layer, and finally the feature vectors output by all CNN layers are spliced to obtain the semantic information of the local context awareness module; this semantic information is input into the BiGRU combined with the self-attention layer for training.
6. The diabetes medical text named entity recognition method according to claim 1, wherein in step 5 the BiGRU combined with the self-attention layer is used to capture global feature information, the self-attention mechanism screening the key information in the input text and alleviating the long-distance dependency problem of the BiGRU layer; the BiGRU layer comprises an update gate and a reset gate, the update gate $z_t$ controlling the gating state to decide how much of the previous hidden state $h_{t-1}$ is carried into the current state $h_t$ and selectively receiving information from the candidate state $\tilde{h}_t$, and the reset gate $r_t$ governing how the candidate state $\tilde{h}_t$ is fused with the previous state $h_{t-1}$; each state of a GRU unit in the BiGRU layer is computed as:

$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$$

$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$$

$$\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t])$$

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

wherein $x_t$ is the input at the current time step $t$, $W$ denotes a weight matrix, $\sigma$ is the sigmoid function, and $\tanh$ is the hyperbolic tangent function.
7. The diabetes medical text named entity recognition method according to claim 1, wherein in step 6 the conditional random field computes the probability of each tag sequence using the learned relationships between adjacent labels and outputs the most probable tag sequence, thereby determining the tag class of each character.
8. The diabetes medical text named entity recognition method according to claim 7, wherein, given an input sequence x, the probability score of an output tag sequence y is:

$$s(x, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

wherein A is the transition matrix, $A_{y_{i-1}, y_i}$ is the score of transitioning from label $y_{i-1}$ to $y_i$, and $P_{i, y_i}$ is the score of the i-th character of input sequence x taking label $y_i$; the conditional probability of a sequence y is:

$$P(y \mid x) = \frac{\exp(s(x, y))}{\sum_{\tilde{y} \in Y_x} \exp(s(x, \tilde{y}))}$$

during training of the conditional random field, maximum likelihood is used to maximize the probability of the correct label sequence y':

$$\log P(y' \mid x) = s(x, y') - \log \sum_{\tilde{y} \in Y_x} \exp(s(x, \tilde{y}))$$

finally, the Viterbi algorithm obtains the highest-scoring label sequence, the globally optimal output of the conditional random field:

$$y^{*} = \arg\max_{\tilde{y} \in Y_x} s(x, \tilde{y})$$
CN202310616459.3A 2023-05-29 2023-05-29 Diabetes medical text named entity recognition method Withdrawn CN116341557A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310616459.3A CN116341557A (en) 2023-05-29 2023-05-29 Diabetes medical text named entity recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310616459.3A CN116341557A (en) 2023-05-29 2023-05-29 Diabetes medical text named entity recognition method

Publications (1)

Publication Number Publication Date
CN116341557A true CN116341557A (en) 2023-06-27

Family

ID=86888057

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310616459.3A Withdrawn CN116341557A (en) 2023-05-29 2023-05-29 Diabetes medical text named entity recognition method

Country Status (1)

Country Link
CN (1) CN116341557A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116756624A (en) * 2023-08-17 2023-09-15 中国民用航空飞行学院 Text classification method for civil aviation supervision item inspection record processing
CN117973393A (en) * 2024-03-28 2024-05-03 苏州系统医学研究所 Accurate semantic comparison method and system for key medical information in medical text
CN118133830A (en) * 2024-04-30 2024-06-04 北京壹永科技有限公司 Named entity recognition method, named entity recognition device, named entity recognition equipment and named entity recognition computer readable storage medium
CN118194870A (en) * 2024-05-16 2024-06-14 中南大学 Chinese medicine named entity recognition method, device, electronic equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800411A (en) * 2018-12-03 2019-05-24 哈尔滨工业大学(深圳) Clinical treatment entity and its attribute extraction method
CN111626056A (en) * 2020-04-11 2020-09-04 中国人民解放军战略支援部队信息工程大学 Chinese named entity identification method and device based on RoBERTA-BiGRU-LAN model
CN112560478A (en) * 2020-12-16 2021-03-26 武汉大学 Chinese address RoBERTA-BilSTM-CRF coupling analysis method using semantic annotation
CN114169330A (en) * 2021-11-24 2022-03-11 匀熵教育科技(无锡)有限公司 Chinese named entity identification method fusing time sequence convolution and Transformer encoder
CN115329765A (en) * 2022-08-12 2022-11-11 江西理工大学 Method and device for identifying risks of listed enterprises, electronic equipment and storage medium
US20230053148A1 (en) * 2021-08-11 2023-02-16 Tencent America LLC Extractive method for speaker identification in texts with self-training
CN115859978A (en) * 2022-11-08 2023-03-28 浙江科技学院 Named entity recognition model and method based on Roberta radical enhanced adapter

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800411A (en) * 2018-12-03 2019-05-24 哈尔滨工业大学(深圳) Clinical treatment entity and its attribute extraction method
CN111626056A (en) * 2020-04-11 2020-09-04 中国人民解放军战略支援部队信息工程大学 Chinese named entity identification method and device based on RoBERTA-BiGRU-LAN model
CN112560478A (en) * 2020-12-16 2021-03-26 武汉大学 Chinese address RoBERTA-BilSTM-CRF coupling analysis method using semantic annotation
US20230053148A1 (en) * 2021-08-11 2023-02-16 Tencent America LLC Extractive method for speaker identification in texts with self-training
CN114169330A (en) * 2021-11-24 2022-03-11 匀熵教育科技(无锡)有限公司 Chinese named entity identification method fusing time sequence convolution and Transformer encoder
CN115329765A (en) * 2022-08-12 2022-11-11 江西理工大学 Method and device for identifying risks of listed enterprises, electronic equipment and storage medium
CN115859978A (en) * 2022-11-08 2023-03-28 浙江科技学院 Named entity recognition model and method based on Roberta radical enhanced adapter

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LIN SHI ET AL.: "Uniting Multi-Scale Local Feature Awareness and the Self-Attention Mechanism for Named Entity Recognition", Mathematics, pages 1-5 *
LI YUNXIANG ET AL.: "Chinese Named Entity Recognition Based on RoBERTa-WWM-BiGRU-CRF" (基于RoBERTa-WWM-BiGRU-CRF的中文命名实体识别), Journal of Nanning Normal University (Natural Science Edition), vol. 40, no. 1, pages 72-78 *
JI CHUAN: "Research and Application of Similar Text Matching Technology Based on Deep Learning" (基于深度学习的相似文本匹配技术研究与应用), China Master's Theses Full-text Database, Information Science and Technology, no. 01, page 1 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116756624A (en) * 2023-08-17 2023-09-15 中国民用航空飞行学院 Text classification method for civil aviation supervision item inspection record processing
CN116756624B (en) * 2023-08-17 2023-12-12 中国民用航空飞行学院 Text classification method for civil aviation supervision item inspection record processing
CN117973393A (en) * 2024-03-28 2024-05-03 苏州系统医学研究所 Accurate semantic comparison method and system for key medical information in medical text
CN117973393B (en) * 2024-03-28 2024-06-07 苏州系统医学研究所 Accurate semantic comparison method and system for key medical information in medical text
CN118133830A (en) * 2024-04-30 2024-06-04 北京壹永科技有限公司 Named entity recognition method, named entity recognition device, named entity recognition equipment and named entity recognition computer readable storage medium
CN118194870A (en) * 2024-05-16 2024-06-14 中南大学 Chinese medicine named entity recognition method, device, electronic equipment and storage medium
CN118194870B (en) * 2024-05-16 2024-08-13 中南大学 Chinese medicine named entity recognition method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
Zhu et al. CAN-NER: Convolutional attention network for Chinese named entity recognition
WO2021139424A1 (en) Text content quality evaluation method, apparatus and device, and storage medium
CN116341557A (en) Diabetes medical text named entity recognition method
Campos et al. Biomedical named entity recognition: a survey of machine-learning tools
CN106980608A (en) A kind of Chinese electronic health record participle and name entity recognition method and system
CN111160031A (en) Social media named entity identification method based on affix perception
CN113190656B (en) Chinese named entity extraction method based on multi-annotation frame and fusion features
CN106980609A (en) A kind of name entity recognition method of the condition random field of word-based vector representation
CN111460824B (en) Unmarked named entity identification method based on anti-migration learning
CN112151183A (en) Entity identification method of Chinese electronic medical record based on Lattice LSTM model
CN111243699A (en) Chinese electronic medical record entity extraction method based on word information fusion
CN116432655B (en) Method and device for identifying named entities with few samples based on language knowledge learning
Arvanitis et al. Translation of sign language glosses to text using sequence-to-sequence attention models
CN114970536B (en) Combined lexical analysis method for word segmentation, part-of-speech tagging and named entity recognition
CN115238693A (en) Chinese named entity recognition method based on multi-word segmentation and multi-layer bidirectional long-short term memory
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN111523320A (en) Chinese medical record word segmentation method based on deep learning
CN117217233A (en) Text correction and text correction model training method and device
CN113536799B (en) Medical named entity recognition modeling method based on fusion attention
CN113191150B (en) Multi-feature fusion Chinese medical text named entity identification method
Shi et al. Understanding patient query with weak supervision from doctor response
CN114970537B (en) Cross-border ethnic cultural entity relation extraction method and device based on multi-layer labeling strategy
Zhang et al. Medical named entity recognition based on overlapping neural networks
CN115859978A (en) Named entity recognition model and method based on Roberta radical enhanced adapter
Cai et al. HCADecoder: a hybrid CTC-attention decoder for chinese text recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20230627