CN116341557A - Diabetes medical text named entity recognition method - Google Patents

Diabetes medical text named entity recognition method

Info

Publication number
CN116341557A
Authority
CN
China
Prior art keywords
text
entity
information
local
diabetes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202310616459.3A
Other languages
Chinese (zh)
Inventor
石琳 (Shi Lin)
邹先明 (Zou Xianming)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China University of Science and Technology
Original Assignee
North China University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China University of Science and Technology
Priority to CN202310616459.3A
Publication of CN116341557A
Legal status: Withdrawn


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/169 Annotation, e.g. comment data or footnotes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H70/00 ICT specially adapted for the handling or processing of medical references
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Epidemiology (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for recognizing named entities in diabetes medical texts and relates to the field of electric digital data processing. After a corpus of diabetes-related medical texts is obtained, the corpus is preprocessed and its entities are annotated. Word-level text vectors are extracted with RoBERTa-wwm and fed into a local context awareness module, which fuses multi-window attention with multi-scale residual convolution to capture information containing locally important features; the multi-window attention effectively captures the important local semantic components under windows of different sizes. A self-attention mechanism is added to overcome the limited ability of a bidirectional gated recurrent unit to capture long-distance dependencies and to obtain global semantic information. Decoding is finally performed with a conditional random field. The invention establishes a standard dataset for diabetes medical text named entity recognition, and the proposed method effectively resolves the poor recognition caused by ambiguity and entity co-reference in diabetes medical texts.

Description

Diabetes medical text named entity recognition method
Technical Field
The invention relates to the field of electric digital data processing, in particular to a method for recognizing named entities in diabetes medical texts.
Background
Diabetes is a common metabolic disease that severely endangers human health. With the popularity of electronic health record systems, the volume of medical text data has grown explosively. These texts contain a large amount of diabetes-related medical information, such as symptoms, diagnoses, and treatments. Diabetes medical text information processing is therefore one of the hot spots of current research.
Named entity recognition (NER) is an important task in the field of diabetes medical text information processing. Its purpose is to automatically identify diabetes-related entities, such as disease names, drug names, and surgery names, from text.
In recent years, NER research has followed four main approaches: rule-based methods, statistical machine learning methods, deep learning methods, and NER methods based on pre-trained models.
The earliest NER systems were rule- and dictionary-based: domain experts manually constructed rule templates and text was processed by pattern matching. Such methods require substantial manual effort and cannot be ported across domains. They were followed by statistical machine learning methods, including Hidden Markov Models (HMMs), Maximum Entropy Models (MEMs), Support Vector Machines (SVMs), and Conditional Random Fields (CRFs). Deep learning has since made significant progress in natural language processing, and particularly in the NER task. Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are currently the most common neural network models in the NER field. In addition, pre-trained models are widely used for NER: trained on large-scale corpora, they learn a language model and acquire rich representational capability. In diabetes medical NER, much of the specialized vocabulary exhibits ambiguity and entity co-reference (multiple expressions referring to the same entity), yet most NER methods feed the generated character vectors directly into a Bidirectional Long Short-Term Memory network (BiLSTM) or a Bidirectional Gated Recurrent Unit (BiGRU) to obtain global features, without considering locally optimal features. Recognition accuracy on specialized medical vocabulary is therefore low, the methods underperform, and the development of Chinese diabetes medical entity recognition is hindered.
Disclosure of Invention
In view of these problems, the invention aims to overcome the defects of the prior art and provide a diabetes medical text named entity recognition method that improves the accuracy of recognizing named entities in diabetes medical texts and increases the effectiveness of named entity recognition in the diabetes medical field.
To that end, the invention provides a diabetes medical text named entity recognition method comprising:
Step 1: acquire corpora of diabetes-related medical texts from the network and preprocess them to construct a corpus. Preprocessing comprises segmenting the unlabeled text into words and removing stop words.
Step 2: annotate the data in the corpus with the BIOES tagging scheme to generate a dataset.
Step 3: obtain word-level semantic vectors with the RoBERTa-wwm pre-trained model.
Step 4: feed the word-level semantic vectors output in step 3 into a local context awareness module to capture local feature information of the text.
Step 5: feed the local feature information output in step 4 into a BiGRU combined with a self-attention layer to capture global feature information.
Step 6: from the global feature information of step 5, learn the dependencies between adjacent labels with a conditional random field to obtain the optimal label sequence, completing recognition of diabetes medical text entities.
According to this diabetes medical text named entity recognition method, after diabetes-related medical text corpora are obtained from the network, they are processed into a dataset whose entities are annotated with the BIOES scheme. Word vectors of the text data are extracted with the RoBERTa-wwm pre-trained model; the local context awareness module effectively extracts local information; a BiGRU layer with an added self-attention mechanism captures the global feature information of the text; finally, the global feature information is fed into a conditional random field for decoding, which outputs the most probable tag sequence and yields the tag class of each character. This improves the accuracy of recognizing named entities in diabetes medical texts and the effectiveness of named entity recognition in the diabetes medical field.
Preferably, in step 2, the medical entities are divided into six classes: examination index, drug name, adverse reaction, body part, surgery, and disease. Entities in the dataset are annotated with the BIOES scheme: B marks the beginning of an entity, I its interior, E its end, S a single character that is itself an entity, and O a non-entity character belonging to no type.
Preferably, in step 3, the RoBERTa-wwm pre-trained model learns language representations and word-level semantic representations to the greatest possible extent.
Preferably, the RoBERTa-wwm pre-trained model combines RoBERTa with Chinese whole word masking. RoBERTa-wwm replaces BERT's static mask with a dynamic mask that randomly selects different words to mask on each pass, increasing the randomness of the model input and letting the model learn more diverse language representations. It also replaces BERT's single-character mask with a whole word mask, which masks entire words instead of single characters and improves the model's understanding of vocabulary.
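The contrast can be made concrete with a short Python sketch. This is a simplified illustration rather than the actual pre-training code: jieba is assumed as the word segmenter, and the 80/10/10 replacement rule applied in real RoBERTa-wwm pre-training is omitted:

```python
import random
import jieba  # assumed segmenter for determining whole-word boundaries

MASK = "[MASK]"

def whole_word_mask(text: str, mask_prob: float = 0.15) -> list[str]:
    """Simplified whole-word masking: a selected word is masked in full,
    never character by character. Real RoBERTa-wwm additionally applies
    the 80/10/10 [MASK]/random/keep rule on subword ids."""
    out = []
    for word in jieba.lcut(text):
        if random.random() < mask_prob:
            out.extend([MASK] * len(word))  # mask every character of the word
        else:
            out.extend(list(word))
    return out

# "血糖" (blood glucose) is either fully masked or fully kept, never half-masked.
print(whole_word_mask("患者血糖升高", mask_prob=0.5))
```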
Preferably, in step 4, the local context awareness module is formed by combining a multi-window attention mechanism with a multi-scale residual convolutional neural network. The multi-window attention mechanism effectively captures the important local semantic components under windows of different sizes. The resulting semantic vectors are input into the multi-scale residual convolutional network, where several convolution kernels of different sizes improve the CNN's perception of local features and fully extract local feature information at different scales; a residual structure fuses the semantic information across scales. The fused semantic information is then fed into the BiGRU combined with the self-attention layer for training.
Preferably, in step 5, the BiGRU combined with the self-attention layer captures global feature information, and the self-attention mechanism screens the key information in the input text, alleviating the long-distance dependency problem of the BiGRU layer. The BiGRU layer includes an update gate and a reset gate. The update gate $z_t$ controls the gating state to decide how much of the previous hidden state $h_{t-1}$ is carried into the current state $h_t$, and selectively receives information from the candidate state $\tilde{h}_t$; the reset gate $r_t$ governs how the candidate state $\tilde{h}_t$ is fused with the previous state $h_{t-1}$. The states of a GRU unit are computed as:

$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$$

$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$$

$$\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t])$$

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

where $x_t$ is the input at the current time step $t$, the $W$ are weight matrices, $\sigma$ is the sigmoid function, and $\tanh$ is the hyperbolic tangent function.
Preferably, in step 6, the conditional random field uses the learned relationships between adjacent labels to compute the probability of each tag sequence and outputs the most probable one, thereby determining the tag class of each character.
Preferably, given an input sequence x, the probability score of an output tag sequence y is:

$$s(x, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

where A is the transition matrix, $A_{y_{i-1}, y_i}$ is the score of transitioning from label $y_{i-1}$ to label $y_i$, and $P_{i, y_i}$ is the score of the i-th character of input sequence x taking label $y_i$. The conditional probability of a sequence y is:

$$P(y \mid x) = \frac{\exp(s(x, y))}{\sum_{\tilde{y} \in Y_x} \exp(s(x, \tilde{y}))}$$

During training of the conditional random field, maximum likelihood is used to maximize the probability of the correct label sequence y':

$$\log P(y' \mid x) = s(x, y') - \log \sum_{\tilde{y} \in Y_x} \exp(s(x, \tilde{y}))$$

Finally, the Viterbi algorithm obtains the highest-scoring label sequence, the globally optimal output of the conditional random field:

$$y^{*} = \arg\max_{\tilde{y} \in Y_x} s(x, \tilde{y})$$
The diabetes medical named entity recognition method provided by the invention has the following advantages:
The invention provides a diabetes medical text named entity recognition method that combines local and global feature extraction. Pre-acquired diabetes medical texts are segmented into words, stop words are removed, and a dataset is built; the diabetes medical entities in the dataset are annotated in the BIOES scheme; the text data are fed into the pre-trained model RoBERTa-wwm to obtain text vectors with word-level semantic representations; and a local context awareness module extracts multi-scale local features of the diabetes medical text. Applying a multi-window attention mechanism to the word-level text vectors effectively captures the important semantic components of local features under windows of different sizes. The convolution layer improves the CNN's local feature perception by using several convolution kernels of different sizes, computed efficiently in parallel, to fully extract local feature information at different scales; a residual structure then fuses the semantic information across scales. The fused local semantic information is fed into a BiGRU combined with a self-attention layer; the added self-attention overcomes the BiGRU layer's limited capture of long-distance dependencies and yields global semantic information. Finally, a Conditional Random Field (CRF) captures the correlations between adjacent labels in the sequence to obtain the optimal label sequence. By obtaining word-level semantic vectors from the RoBERTa-wwm pre-trained model and combining the local context awareness module with a self-attentive BiGRU, the method effectively accounts for both local and global features. Compared with the various improvements on traditional mainstream methods, it resolves the poor recognition caused by ambiguity and entity co-reference in diabetes medical texts.
Drawings
FIG. 1 is a flowchart of a method for identifying a named entity of a diabetes medical text provided by an embodiment of the invention;
FIG. 2 is a schematic diagram of a multi-scale residual convolutional network according to an embodiment of the invention;
FIG. 3 is a model framework diagram of the diabetes medical text named entity recognition method according to an embodiment of the invention.
Detailed Description
The invention will now be described in more detail with reference to the drawings and specific embodiments. The invention provides a diabetes medical text named entity recognition method for the diabetes medical field, addressing the poor recognition that ambiguity and entity co-reference among diabetes medical entities cause for existing techniques.
Referring to fig. 1, a flowchart of a method for identifying a named entity of a diabetes medical text is provided in an embodiment of the present invention, including the following steps:
step S1, acquiring corpus related to diabetes medical texts in a network, segmenting unlabeled texts, removing stop words and the like, and obtaining a data set.
Because research on named entities in the current diabetes medical field is scarce and annotated datasets are rare, a large number of diabetes-related medical text corpora are obtained from medical websites, medical encyclopedia sites, and the like, and a Chinese diabetes corpus is constructed. The acquired corpora are then cleaned: space characters, line feeds, carriage returns, and tabs are removed, and duplicate content is removed to improve data quality. Sentences in the text are segmented into words: Chinese text is processed with the jieba word segmentation tool so that continuous strings are split into meaningful words. Finally, stop words with no actual meaning, such as "have" and "are", are removed.
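A minimal Python sketch of this cleaning and segmentation step follows; the stop-word set here is illustrative, and in practice a published Chinese stop-word list would be loaded from file:

```python
import re
import jieba

# illustrative stop-word set; a full Chinese stop-word list would be loaded in practice
STOPWORDS = {"的", "了", "是", "在", "有"}

def preprocess(raw: str) -> list[str]:
    """Clean one raw corpus line and segment it, as described in step S1."""
    # data cleaning: strip spaces, line feeds, carriage returns, and tabs
    text = re.sub(r"[ \t\r\n\u3000]+", "", raw)
    # word segmentation with the jieba tool
    words = jieba.lcut(text)
    # remove stop words that carry no actual meaning
    return [w for w in words if w not in STOPWORDS]

print(preprocess("患者 有 多饮、多尿 的 症状\r\n"))
```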
Step S2: annotate the data in the dataset with the BIOES scheme: B marks the beginning of an entity, I its interior, E its end, S a single character that is itself an entity, and O a non-entity character belonging to no type. Annotation divides the medical entities into six classes: examination index, drug name, adverse reaction, body part, surgery, and disease.
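As an illustration of character-level BIOES tagging, assuming "Drug" and "Part" as label strings for the drug-name and body-part classes (the patent does not fix the label names):

```python
# BIOES tagging of the drug-name entity "二甲双胍" (metformin) in a sentence;
# "Drug" is an assumed label string for the drug-name class.
sentence = list("口服二甲双胍治疗")
tags     = ["O", "O", "B-Drug", "I-Drug", "I-Drug", "E-Drug", "O", "O"]

# A single-character entity such as "肝" (liver, a body part) would be S-Part.
for char, tag in zip(sentence, tags):
    print(f"{char}\t{tag}")
```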
Step S3: extract word vectors of the text data with the RoBERTa-wwm pre-trained model. RoBERTa-wwm is structurally similar to BERT and uses the Transformer model, but modifies and adjusts BERT in several ways: it removes BERT's next sentence prediction task, uses a longer maximum sequence length, and dynamically adjusts the batch size. RoBERTa-wwm also uses whole word masking, masking entire words during training rather than randomly selecting sub-words as in the original BERT model. This lets RoBERTa-wwm better handle character-based languages such as Chinese and improves its Chinese representations.
Here, the RoBERTa-wwm model is a pre-trained model that learns feature representations by self-supervision on massive corpora; its RoBERTa features can be embedded as high-quality word representations for natural language processing tasks. Pre-training uses the masked language modeling (MLM) task: 15% of the sequence tokens are randomly masked, and features are extracted by generating vectors embedded in the RoBERTa pre-trained model. Unlike BERT, RoBERTa-wwm uses a larger corpus and longer training time to improve model performance. Relative to BERT's dictionary, RoBERTa-wwm adds special identifiers: [CLS] marks the start of a sentence, [SEP] is a separator, [UNK] an unknown token, and [MASK] a mask token. Under the random masking strategy, of the selected 15% of the sequence, 80% is replaced by [MASK], 10% is replaced by another word from the text sequence, and 10% is left unchanged. The embeddings of the RoBERTa-wwm model comprise token embeddings, segment embeddings, and position embeddings. Token embeddings are produced by tokenizing the text, with [CLS] and [SEP] marking the beginning and end of a text sequence. Segment embeddings distinguish tokens in different sentences and help the model understand relationships between sentences; a segment embedding is a learnable vector indicating whether a token belongs to the preceding or the following sentence. Position embeddings handle the position information of tokens in the sequence; they are similar to those in the Transformer model but differ in implementation, since RoBERTa-wwm's position embeddings are learned rather than using a fixed positional encoding scheme.
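A sketch of the word-vector extraction in step S3, assuming the publicly released hfl/chinese-roberta-wwm-ext checkpoint (the patent names no specific checkpoint; per that model's documentation it is loaded with the BERT classes):

```python
import torch
from transformers import BertModel, BertTokenizer

# assumed checkpoint: a publicly released Chinese RoBERTa-wwm model
name = "hfl/chinese-roberta-wwm-ext"
tokenizer = BertTokenizer.from_pretrained(name)
model = BertModel.from_pretrained(name)

inputs = tokenizer("患者空腹血糖升高", return_tensors="pt")  # adds [CLS]/[SEP]
with torch.no_grad():
    outputs = model(**inputs)

# one contextual vector per token: (batch, 8 chars + [CLS] + [SEP], hidden=768)
print(outputs.last_hidden_state.shape)  # torch.Size([1, 10, 768])
```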
Step S4: feed the word-level semantic information output in step S3 into the local context awareness module to capture local feature information of the text. The local context awareness module encodes the character sequence output by the RoBERTa model while implicitly grouping related characters, capturing correlations in the local context and extracting multi-scale local features of the diabetes text. Each character's input representation is $w = w_{ch}$, with the character embedding $w_{ch} \in \mathbb{R}^{e_{ch}}$.
Here, the convolution window size of the CNN is set to k, and each character embedding includes a position embedding of the same size as the window. The position embedding index ranges from 0 to k-1; its initial value is 1 if the current index corresponds to the character's position in the window, and 0 otherwise. In this way the CNN encodes each character's positional information in context into its embedding vector, capturing the order dependency of characters in the sequence. The embedding dimension is $e_g = e_{ch} + e_{pos}$, where $e_{ch}$ is the dimension of the CNN character vector and $e_{pos}$ that of the position information vector. To capture the semantic relationship between the center character and surrounding characters, CNNs with different convolution window sizes are combined with a multi-window attention mechanism. This effectively focuses on the local context of each character and strengthens the semantic relation between the center character and its neighbors.
In the multi-window attention layer, with window size k centered on character j, the center character and its surrounding characters are the inputs $w_{j-(k-1)/2}, \ldots, w_j, \ldots, w_{j+(k-1)/2}$, which ultimately produce k hidden vectors $h_{j-(k-1)/2}, \ldots, h_j, \ldots, h_{j+(k-1)/2}$ of length $e_{ch}$, computed as:

$$h_m = a_m w_m, \quad m \in \{j-(k-1)/2, \ldots, j+(k-1)/2\}$$

where $a_m$ is the attention weight, computed as:

$$a_m = \frac{\exp(\mathrm{score}(w_j, w_m))}{\sum_{n} \exp(\mathrm{score}(w_j, w_n))}$$

with the score function:

$$\mathrm{score}(w_j, w_m) = v^{\top} \tanh(W_1 w_j + W_2 w_m)$$

where $v \in \mathbb{R}^{e_{ch}}$ and $W_1, W_2 \in \mathbb{R}^{e_{ch} \times e_g}$, $e_g$ being the embedding dimension. The resulting vector sequence $h_{j-(k-1)/2 : j+(k-1)/2}$ is passed through convolution operations with different kernel sizes, and the extracted local semantic features are expressed as:

$$c = \mathrm{CNN}_k\big(h_{j-(k-1)/2 : j+(k-1)/2}\big)$$
Here, the multi-scale residual convolutional network structure fuses multi-scale local context information to obtain more effective feature information, and ensures that no network degradation occurs as network depth increases; see fig. 2. Except for the first CNN layer, the input of each CNN layer is the fused feature vector obtained by residual-connecting the input and output of the previous layer; finally the feature vectors output by all CNN layers are spliced to give the output of the local context awareness module, where "+" denotes the splicing operation: $C = c_1 + c_2 + \ldots + c_i$.
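A minimal PyTorch sketch of this multi-scale residual convolution, with illustrative kernel sizes, dimensions, and a ReLU nonlinearity, none of which the patent fixes:

```python
import torch
import torch.nn as nn

class MultiScaleResidualCNN(nn.Module):
    """Stacked 1-D convolutions with different kernel sizes. Each layer
    after the first receives the residual fusion of the previous layer's
    input and output; all per-scale outputs are spliced at the end."""

    def __init__(self, dim: int, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, k, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x.transpose(1, 2)          # (batch, seq, dim) -> (batch, dim, seq)
        scale_outputs = []
        for conv in self.convs:
            c = torch.relu(conv(h))
            scale_outputs.append(c)
            h = h + c                  # residual fusion feeds the next layer
        # splice: C = c1 + c2 + ... + ci (concatenation along the feature axis)
        out = torch.cat(scale_outputs, dim=1)
        return out.transpose(1, 2)     # (batch, seq, dim * n_scales)

x = torch.randn(2, 20, 128)                 # e.g. 20 characters, embedding 128
print(MultiScaleResidualCNN(128)(x).shape)  # torch.Size([2, 20, 384])
```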
Step S5: feed the semantic information output by the local context awareness module in step S4 into the BiGRU combined with the self-attention layer to capture global feature information.
The BiGRU layer adopts a bidirectional GRU neural network consisting of a forward hidden layer and a backward hidden layer. It simultaneously obtains two different vector representations of the input at the current moment and combines them into the current input $d = [d_1, d_2, \ldots, d_n]$, performing deep text feature extraction and better understanding contextual dependencies. The GRU contains two gating units, an update gate and a reset gate, where the update gate replaces the input and forget gates of the LSTM. The update gate $z_t$ controls the gating state to decide how much of the previous hidden state $h_{t-1}$ is carried into the current state $h_t$, and selectively receives information from the candidate state $\tilde{h}_t$; the reset gate $r_t$ governs how the candidate state $\tilde{h}_t$ is fused with the previous state $h_{t-1}$. The states of a GRU unit are computed as:

$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$$

$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$$

$$\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t])$$

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
Further, a self-attention mechanism is employed to address the BiGRU layer's limited capture of long-distance dependencies. Like the self-attention in the Transformer, it attends only to relationships between characters within the input sequence, finding the relations between different characters and selecting the most representative and critical words and phrases. It is computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where Q is the query matrix, K the key matrix, V the value matrix, and $d_k$ the dimension of Q and K. The attention mechanism computes similarity scores between the query matrix Q and all key matrices K and scales them by dividing by the factor $\sqrt{d_k}$ so that they do not grow too large in high dimensions, avoiding numerical problems in the softmax computation. The scores are then converted into normalized weights with the softmax function and applied to the value matrix V to obtain weighted vector representations, which express the importance of different parts of the input sequence.
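A minimal PyTorch sketch of the step S5 block, using single-head scaled dot-product self-attention over the BiGRU outputs (the dimensions are illustrative; the patent specifies none):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiGRUSelfAttention(nn.Module):
    """Bidirectional GRU followed by scaled dot-product self-attention."""

    def __init__(self, in_dim: int, hidden: int):
        super().__init__()
        self.bigru = nn.GRU(in_dim, hidden, batch_first=True, bidirectional=True)
        d = 2 * hidden  # forward and backward hidden states concatenated
        self.q, self.k, self.v = (nn.Linear(d, d) for _ in range(3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.bigru(x)                      # (batch, seq, 2*hidden)
        q, k, v = self.q(h), self.k(h), self.v(h)
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
        scores = q @ k.transpose(1, 2) / q.size(-1) ** 0.5
        return F.softmax(scores, dim=-1) @ v      # (batch, seq, 2*hidden)

x = torch.randn(2, 20, 384)                  # local features from step S4
print(BiGRUSelfAttention(384, 128)(x).shape) # torch.Size([2, 20, 256])
```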
Step S6: send the global semantic information output in step S5 into a conditional random field to obtain the optimal tag sequence. Given an input sequence x, the probability score of an output tag sequence y under the conditional random field is:

$$s(x, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

where A is the transition matrix, $A_{y_{i-1}, y_i}$ is the score of transitioning from $y_{i-1}$ to $y_i$, and $P_{i, y_i}$ is the score of the i-th character of input sequence x taking label $y_i$. The conditional probability of a sequence y is:

$$P(y \mid x) = \frac{\exp(s(x, y))}{\sum_{\tilde{y} \in Y_x} \exp(s(x, \tilde{y}))}$$

During training of the conditional random field, maximum likelihood is used to maximize the probability of the correct label sequence y':

$$\log P(y' \mid x) = s(x, y') - \log \sum_{\tilde{y} \in Y_x} \exp(s(x, \tilde{y}))$$

Finally, the Viterbi algorithm obtains the highest-scoring label sequence, the globally optimal output of the conditional random field:

$$y^{*} = \arg\max_{\tilde{y} \in Y_x} s(x, \tilde{y})$$
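CRF training and decoding are commonly delegated to a library such as pytorch-crf; the standalone NumPy sketch below shows only the Viterbi decoding step, with random scores standing in for a trained model's emission matrix P and transition matrix A:

```python
import numpy as np

def viterbi_decode(emissions: np.ndarray, transitions: np.ndarray) -> list[int]:
    """Find the highest-scoring tag sequence y* = argmax_y s(x, y).

    emissions:   (seq_len, n_tags) per-character tag scores P[i, y_i]
    transitions: (n_tags, n_tags)  scores A[y_{i-1}, y_i]
    """
    seq_len, n_tags = emissions.shape
    score = emissions[0].copy()       # best score ending in each tag so far
    backpointers = []
    for i in range(1, seq_len):
        # score of extending every previous tag to every current tag
        total = score[:, None] + transitions + emissions[i][None, :]
        backpointers.append(total.argmax(axis=0))
        score = total.max(axis=0)
    # follow the back-pointers from the best final tag
    best = [int(score.argmax())]
    for bp in reversed(backpointers):
        best.append(int(bp[best[-1]]))
    return best[::-1]

emissions = np.random.randn(8, 25)    # 8 characters, 25 BIOES tags (dummy)
transitions = np.random.randn(25, 25)
print(viterbi_decode(emissions, transitions))
```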
Experimental verification:
In this embodiment, the overall framework of the proposed model is shown in fig. 3. The model first extracts word vectors of the text data with the RoBERTa-wwm pre-trained model; the local context awareness module then effectively extracts local information from the word vectors output by RoBERTa-wwm; a BiGRU combined with a self-attention layer captures the global feature information of the text; finally, the global feature information is fed into a conditional random field for decoding, which outputs the most probable tag sequence and yields the tag class of each character. The model was trained on the annotated data of this embodiment alongside existing models; Table 1 compares their performance:
TABLE 1. Performance comparison of different models

Model                    P (%)    R (%)    F1 (%)
BiLSTM-CRF               73.63    74.65    74.14
BERT-CRF                 79.19    81.94    80.55
BERT-BiGRU-CRF           80.67    84.03    82.31
BERT-BiLSTM-CRF          81.33    84.72    82.99
BERT-BiGRU-IDCNN-CRF     88.43    82.29    85.25
BERT-BiLSTM-IDCNN-CRF    87.22    80.56    83.75
Ours                     91.54    86.46    88.93
The experiments use the three standard evaluation metrics for named entity recognition: precision (P), recall (R), and F1 score (F1), computed as:

P = Tp / (Tp + Fp)

R = Tp / (Tp + Fn)

F1 = 2PR / (P + R)

where Tp (true positives) is the number of positive samples predicted as positive, Fp (false positives) the number of negative samples predicted as positive, and Fn (false negatives) the number of positive samples predicted as negative. The experimental results show that the model proposed in this embodiment fully considers both local and global feature extraction: the local feature awareness module extracts local information from sentences, and a self-attention mechanism added on top of the BiGRU layer corrects the BiGRU's unreasonable weight distribution over characters. As Table 1 shows, the proposed model outperforms the comparison models on every metric, indicating a good recognition effect in the diabetes medical field and resolving the poor recognition caused by ambiguity and entity co-reference in diabetes medical texts.
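A minimal computation of the three metrics; the counts below are invented solely to exercise the formulas and do not correspond to Table 1:

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from entity-level counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

p, r, f1 = prf1(tp=90, fp=10, fn=15)  # dummy counts for illustration
print(f"P={p:.2%}  R={r:.2%}  F1={f1:.2%}")  # P=90.00%  R=85.71%  F1=87.80%
```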
In the description of the above embodiments, the various technical features may be combined in any non-contradictory manner. For brevity, not all combinations of features are described for each embodiment; as long as a combination of technical features is not contradictory, it should be considered within the scope of this specification.
The above examples merely illustrate the invention more clearly and do not limit its embodiments. Those skilled in the art can make changes or modifications in form based on the above description; this specification does not exhaustively enumerate all embodiments. Such modifications, substitutions, and improvements are intended to fall within the spirit and scope of the invention as defined by the claims.

Claims (8)

1. A diabetes medical text named entity recognition method, characterized by comprising the following steps:
step 1, acquiring corpora of diabetes-related medical texts from the network and preprocessing them to construct a corpus, the preprocessing comprising segmenting the unlabeled text into words and removing stop words to obtain the corpus;
step 2, annotating the data in the corpus with the BIOES tagging scheme to generate a dataset;
step 3, obtaining word-level semantic vectors with a RoBERTa-wwm pre-trained model;
step 4, inputting the word-level semantic vectors obtained by RoBERTa-wwm in step 3 into a local context awareness module to capture local feature information of the text;
step 5, feeding the local feature information obtained by the local context awareness module in step 4 into a BiGRU combined with a self-attention layer to capture global feature information; and
step 6, for the global feature information output by the BiGRU combined with the self-attention layer, learning the dependencies between adjacent labels with a conditional random field to obtain the optimal label sequence, thereby completing recognition of diabetes medical text entities.
2. The diabetes medical text named entity recognition method according to claim 1, wherein step 1 comprises cleaning the acquired corpora, including removing space characters, line feeds, carriage returns, and tabs, and removing duplicate content; segmenting sentences in the text into words; and removing stop words.
3. The diabetes medical text named entity recognition method according to claim 1, wherein annotating, in step 2, the corpus data constructed in step 1 comprises dividing the medical entities into six classes: examination index, drug name, adverse reaction, body part, surgery, and disease; and annotating the entities in the dataset with the BIOES scheme, wherein B marks the beginning of an entity, I marks its interior, E marks its end, S marks a single character that is itself an entity, and O marks a non-entity character belonging to no type.
4. The diabetes medical text named entity recognition method according to claim 1, wherein the RoBERTa-wwm pre-trained model in step 3 is used to learn multiple language representations and word-level semantic representations.
5. The diabetes medical text named entity recognition method according to claim 1, wherein the local context awareness module in step 4 is formed by combining a multi-window attention mechanism with a multi-scale residual convolutional neural network, the multi-window attention mechanism effectively capturing the important semantic components of local features under windows of different sizes; the obtained important semantic components of the local features are input into the multi-scale residual convolutional network, which extracts local feature information at different scales through several convolution kernels of different sizes and fuses the local semantic information across scales with a residual structure; specifically, except for the first CNN layer, the input of each CNN layer is the fused feature vector obtained by residual-connecting the input and output of the previous layer, and finally the feature vectors output by all CNN layers are spliced to obtain the semantic information of the local context awareness module; this semantic information is input into the BiGRU combined with the self-attention layer for training.
6. The diabetes medical text named entity recognition method according to claim 1, wherein in step 5 the BiGRU combined with the self-attention layer is used to capture global feature information, the self-attention mechanism screening the key information in the input text and alleviating the long-distance dependency problem of the BiGRU layer; the BiGRU layer comprises an update gate and a reset gate, the update gate $z_t$ controlling the gating state to decide how much of the previous hidden state $h_{t-1}$ is carried into the current state $h_t$ and selectively receiving information from the candidate state $\tilde{h}_t$, and the reset gate $r_t$ governing how the candidate state $\tilde{h}_t$ is fused with the previous state $h_{t-1}$; each state of a GRU unit in the BiGRU layer is computed as:

$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$$

$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$$

$$\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t])$$

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

wherein $x_t$ is the input at the current time step $t$, $W$ denotes a weight matrix, $\sigma$ is the sigmoid function, and $\tanh$ is the hyperbolic tangent function.
7. The diabetes medical text named entity recognition method according to claim 1, wherein in step 6 the conditional random field computes the probability of each tag sequence using the learned relationships between adjacent labels and outputs the most probable tag sequence, thereby determining the tag class of each character.
8. The diabetes medical text named entity recognition method according to claim 7, wherein, given an input sequence x, the probability score of an output tag sequence y is:

$$s(x, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

wherein A is the transition matrix, $A_{y_{i-1}, y_i}$ is the score of transitioning from label $y_{i-1}$ to $y_i$, and $P_{i, y_i}$ is the score of the i-th character of input sequence x taking label $y_i$; the conditional probability of a sequence y is:

$$P(y \mid x) = \frac{\exp(s(x, y))}{\sum_{\tilde{y} \in Y_x} \exp(s(x, \tilde{y}))}$$

during training of the conditional random field, maximum likelihood is used to maximize the probability of the correct label sequence y':

$$\log P(y' \mid x) = s(x, y') - \log \sum_{\tilde{y} \in Y_x} \exp(s(x, \tilde{y}))$$

finally, the Viterbi algorithm obtains the highest-scoring label sequence, the globally optimal output of the conditional random field:

$$y^{*} = \arg\max_{\tilde{y} \in Y_x} s(x, \tilde{y})$$
CN202310616459.3A 2023-05-29 2023-05-29 Diabetes medical text named entity recognition method Withdrawn CN116341557A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310616459.3A CN116341557A (en) 2023-05-29 2023-05-29 Diabetes medical text named entity recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310616459.3A CN116341557A (en) 2023-05-29 2023-05-29 Diabetes medical text named entity recognition method

Publications (1)

Publication Number Publication Date
CN116341557A true CN116341557A (en) 2023-06-27

Family

ID=86888057

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310616459.3A Withdrawn CN116341557A (en) 2023-05-29 2023-05-29 Diabetes medical text named entity recognition method

Country Status (1)

Country Link
CN (1) CN116341557A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116756624A (en) * 2023-08-17 2023-09-15 中国民用航空飞行学院 Text classification method for civil aviation supervision item inspection record processing
CN117973393A (en) * 2024-03-28 2024-05-03 苏州系统医学研究所 Accurate semantic comparison method and system for key medical information in medical text
CN118133830A (en) * 2024-04-30 2024-06-04 北京壹永科技有限公司 Named entity recognition method, named entity recognition device, named entity recognition equipment and named entity recognition computer readable storage medium
CN118194870A (en) * 2024-05-16 2024-06-14 中南大学 Chinese medicine named entity recognition method, device, electronic equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800411A (en) * 2018-12-03 2019-05-24 哈尔滨工业大学(深圳) Clinical treatment entity and its attribute extraction method
CN111626056A (en) * 2020-04-11 2020-09-04 中国人民解放军战略支援部队信息工程大学 Chinese named entity identification method and device based on RoBERTA-BiGRU-LAN model
CN112560478A (en) * 2020-12-16 2021-03-26 武汉大学 Chinese address RoBERTA-BilSTM-CRF coupling analysis method using semantic annotation
CN114169330A (en) * 2021-11-24 2022-03-11 匀熵教育科技(无锡)有限公司 Chinese named entity identification method fusing time sequence convolution and Transformer encoder
CN115329765A (en) * 2022-08-12 2022-11-11 江西理工大学 Method and device for identifying risks of listed enterprises, electronic equipment and storage medium
US20230053148A1 (en) * 2021-08-11 2023-02-16 Tencent America LLC Extractive method for speaker identification in texts with self-training
CN115859978A (en) * 2022-11-08 2023-03-28 浙江科技学院 Named entity recognition model and method based on Roberta radical enhanced adapter

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800411A (en) * 2018-12-03 2019-05-24 哈尔滨工业大学(深圳) Clinical treatment entity and its attribute extraction method
CN111626056A (en) * 2020-04-11 2020-09-04 中国人民解放军战略支援部队信息工程大学 Chinese named entity identification method and device based on RoBERTA-BiGRU-LAN model
CN112560478A (en) * 2020-12-16 2021-03-26 武汉大学 Chinese address RoBERTA-BilSTM-CRF coupling analysis method using semantic annotation
US20230053148A1 (en) * 2021-08-11 2023-02-16 Tencent America LLC Extractive method for speaker identification in texts with self-training
CN114169330A (en) * 2021-11-24 2022-03-11 匀熵教育科技(无锡)有限公司 Chinese named entity identification method fusing time sequence convolution and Transformer encoder
CN115329765A (en) * 2022-08-12 2022-11-11 江西理工大学 Method and device for identifying risks of listed enterprises, electronic equipment and storage medium
CN115859978A (en) * 2022-11-08 2023-03-28 浙江科技学院 Named entity recognition model and method based on Roberta radical enhanced adapter

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LIN SHI ET AL.: "Uniting Multi-Scale Local Feature Awareness and the Self-Attention Mechanism for Named Entity Recognition", Mathematics, pages 1-5 *
LI YUNXIANG ET AL.: "Chinese Named Entity Recognition Based on RoBERTa-WWM-BiGRU-CRF" (基于RoBERTa-WWM-BiGRU-CRF的中文命名实体识别), Journal of Nanning Normal University (Natural Science Edition), vol. 40, no. 1, pages 72-78 *
JI CHUAN: "Research and Application of Similar Text Matching Technology Based on Deep Learning" (基于深度学习的相似文本匹配技术研究与应用), China Master's Theses Full-text Database, Information Science and Technology, no. 01, page 1 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116756624A (en) * 2023-08-17 2023-09-15 中国民用航空飞行学院 Text classification method for civil aviation supervision item inspection record processing
CN116756624B (en) * 2023-08-17 2023-12-12 中国民用航空飞行学院 Text classification method for civil aviation supervision item inspection record processing
CN117973393A (en) * 2024-03-28 2024-05-03 苏州系统医学研究所 Accurate semantic comparison method and system for key medical information in medical text
CN117973393B (en) * 2024-03-28 2024-06-07 苏州系统医学研究所 Accurate semantic comparison method and system for key medical information in medical text
CN118133830A (en) * 2024-04-30 2024-06-04 北京壹永科技有限公司 Named entity recognition method, named entity recognition device, named entity recognition equipment and named entity recognition computer readable storage medium
CN118194870A (en) * 2024-05-16 2024-06-14 中南大学 Chinese medicine named entity recognition method, device, electronic equipment and storage medium
CN118194870B (en) * 2024-05-16 2024-08-13 中南大学 Chinese medicine named entity recognition method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
Zhu et al. CAN-NER: Convolutional attention network for Chinese named entity recognition
WO2021139424A1 (en) Text content quality evaluation method, apparatus and device, and storage medium
CN116341557A (en) Diabetes medical text named entity recognition method
Campos et al. Biomedical named entity recognition: a survey of machine-learning tools
CN106980608A (en) A kind of Chinese electronic health record participle and name entity recognition method and system
CN111160031A (en) Social media named entity identification method based on affix perception
CN113190656B (en) Chinese named entity extraction method based on multi-annotation frame and fusion features
CN106980609A (en) A kind of name entity recognition method of the condition random field of word-based vector representation
CN111460824B (en) Unmarked named entity identification method based on anti-migration learning
CN112151183A (en) Entity identification method of Chinese electronic medical record based on Lattice LSTM model
CN111243699A (en) Chinese electronic medical record entity extraction method based on word information fusion
CN116432655B (en) Method and device for identifying named entities with few samples based on language knowledge learning
Arvanitis et al. Translation of sign language glosses to text using sequence-to-sequence attention models
CN114970536B (en) Combined lexical analysis method for word segmentation, part-of-speech tagging and named entity recognition
CN115238693A (en) Chinese named entity recognition method based on multi-word segmentation and multi-layer bidirectional long-short term memory
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN111523320A (en) Chinese medical record word segmentation method based on deep learning
CN117217233A (en) Text correction and text correction model training method and device
CN113536799B (en) Medical named entity recognition modeling method based on fusion attention
CN113191150B (en) Multi-feature fusion Chinese medical text named entity identification method
Shi et al. Understanding patient query with weak supervision from doctor response
CN114970537B (en) Cross-border ethnic cultural entity relation extraction method and device based on multi-layer labeling strategy
Zhang et al. Medical named entity recognition based on overlapping neural networks
CN115859978A (en) Named entity recognition model and method based on Roberta radical enhanced adapter
Cai et al. HCADecoder: a hybrid CTC-attention decoder for chinese text recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20230627