CN116341556A - Small sample rehabilitation medical named entity identification method and device based on data enhancement

Info

Publication number
CN116341556A
CN116341556A CN202310612923.1A CN202310612923A CN116341556A CN 116341556 A CN116341556 A CN 116341556A CN 202310612923 A CN202310612923 A CN 202310612923A CN 116341556 A CN116341556 A CN 116341556A
Authority
CN
China
Prior art keywords
rehabilitation medical
named entity
case data
data
medical case
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310612923.1A
Other languages
Chinese (zh)
Inventor
陈博
孟过
刘炯
王剑斌
沈怡俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT
Priority to CN202310612923.1A
Publication of CN116341556A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a method and a device for identifying small sample rehabilitation medical named entities based on data enhancement. The method comprises the following steps: acquiring initial rehabilitation medical case data, dividing it into named entities, and performing BIOS labeling on the divided rehabilitation medical case data; performing data enhancement on the divided rehabilitation medical case data to obtain rehabilitation medical case data with new labels, which comprises analyzing the length of each named entity in the divided data and randomly masking different named entities in the rehabilitation medical case data, and/or randomly replacing named entities in the rehabilitation medical case data with named entities of the same type; and inputting the initial rehabilitation medical case data and the rehabilitation medical case data with new labels into a named entity recognition network to obtain a rehabilitation medical named entity recognition result.

Description

Small sample rehabilitation medical named entity identification method and device based on data enhancement
Technical Field
The invention relates to the technical fields of data enhancement, named entity recognition, BIOS labeling and the like, in particular to a method and a device for recognizing a small sample rehabilitation medical named entity based on data enhancement.
Background
Although medicine is increasingly advanced, many diseases still seriously threaten human life. Among them, stroke has become the leading cause of death in China because of its high morbidity, high disability rate, high mortality, and high recurrence rate, and it is also the leading cause of disability among adults in China. Recovery of limb motor function is therefore an important part of the rehabilitation of stroke patients. With the rapid development of artificial intelligence, technologies that use deep learning to assist rehabilitation diagnosis, planning, or the treatment process are emerging. However, training a deep model usually requires a large amount of labeled data, while data acquired in practice may be structured, semi-structured, or unstructured, which constrains the training process in terms of both data structure and data quality. Structured data generally refers to data that can be logically expressed as two-dimensional tables; semi-structured data does not conform to a two-dimensional table but contains associated markers; unstructured data has no fixed structure, for example case text.
In practice, structured data is scarcer and more costly to acquire than the other two types, and this problem is particularly serious in specialized fields such as rehabilitation medicine. By building a named entity recognition network, structured information such as entities, relations, and entity attributes can be extracted automatically from semi-structured and unstructured data, which effectively alleviates the problem that structured data is limited in volume and hard to obtain. Entity extraction is one of the key technologies in this process. Entity extraction, also called named entity recognition, locates and classifies the important nouns and proper nouns in a text. These are called named entities, and they can be defined manually according to the downstream task.
Named entity recognition is the basis of many downstream tasks, and its accuracy typically determines how well those tasks perform. Many deep learning network frameworks exist for named entity recognition, but training these deep networks has significant shortcomings: a) training a deep neural network model requires a large amount of validly labeled data from the medical field to fit the model; b) in practice, given the data and computing demands of training a named entity recognition neural network, it is difficult to train such a network from scratch. In particular, when a knowledge graph is built in specialized fields such as rehabilitation medicine, professionally labeled medical data is difficult or expensive to obtain, so a general deep learning network framework is hard to train for extracting named entities from unstructured data.
Therefore, a small sample rehabilitation medical named entity recognition method based on data enhancement is provided, and the method is applied to recognition of data in the field of rehabilitation medical treatment.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a method and a device for identifying a small sample rehabilitation medical named entity based on data enhancement.
According to a first aspect of an embodiment of the present invention, there is provided a method for identifying a small sample rehabilitation medical named entity based on data enhancement, the method comprising:
acquiring initial rehabilitation medical case data, dividing named entities, and performing BIOS labeling on the rehabilitation medical case data divided by the named entities;
performing data enhancement on the rehabilitation medical case data divided by the named entity to obtain rehabilitation medical case data with new labels; comprising the following steps:
analyzing the length of each named entity in the rehabilitation medical case data divided by the named entity, and randomly masking different named entities in the rehabilitation medical case data;
and/or,
randomly replacing named entities in the rehabilitation medical case data with named entities of the same type;
and inputting the initial rehabilitation medical case data and the rehabilitation medical case data with the new label into a named entity recognition network to obtain a rehabilitation medical named entity recognition result.
According to a second aspect of the embodiment of the present invention, a small sample rehabilitation medical named entity identification device based on data enhancement is provided, which includes one or more processors, and is configured to implement the small sample rehabilitation medical named entity identification method based on data enhancement.
According to a third aspect of embodiments of the present invention, there is provided a computer readable storage medium having stored thereon a program which, when executed by a processor, is configured to implement the above-described method for identifying a small sample rehabilitation medical named entity based on data enhancement.
Compared with the prior art, the invention has the following beneficial effects. The invention provides a small sample rehabilitation medical named entity identification method based on data enhancement which, when validly labeled rehabilitation medical case data is insufficient, generates additional validly labeled rehabilitation medical case data by random masking and/or random replacement to supplement the data. The enhanced rehabilitation medical case data is input into a pre-trained model for named entity recognition, and the named entity recognition network model is adapted to the rehabilitation medical field of this example by fine-tuning, so that the medical information in the rehabilitation medical case data is extracted. Under small-sample conditions, the random masking and/or random replacement data enhancement of the invention can generate a large amount of validly labeled data, which improves the precision of named entity recognition and extracts the named entities in rehabilitation medical text data more effectively.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort to a person skilled in the art.
FIG. 1 is a flowchart of a method for identifying a small sample rehabilitation medical named entity based on data enhancement provided by an embodiment of the invention;
FIG. 2 is a schematic structural diagram of a method for identifying a named entity of small sample rehabilitation medical treatment based on data enhancement according to an embodiment of the present invention;
FIG. 3 is a diagram of initial rehabilitation medical case data and named entity classification according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of rehabilitation medical case data after using a random mask according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of rehabilitation medical case data after random replacement according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of rehabilitation medical case data using random substitution in combination with a random mask, provided by an embodiment of the present invention;
FIG. 7 is a schematic diagram of a named entity recognition network structure according to an embodiment of the present invention;
FIG. 8 is a diagram illustrating a named entity recognition network for extracting named entity results according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a small sample rehabilitation medical named entity recognition device based on data enhancement according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The features of the following examples and embodiments may be combined with each other without any conflict.
Referring to fig. 1 and 2, an embodiment of the present invention provides a method for identifying a small sample rehabilitation medical named entity based on data enhancement, the method comprising the following steps:
Step S1: acquiring initial rehabilitation medical case data, dividing named entities, and performing BIOS labeling on the rehabilitation medical case data divided by the named entity.
After validly labeled rehabilitation medical case data is obtained for the medical field of this example, the format of the rehabilitation medical case data and the named entity information it contains are analyzed. The data is stored in JSON format. The rehabilitation medical case data contains 14 types of named entities, which are the important or useful information in a text paragraph. The named entity names and their corresponding English categories are: name: name; gender: sex; age: age; diagnosed disease name: disease; disease course: course; affected limb: AL; underlying disease/other diseases: UOD; clinical manifestation: CM; quantized value: scale; rehabilitation equipment: device; treatment time: events; other devices/treatments: ODT; before use: pre; after use: post.
In addition, each character in the rehabilitation medical case data is initially an isolated character with no relation to the others, whereas in practice characters combine into words that carry actual semantic information. Because these words need to be labeled, the invention adopts the BIOS labeling method: B (Begin) marks the first character of an entity; I (Inside) marks the characters of an entity other than the first; O (Other) marks non-entity characters, that is, irrelevant characters; S (Single) marks an entity consisting of a single character.
The BIOS labeling method yields lists of characters (or character strings) and their labels. Before the named entity recognition network is trained or tested, a vocabulary (word list) and a label list must also be constructed, and the original characters and labels are mapped to their index positions in the vocabulary and the label list, respectively.
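The labeling and index-mapping step described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the helper names `bios_tags` and `build_index`, the special symbols `[PAD]` and `[UNK]`, and the short English sample string are all assumptions made for the example (the patent works on Chinese case text, one label per character).

```python
from typing import Dict, List, Tuple

def bios_tags(text: str, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Label each character of `text` with the BIOS scheme.

    `spans` holds (start, end, entity_type) triples, with `end` exclusive.
    """
    tags = ["O"] * len(text)                  # default: non-entity character
    for start, end, etype in spans:
        if end - start == 1:
            tags[start] = f"S-{etype}"        # single-character entity
        else:
            tags[start] = f"B-{etype}"        # first character of the entity
            for i in range(start + 1, end):
                tags[i] = f"I-{etype}"        # remaining characters
    return tags

def build_index(items: List[str], specials: List[str]) -> Dict[str, int]:
    """Map each distinct item to an index position, special symbols first."""
    index: Dict[str, int] = {}
    for item in specials + sorted(set(items)):
        index.setdefault(item, len(index))
    return index

# Hypothetical sample; entity types follow the categories listed in the text.
text = "Liu, 41, stroke"
spans = [(0, 3, "name"), (5, 7, "age"), (9, 15, "disease")]

tags = bios_tags(text, spans)
vocab = build_index(list(text), ["[PAD]", "[UNK]"])
label_map = build_index(tags, [])

# Characters and labels become index positions in the two lists.
char_ids = [vocab.get(ch, vocab["[UNK]"]) for ch in text]
label_ids = [label_map[t] for t in tags]
```

Characters missing from the vocabulary fall back to `[UNK]` here; whether the original system handles unseen characters this way is not stated in the source.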
Illustratively, as shown in FIG. 3, an example of rehabilitation medical case data is "Chen Dage, 35 years old, on October 29, 2022 showed decreased muscle tone and then dyskinesia, was diagnosed with cerebral infarction, with cerebral edema of the surrounding tissues." This piece of rehabilitation medical case data contains five types of named entities: "Chen Dage" is of type name (name), "35 years old" is of type age (age), "October 29, 2022" is of type disease course (course), "dyskinesia" is of type clinical manifestation (CM), and "cerebral infarction" is of type diagnosed disease name (disease). Taking "dyskinesia" as an example, B-CM marks the starting character of the entity, I-CM marks the remaining characters of the entity, and B-CM together with the subsequent consecutive I-CM labels forms the label of "dyskinesia", indicating that "dyskinesia" is a named entity of type CM. Information in the rehabilitation medical case data other than the named entities, that is, other than the important information, is labeled O, indicating that the character is irrelevant when the quantitative indicators of the patient's condition are analyzed, for example "decreased muscle tone" in FIG. 3.
Step S2: performing data enhancement on the rehabilitation medical case data divided by the named entity to obtain rehabilitation medical case data with new labels, comprising:
analyzing the length of each named entity in the rehabilitation medical case data divided by the named entity, and randomly masking different named entities in the rehabilitation medical case data;
and/or,
randomly replacing named entities in the rehabilitation medical case data with named entities of the same type.
Specifically, the goal of the named entity recognition network is to extract from unstructured text the named entities defined by the invention, namely the important information of interest in the rehabilitation medical field. To preserve the integrity of the semantic and structural information of the unstructured text, the invention applies data enhancement only to the named entities in the rehabilitation medical case data rather than to the case data as a whole. The data enhancement methods provided by the embodiment of the invention are described below.
(A) Analyzing the length of each named entity in the rehabilitation medical case data divided by the named entity, and randomly masking different named entities in the rehabilitation medical case data to obtain rehabilitation medical case data with new labels.
Further, this comprises: analyzing the length of each named entity in the rehabilitation medical case data divided by the named entity, setting an average entity mask rate, and randomly masking the content of different named entities in the rehabilitation medical case data with a mask symbol (Figure SMS_1).
It should be noted that the rehabilitation medical data obtained in this example is often incomplete; for example, characters are missing from the text, which degrades its semantics. In view of this, the invention applies a masking operation to the key information in the rehabilitation medical text, namely the named entities, using a rare symbol (Figure SMS_3) to randomly mask the content of the named entities. Although random masking damages the semantic information of the rehabilitation medical data to some extent, for a small-sample named entity recognition task it brings the data closer to real conditions on the one hand and, on the other hand, generates high-quality, complete new data from the initial rehabilitation medical case data, alleviating the problem of too few samples in neural network training. In this example, the content of each named entity was analyzed and the average entity mask rate was set to 25%. Illustratively, as shown in FIG. 4, with [mask] standing for the mask symbol: one character of the name "Chen Dage" is masked; "35 years old" becomes "3[mask] years old"; "October 29, 2022" becomes "20[mask][mask] year, [mask]0 month, 29 day"; and one character each of "dyskinesia" and "cerebral infarction" is masked. Experiments show that this mask rate masks entities of different lengths to different degrees and keeps the masking effective while preserving as much semantic information as possible.
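A minimal sketch of entity-only random masking at a 25% rate is given below. Several details are assumptions, since the source does not specify them: the mask symbol `□` (the patent shows its symbol only as an image), the rule of masking at least one character per entity while leaving at least one character of any multi-character entity visible, and the reuse of the original BIOS labels for the masked text. The function name `mask_entities` and the sample string are hypothetical.

```python
import random
from typing import List, Optional, Tuple

MASK = "□"  # hypothetical mask symbol; the source shows its symbol only as an image

def mask_entities(chars: List[str],
                  spans: List[Tuple[int, int, str]],
                  rate: float = 0.25,
                  rng: Optional[random.Random] = None) -> List[str]:
    """Randomly mask characters inside named entities only.

    `spans` holds (start, end, entity_type) triples with `end` exclusive.
    Non-entity characters are never touched, and the label sequence of the
    original text is assumed to be reused unchanged for the masked text.
    """
    rng = rng or random.Random()
    out = list(chars)
    for start, end, _etype in spans:
        length = end - start
        # About rate * length characters, but at least one per entity.
        k = max(1, round(rate * length))
        # Keep at least one character visible in multi-character entities.
        k = min(k, max(1, length - 1))
        for i in rng.sample(range(start, end), k):
            out[i] = MASK
    return out

# Hypothetical sample: entities of length 3, 2 and 6 get 1, 1 and 2 masks.
chars = list("Liu, 41, stroke")
spans = [(0, 3, "name"), (5, 7, "age"), (9, 15, "disease")]
masked = mask_entities(chars, spans, rate=0.25, rng=random.Random(0))
```

Because every entity receives at least one mask, short entities are masked at more than 25% and long ones at roughly 25%, which is one way to read the statement that entities of different lengths are masked to different degrees.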
(B) Randomly replacing named entities in the rehabilitation medical case data with named entities of the same type.
The named entities in the divided rehabilitation medical case data are grouped by type, and named entities of the same type are randomly exchanged to obtain rehabilitation medical case data with new labels.
Illustratively, as shown in FIG. 5, after random exchange "Chen Dage" becomes "Ban Qinhao", "35 years old" becomes "75 years old", "October 29, 2022" becomes "July 1, 2021", "dyskinesia" becomes "healthy-side leg flexion during walking", and "cerebral infarction" becomes "epilepsy".
It should be noted that data in the medical field of this example is generally random: for the same disease, the actual condition differs from person to person, so a small sample cannot cover all presentations of that disease. In view of this, the invention randomly replaces named entities of the same type in the rehabilitation medical case data. Although the generated rehabilitation medical case data with new labels is often not logically consistent with reality, under small-sample conditions the lack of data has a larger influence on the training result of the neural network. Experiments also show that, during training of the named entity network, the associations between entities and the order in which they appear have little influence on the performance index of the network. With random replacement, on the one hand, the generated rehabilitation medical case data with new labels combined with the initial rehabilitation medical case data covers the medical data of this example more completely, so that the basic pathological information is more complete; on the other hand, more new data is generated, which alleviates the problem of too few samples in neural network training.
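Same-type random replacement can be sketched as below. The representation of a case as a list of (text, type) segments, the per-type entity pool `pool`, and the helper names are assumptions for illustration; the source only states that entities are exchanged with entities of the same type. Because a replacement may differ in length from the original entity, the sketch regenerates the BIOS labels, which is one plausible reading of "data with new labels".

```python
import random
from typing import Dict, List, Tuple

Segment = Tuple[str, str]  # (surface text, entity type); type "O" marks non-entity text

def bios(segment: str, etype: str) -> List[str]:
    """BIOS labels for one segment of text."""
    if etype == "O":
        return ["O"] * len(segment)
    if len(segment) == 1:
        return [f"S-{etype}"]
    return [f"B-{etype}"] + [f"I-{etype}"] * (len(segment) - 1)

def replace_entities(segments: List[Segment],
                     pool: Dict[str, List[str]],
                     rng: random.Random) -> Tuple[List[str], List[str]]:
    """Swap every entity for a random same-type entity from `pool`, then re-label."""
    chars: List[str] = []
    tags: List[str] = []
    for text, etype in segments:
        if etype != "O" and pool.get(etype):
            text = rng.choice(pool[etype])    # candidate of the same entity type
        chars.extend(text)
        tags.extend(bios(text, etype))
    return chars, tags

# Hypothetical case and pool; in practice the pool would be collected
# from the entities of the other labeled cases.
segments = [("Chen", "name"), (", ", "O"), ("35", "age"), (", ", "O"),
            ("cerebral infarction", "disease")]
pool = {"name": ["Ban", "Liu"], "age": ["75", "41"], "disease": ["epilepsy"]}
new_chars, new_tags = replace_entities(segments, pool, random.Random(0))
```

The surrounding non-entity text is copied unchanged, so the sentence structure of the case is preserved while its entity content varies.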
(C) Combining random masking with random replacement to perform data enhancement on the rehabilitation medical case data divided by the named entity, obtaining rehabilitation medical case data with new labels.
Illustratively, as shown in FIG. 6, with [mask] standing for the mask symbol, after random replacement and random masking: "Chen Dage" is replaced by "Ban Qinhao" with its middle character masked; "35 years old" becomes "75[mask]"; "October 29, 2022" becomes "[mask]021 year, [mask]7 month, 01[mask]"; "dyskinesia" is replaced by "healthy-side leg flexion during walking" with two characters masked; and "cerebral infarction" is replaced by "epilepsy" with one character masked.
It should be noted that the combined method applies random masking on top of random replacement: the replaced data, which covers a wider range of the symptoms that occur in practice in this example, is then masked, so that the enhanced rehabilitation medical case data better matches the quality of real rehabilitation medical data.
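The combined mode, replacement first and masking second, can be sketched in one pass as follows. As before, the mask symbol `□`, the segment representation, the entity pool, the at-least-one-mask rule, and the function name `augment` are illustrative assumptions rather than details given in the source.

```python
import random
from typing import Dict, List, Optional, Tuple

MASK = "□"  # hypothetical mask symbol

def augment(segments: List[Tuple[str, str]],
            pool: Dict[str, List[str]],
            rate: float = 0.25,
            rng: Optional[random.Random] = None) -> Tuple[List[str], List[str]]:
    """Replace each entity with a same-type entity, then mask part of it."""
    rng = rng or random.Random()
    chars: List[str] = []
    tags: List[str] = []
    for text, etype in segments:
        if etype == "O":                       # non-entity text is left untouched
            chars.extend(text)
            tags.extend(["O"] * len(text))
            continue
        # Step 1: random replacement (falls back to the original entity).
        new = list(rng.choice(pool.get(etype) or [text]))
        # Step 2: random masking of the replaced entity.
        k = min(max(1, round(rate * len(new))), max(1, len(new) - 1))
        for i in rng.sample(range(len(new)), k):
            new[i] = MASK
        chars.extend(new)
        if len(new) == 1:
            tags.append(f"S-{etype}")
        else:
            tags.extend([f"B-{etype}"] + [f"I-{etype}"] * (len(new) - 1))
    return chars, tags

# Hypothetical case and pool.
segments = [("Chen", "name"), (", ", "O"), ("35", "age")]
pool = {"name": ["Ban Qinhao"], "age": ["75"]}
aug_chars, aug_tags = augment(segments, pool, rate=0.25, rng=random.Random(1))
```

Running the three modes (masking only, replacement only, combined) over the same initial cases and merging the outputs with the initial data would give the four training sets that the evaluation later compares.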
Step S3: constructing a named entity recognition network, and inputting the initial rehabilitation medical case data and the rehabilitation medical case data with new labels into the named entity recognition network to obtain a rehabilitation medical named entity recognition result.
The method further comprises: fine-tuning the named entity recognition network using the initial rehabilitation medical case data and the rehabilitation medical case data with new labels.
Considering the data distribution of the rehabilitation medical data set used in this example, the named entity recognition network is constructed with a RoBERTa+BiLSTM+CRF structure. RoBERTa acquires strong sentence-level semantic extraction capability through pre-training on a large-scale corpus consisting of Chinese Wikipedia, and yields better results when combined with the named entity recognition task through fine-tuning. BiLSTM fuses context information, so that the network learns the semantic information of sentences better. Sequence labeling can be viewed as multi-class classification of the sequence elements with no explicit constraints between the labels. Named entity recognition, however, is a joint labeling task in which the labels depend on one another. The CRF introduces feature functions so that the features at every time step are taken into account when the global sequence is scored; information is captured more comprehensively, which improves the accuracy of sequence labeling.
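To illustrate why label dependencies matter, the sketch below decodes a label sequence with the Viterbi algorithm, the standard decoding procedure for a linear-chain CRF. It is not the patent's network: the emission scores stand in for what a BiLSTM layer might output, the single transition penalty is invented for the example, and the function name `viterbi` is hypothetical.

```python
from typing import Dict, List, Tuple

def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float],
            labels: List[str]) -> List[str]:
    """Return the highest-scoring label path.

    emissions[t][y]     : score of label y at position t
    transitions[(a, b)] : score of moving from label a to label b (default 0)
    """
    score = {y: emissions[0][y] for y in labels}
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        new_score: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for y in labels:
            # Best previous label for arriving at y.
            prev = max(labels, key=lambda p: score[p] + transitions.get((p, y), 0.0))
            new_score[y] = score[prev] + transitions.get((prev, y), 0.0) + emissions[t][y]
            pointers[y] = prev
        score = new_score
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: score[y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

labels = ["O", "B-CM", "I-CM"]
emissions = [
    {"O": 1.0, "B-CM": 0.2, "I-CM": 0.1},
    {"O": 0.3, "B-CM": 0.5, "I-CM": 0.6},
    {"O": 0.2, "B-CM": 0.1, "I-CM": 0.9},
]
# Illustrative constraint: an I- label may not directly follow O.
transitions = {("O", "I-CM"): -1e4}

independent = viterbi(emissions, {}, labels)       # no label dependencies
constrained = viterbi(emissions, transitions, labels)
```

Without transition scores the decoder picks `O, I-CM, I-CM`, which is not a valid BIOS sequence; with the penalty it returns `O, B-CM, I-CM`. In a trained CRF the transition scores are learned rather than set by hand.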
Fig. 7 shows the network structure of roberta+bilstm+crf.
The initial rehabilitation medical case data and the rehabilitation medical case data with new labels are recorded as a first text sequence $X = (x_1, x_2, \dots, x_n)$, where $x_i$ represents each character in the sentence and $n$ represents the length of the sentence text. A start identifier [CLS] is added at the start position of the first text sequence, which is then input into the RoBERTa network to obtain a first vector representation containing the information of each character. The first vector representation is aligned with the initial rehabilitation medical case data to obtain a second vector representation. The second vector representation is input into the BiLSTM network for semantic learning and processing of the context to obtain a third vector representation. The third vector representation is input into the CRF layer to obtain a predicted sequence representation. The predicted sequence is mapped through the vocabulary and the label list to obtain a predicted label sequence, which gives the rehabilitation medical named entity recognition result.
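The last step, mapping predicted label indices back to labels and reading off the entities, can be sketched as follows. The function name `decode_entities`, the label list `id2tag`, and the sample input are assumptions; the sketch also assumes the [CLS] position has already been dropped, so that there is exactly one predicted index per character.

```python
from typing import List, Tuple

def decode_entities(chars: List[str],
                    tag_ids: List[int],
                    id2tag: List[str]) -> List[Tuple[str, str]]:
    """Turn predicted label indices into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    buf: List[str] = []
    cur = None

    def flush() -> None:
        """Close the entity currently being collected, if any."""
        nonlocal buf, cur
        if buf:
            entities.append(("".join(buf), cur))
        buf, cur = [], None

    for ch, tid in zip(chars, tag_ids):
        tag = id2tag[tid]                      # index position -> label
        if tag == "O":
            flush()
            continue
        prefix, etype = tag.split("-", 1)
        if prefix in ("B", "S") or etype != cur:
            flush()                            # a new entity starts here
            cur = etype
        buf.append(ch)
        if prefix == "S":
            flush()                            # single-character entity ends at once
    flush()
    return entities

# Hypothetical label list and prediction.
id2tag = ["O", "B-name", "I-name", "B-age", "I-age"]
chars = list("Liu, 41")
tag_ids = [1, 2, 2, 0, 0, 3, 4]
found = decode_entities(chars, tag_ids, id2tag)
```

For this input `found` is `[("Liu", "name"), ("41", "age")]`. An `I-` label that appears without a preceding `B-` is treated as the start of a new entity, which is a lenient choice made for the sketch.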
An evaluation score is calculated using the predicted tag sequence.
The evaluation score is calculated by micro-averaging (Micro-Averaging), which counts every instance in the data set regardless of class to build a global confusion matrix and then computes the corresponding indexes. The calculation formulas are as follows:
$$\mathrm{MicroP} = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} (TP_i + FP_i)}, \qquad \mathrm{MicroR} = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} (TP_i + FN_i)}, \qquad \mathrm{MicroF} = \frac{2 \cdot \mathrm{MicroP} \cdot \mathrm{MicroR}}{\mathrm{MicroP} + \mathrm{MicroR}}$$
where $n$ represents the total number of texts in the sample, $TP_i$ represents the number of correctly identified entities in the $i$-th text, $FP_i$ represents the number of erroneously identified entities in the $i$-th text, and $FN_i$ represents the number of entities in the $i$-th text that were not identified. MicroP is the precision, the proportion of correctly identified entities among all identified entities; MicroR is the recall, the proportion of correctly identified entities among all true entities; MicroF is the harmonic mean of MicroP and MicroR. The value range of MicroF is $[0, 1]$, and the closer MicroF is to 1, the better the performance of the named entity recognition network.
Micro-averaging pools the prediction results of all entity classes and computes the precision MicroP and the recall MicroR from the pooled counts. In general, when a given entity class makes up only a small share of all samples, conventional evaluation indexes may not reflect the recognition effect well. With micro-averaging as the evaluation index, every entity instance carries equal weight regardless of its class, the evaluation focuses on the result of each sample, and the result over the whole data set is closer to the objective result.
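A minimal sketch of the micro-averaged scores follows. It assumes each text's entities are represented as a set of (start, end, type) triples and that an entity counts as correct only on an exact match of span and type; the source does not state its matching criterion, and the function name `micro_prf` is hypothetical.

```python
from typing import List, Set, Tuple

Entity = Tuple[int, int, str]  # (start, end, entity_type)

def micro_prf(gold: List[Set[Entity]],
              pred: List[Set[Entity]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F score over all texts."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)      # correctly identified entities
        fp += len(p - g)      # predicted entities that are wrong
        fn += len(g - p)      # true entities that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Hypothetical data: two texts, one boundary error in the first.
gold = [{(0, 3, "name"), (5, 7, "age")}, {(0, 4, "CM")}]
pred = [{(0, 3, "name"), (5, 8, "age")}, {(0, 4, "CM")}]
p, r, f = micro_prf(gold, pred)
```

The counts are summed over all texts before any ratio is taken, which is what distinguishes micro-averaging from averaging per-class or per-text scores.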
The closer the evaluation function result of the network in this example is to 1, the better the performance of the named entity recognition network. In Table 1 below, column 1 lists the named entity types defined in step S1; column 2 gives the extraction precision for each named entity when only the initial rehabilitation medical case data is input into the named entity recognition network; columns 3 to 5 give the extraction precision for each named entity when the initial rehabilitation medical case data is merged with, respectively, the new-label data obtained by random masking, the new-label data obtained by random replacement, and the new-label data obtained by combining random masking with random replacement, and then input into the named entity recognition network. In each row, the best extraction precision for that named entity is shown in bold.
Table 1: named entity identification evaluation result table
Figure SMS_31
As can be seen from Table 1, the small sample rehabilitation medical named entity identification method based on data enhancement provided by the invention generates new validly labeled data to address the lack of validly labeled data in the rehabilitation medical field of this example, and has a clear advantage in achieving higher-precision named entity recognition with only a small amount of rehabilitation medical data.
As shown in FIG. 8, the original case information is "Liu x, man, 41 years old cerebral infarction, left lower limb organism weakness for more than one month". Before use: the muscle strength of the left lower limb is 3 level, the standing position is balanced by 2 level, and other people assist in walking downwards. The lower limb exoskeleton robot has 1 course of treatment and is matched with PT manipulation for treatment. After use: the muscle strength of the left lower limb is 4 level, the standing position is balanced three levels, and the left lower limb walks independently. "the case information after data enhancement is input to a named entity recognition network," Liu x "is extracted as" name ", and" Man "is extracted as" gender: sex ","41 years "extract as" age: age "," cerebral infarction "is extracted as" diagnostic disease name: treatment ","1 course "extracts as" treatment time: the lower limb exoskeleton robot is extracted as rehabilitation equipment: device "," treatment with PT manipulation "is extracted as" other devices/treatments: ODT "," muscle strength 3 scale "is extracted as" quantized value: scale "," left lower limb "extracts as" influencing limb: AL "," before use "is extracted as" before use: pre "," post-use "extract as" post-use: post "," organism weakness "," standing balance level 2 "," walking under assistance of others "," standing balance level three "," independent walking "are extracted as" clinical manifestations: CM ", can be effectual from this paragraph draw the naming entity that defines in advance, finish the information extraction to the important information of former case information.
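To make the link between extracted entities and the BIOS labels concrete, the sketch below converts entity spans into a BIOS tag sequence and maps the tags to index positions in a label list. It is only an illustration: the token-level representation, the helper names `bios_tags` and `build_index`, and the label spellings are assumptions, and the patent itself labels individual characters of the case text.

```python
def bios_tags(n_tokens, entities):
    """Return one BIOS tag per token.

    entities: (start, end, label) spans with `end` exclusive.
    B = first token of an entity, I = remaining tokens of the entity,
    O = non-entity token, S = entity consisting of a single token.
    """
    tags = ["O"] * n_tokens
    for start, end, label in entities:
        if end - start == 1:
            tags[start] = f"S-{label}"
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end):
                tags[i] = f"I-{label}"
    return tags


def build_index(items):
    """Map each distinct item to its first-seen position (a word or label list)."""
    index = {}
    for item in items:
        index.setdefault(item, len(index))
    return index


# Hypothetical tokenisation of the start of the example case.
tokens = ["Liu", "x", ",", "male", ",", "41", "years", "old"]
entities = [(0, 2, "name"), (3, 4, "sex"), (5, 8, "age")]
tags = bios_tags(len(tokens), entities)
label_index = build_index(tags)
tag_ids = [label_index[t] for t in tags]
```

Here the single-token entity "male" receives an S tag, while the multi-token entities receive a B tag followed by I tags.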
Therefore, the small sample rehabilitation medical named entity identification method based on data enhancement provided by the invention is used for generating additional effective label data through the data enhancement method according to the existing small amount of effective label data under the condition of lacking enough effective label data; the initial rehabilitation medical case data and the newly generated rehabilitation medical case data with new labels are simultaneously input into a named entity identification network, and the named entity identification network pre-trained by using the universal identification data is applied to the rehabilitation medical field through a pre-training and fine-tuning means for extracting important information in the rehabilitation medical data in the example. Meanwhile, compared with the initial rehabilitation medical case data which is independently input into a named entity recognition network, the accuracy of the recognition of most named entities is greatly improved or leveled, and the named entities in the text data can be more effectively extracted.
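The two data enhancement operations summarized above, random masking of entities and random replacement between entities of the same type, can be sketched as follows. This is a simplified illustration under stated assumptions: a sentence is represented as a list of `(tokens, label)` segments with `label` set to `None` for non-entity text, the mask symbol is written as `[MASK]` (the patent shows its symbol only as a figure), each entity is masked independently with probability `mask_rate` as one reading of the "average masking rate", and all function names are hypothetical.

```python
import random

MASK = "[MASK]"  # assumed mask symbol; the patent gives it only as a figure


def random_mask(segments, mask_rate, rng):
    """Mask the content of each entity with probability `mask_rate`.

    The entity keeps its length and label, so new labels can be regenerated.
    """
    out = []
    for tokens, label in segments:
        if label is not None and rng.random() < mask_rate:
            tokens = [MASK] * len(tokens)
        out.append((list(tokens), label))
    return out


def random_replace(segments, pool, rng):
    """Replace each entity with a random entity of the same type.

    pool: dict mapping an entity type to a list of candidate token lists
    collected from the labeled data. Types missing from the pool are kept.
    """
    out = []
    for tokens, label in segments:
        if label is not None and pool.get(label):
            tokens = rng.choice(pool[label])
        out.append((list(tokens), label))
    return out


# Hypothetical usage on a fragment of a case record.
rng = random.Random(0)
segments = [(["Liu", "x"], "name"), ([","], None), (["male"], "sex")]
pool = {"sex": [["male"], ["female"]]}
augmented = [random_mask(segments, 0.3, rng), random_replace(segments, pool, rng)]
```

Each augmented segment list can then be re-labeled and merged with the initial data before it is input to the named entity recognition network.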
Corresponding to the embodiment of the small sample rehabilitation medical named entity identification method based on data enhancement, the invention also provides an embodiment of the small sample rehabilitation medical named entity identification device based on data enhancement.
Referring to FIG. 9, the data-enhancement-based small sample rehabilitation medical named entity identification device provided by an embodiment of the invention includes one or more processors configured to implement the data-enhancement-based small sample rehabilitation medical named entity identification method of the above embodiment.
The embodiment of the data-enhancement-based small sample rehabilitation medical named entity identification device of the invention can be applied to any device with data processing capability, such as a computer. The device embodiment may be implemented in software, in hardware, or in a combination of hardware and software. Taking software implementation as an example, the device in the logical sense is formed by the processor of the device with data processing capability on which it resides reading the corresponding computer program instructions from nonvolatile memory into memory. In terms of hardware, FIG. 9 shows a hardware structure diagram of the device with data processing capability on which the data-enhancement-based small sample rehabilitation medical named entity identification device of the invention resides. In addition to the processor, memory, network interface, and nonvolatile memory shown in FIG. 9, the device with data processing capability on which the device of the embodiment resides may generally include other hardware according to its actual function, which is not described here.
For the implementation of the functions and roles of each unit in the above device, see the implementation of the corresponding steps in the above method; details are not repeated here.
Since the device embodiments essentially correspond to the method embodiments, reference may be made to the description of the method embodiments for the relevant points. The device embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the invention. Those of ordinary skill in the art can understand and implement the invention without creative effort.
The embodiment of the invention also provides a computer readable storage medium, wherein a program is stored on the computer readable storage medium, and when the program is executed by a processor, the method for identifying the small sample rehabilitation medical named entity based on data enhancement in the embodiment is realized.
The computer readable storage medium may be an internal storage unit, such as a hard disk or memory, of any device with data processing capability described in any of the previous embodiments. The computer readable storage medium may also be an external storage device of such a device, for example a plug-in hard disk, a Smart Media Card (SMC), an SD card, or a flash memory card (Flash Card) provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of the device. The computer readable storage medium is used to store the computer program and the other programs and data required by the device, and may also be used to temporarily store data that has been output or is to be output.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. The specification and examples are to be regarded in an illustrative manner only.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof.

Claims (10)

1. A method for identifying a small sample rehabilitation medical named entity based on data enhancement, the method comprising:
acquiring initial rehabilitation medical case data, dividing named entities, and performing BIOS labeling on the rehabilitation medical case data divided by the named entities;
performing data enhancement on the rehabilitation medical case data divided by the named entity to obtain rehabilitation medical case data with new labels; comprising the following steps:
analyzing the length of each named entity in the rehabilitation medical case data divided by the named entity, and randomly masking different named entities in the rehabilitation medical case data;
and/or,
randomly replacing named entities in the rehabilitation medical case data among named entity types of the same type;
and inputting the initial rehabilitation medical case data and the rehabilitation medical case data with the new label into a named entity recognition network to obtain a rehabilitation medical named entity recognition result.
2. The method for identifying a named entity of a small sample rehabilitation medical treatment based on data enhancement according to claim 1, wherein the named entity type corresponding to the rehabilitation medical treatment case data comprises:
name, sex, age, diagnosed disease name, course of disease, affected limb, underlying disease/other disease, clinical manifestation, quantified value, rehabilitation device, treatment time, other device/treatment, before use, after use.
3. The method for identifying small sample rehabilitation medical named entity based on data enhancement according to claim 1 or 2, wherein the BIOS labeling of the named entity-divided rehabilitation medical case data comprises:
BIOS labeling is carried out on the rehabilitation medical case data divided by the named entity so as to construct a word list and a label list, and characters and labels in the rehabilitation medical case data divided by the named entity are respectively mapped into index positions in the word list and the label list;
wherein B represents a first character constituting an entity, I represents other characters constituting the entity than the first character, O represents a non-entity character, and S represents a single entity character.
4. The method for identifying small sample rehabilitation medical named entity based on data enhancement according to claim 1, wherein analyzing the length of each named entity in the rehabilitation medical case data after the named entity division, and performing random masking on different named entities in the rehabilitation medical case data comprises:
analyzing the length of each named entity in the rehabilitation medical case data divided by the named entity, setting an average masking rate for the entities, and using the
Figure QLYQS_1
symbol to randomly mask the contents of different named entities in the rehabilitation medical case data.
5. The data-enhanced small sample rehabilitation medical named entity identification method of claim 1, wherein randomly replacing named entities in rehabilitation medical case data between named entity types of the same type comprises:
classifying the named entities in the rehabilitation medical case data divided by the named entity according to type, and randomly replacing named entities with other named entities of the same type.
6. The data-enhanced small sample rehabilitation medical named entity recognition method according to claim 1, wherein the named entity recognition network consists of a RoBERTa network, a BiLSTM network and a CRF layer which are connected in sequence.
7. The method for identifying a small sample rehabilitation medical named entity based on data enhancement according to claim 6, wherein inputting initial rehabilitation medical case data and rehabilitation medical case data with new labels into a named entity identification network, obtaining a rehabilitation medical named entity identification result comprises:
recording the initial rehabilitation medical case data and the rehabilitation medical case data with new labels as a first text sequence, adding a start identifier at the start position of the first text sequence, and inputting them into a RoBERTa network to obtain a first vector representation containing the information of each character;
aligning the first vector representation with the initial rehabilitation medical case data to obtain a second vector representation;
inputting the second vector representation into a BiLSTM network to perform semantic learning and processing of the context, and obtaining a third vector representation;
inputting the third vector representation into the CRF layer to obtain a predicted sequence representation; the predicted sequence is mapped by the word list and the tag list to obtain a predicted tag sequence, and a rehabilitation medical named entity recognition result is obtained.
8. The method for identifying a small sample rehabilitation medical named entity based on data enhancement according to claim 6, wherein inputting initial rehabilitation medical case data and rehabilitation medical case data with new labels into a named entity identification network, obtaining a rehabilitation medical named entity identification result further comprises:
and fine tuning the named entity recognition network by using the initial rehabilitation medical case data and the rehabilitation medical case data with the new label.
9. A data enhancement based small sample rehabilitation medical named entity recognition device comprising one or more processors configured to implement the data enhancement based small sample rehabilitation medical named entity recognition method of any one of claims 1-8.
10. A computer readable storage medium having stored thereon a program which, when executed by a processor, is adapted to carry out the data enhancement based small sample rehabilitation medical named entity identification method of any of claims 1-8.
CN202310612923.1A 2023-05-29 2023-05-29 Small sample rehabilitation medical named entity identification method and device based on data enhancement Pending CN116341556A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310612923.1A CN116341556A (en) 2023-05-29 2023-05-29 Small sample rehabilitation medical named entity identification method and device based on data enhancement

Publications (1)

Publication Number Publication Date
CN116341556A true CN116341556A (en) 2023-06-27

Family

ID=86879093

Country Status (1)

Country Link
CN (1) CN116341556A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257441A (en) * 2020-09-15 2021-01-22 浙江大学 Named entity identification enhancement method based on counterfactual generation
CN113779959A (en) * 2021-08-31 2021-12-10 西南电子技术研究所(中国电子科技集团公司第十研究所) Small sample text data mixing enhancement method
CN113947086A (en) * 2021-10-26 2022-01-18 北京沃东天骏信息技术有限公司 Sample data generation method, training method, corpus generation method and apparatus
CN114372465A (en) * 2021-09-29 2022-04-19 武汉工程大学 Legal named entity identification method based on Mixup and BQRNN
CN114611513A (en) * 2022-01-19 2022-06-10 达闼机器人股份有限公司 Sample generation method, model training method, entity identification method and related device
CN114638214A (en) * 2022-03-18 2022-06-17 中国人民解放军国防科技大学 Method for identifying Chinese named entities in medical field
CN114861600A (en) * 2022-07-07 2022-08-05 之江实验室 NER-oriented Chinese clinical text data enhancement method and device
CN115310446A (en) * 2022-08-03 2022-11-08 湖南中医药大学 Traditional Chinese medicine ancient book named entity identification method and device, electronic equipment and memory

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FLYING_SFENG: "文本数据增强方法总结", HTTPS://BLOG.CSDN.NET/FLYING_SFENG/ARTICLE/DETAILS/121691380 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination