WO2024119773A1 - Text annotation method and apparatus, electronic device, and readable storage medium - Google Patents

Text annotation method and apparatus, electronic device, and readable storage medium

Info

Publication number
WO2024119773A1
Authority
WO
WIPO (PCT)
Prior art keywords
data set, labeled, annotated, confidence, language model
Application number
PCT/CN2023/101690
Other languages
English (en), French (fr)
Inventor
周镇镇
Original Assignee
苏州元脑智能科技有限公司
Application filed by 苏州元脑智能科技有限公司
Publication of WO2024119773A1


  • Some embodiments of the present application relate to the field of artificial intelligence technology, and in particular, to a text annotation method, a text annotation device, an electronic device, and a non-volatile readable storage medium.
  • large-scale pre-trained language models have a wide range of general basic knowledge and show obvious advantages in natural language dialogue, small talk, open-domain question answering, reading comprehension, etc.
  • large-scale pre-trained language models also have shortcomings.
  • on the one hand, large-scale pre-trained language models relatively lack professional knowledge in different industry fields, so there are certain difficulties in actual projects; on the other hand, due to the huge number of parameters of large-scale pre-trained language models, the inference time is relatively long and cannot meet high-frequency concurrency requirements.
  • Some embodiments of the present application provide a text annotation method, device, electronic device and non-volatile readable storage medium to solve, or partially solve, the problems that current methods of using pre-trained language models to annotate text based on professional knowledge in different industry fields are technically deficient, inefficient, costly and time-consuming, and cannot meet high-frequency concurrency requirements.
  • Some embodiments of the present application disclose a text annotation method, the method comprising:
  • an initial labeling data set is obtained; wherein the initial labeling data set includes samples to be labeled;
  • the initial labeled dataset, the first labeled dataset, the mining difficult sample dataset and the enhanced dataset are mixed to obtain the target labeled dataset of the samples to be labeled.
  • inputting the samples to be labeled in the initial labeled data set into a pre-trained language model that has been pre-trained, and outputting a first labeled data set for the samples to be labeled includes:
  • the pre-built first prompt of the pre-trained language model is input as data into the pre-trained language model to obtain a first labeled data set of samples to be labeled.
  • the first prompt of the pre-trained language model consists of a labeling task, a first case, and a sample to be labeled.
  • the first case is a case of performing the labeling task, and the first case consists of labeled text, the labeling task, and named entities;
  • the labeled text contains named entities.
  • the sample to be labeled consists of the text to be labeled and the labeling task.
  • the text to be labeled contains named entities.
  • inputting the samples to be labeled in the initial labeled data set into a pre-trained language model that has been pre-trained, and outputting a first labeled data set for the samples to be labeled includes:
  • the labeled named entities and the to-be-labeled texts are combined into data pairs, and a plurality of data pairs constitute a first labeled data set of the to-be-labeled samples.
  • the named entity includes at least the symptom subject, modifications of the symptom subject, symptom description, modifications of the symptom description, examination item names, and examination results, and the modifications of the symptom description include at least nature, frequency, time, condition, and degree.
  • the preset screening network includes an input layer, an embedding layer, a long short-term memory layer, an attention layer, a conditional random field layer, and a classification network layer, wherein the embedding layer produces the confidence values used to divide the first labeled data set into confidence data sets corresponding to different confidence levels.
  • the confidence dataset includes a high confidence dataset, a medium confidence dataset, and a low confidence dataset.
  • the first annotated data set is input into a preset screening network, and the first annotated data set is divided into confidence data sets corresponding to different confidence levels, including:
  • when the first labeled data set is located in the high confidence region preset by the embedding layer in the screening network, the first labeled data set is divided into a high confidence data set;
  • when the first annotated data set is located in the medium confidence region preset by the embedding layer in the screening network, the first annotated data set is divided into a medium confidence data set;
  • when the first labeled data set is located in the low confidence region preset by the embedding layer in the screening network, the first labeled data set is divided into a low confidence data set.
  • data enhancement is performed on the mining difficult sample data set to obtain an enhanced data set, including:
  • all data in the mining difficult sample data set are filled into the second prompt of the pre-trained language model that has been pre-trained, and the second prompt of the pre-trained language model is input into the pre-trained language model as data to obtain an enhanced data set.
  • data enhancement is performed on the mining difficult sample data set to obtain an enhanced data set, and the method further includes:
  • the second prompt of the pre-trained language model consists of an enhanced task, a second case, and a sample to be enhanced.
  • the second case is a case of performing an enhanced task, and the second case consists of annotated text, an enhanced task, and similar sentences.
  • the mining difficult sample data set includes text to be enhanced, and data enhancement is performed on the mining difficult sample data set to obtain the enhanced data set, including:
  • the method further includes:
  • the enhanced data set is acquired multiple times to obtain multiple enhanced data sets.
  • the method further includes:
  • returning to the step of inputting the samples to be labeled in the initial labeled data set into the pre-trained language model that has been pre-trained, and outputting the first labeled data set for the samples to be labeled.
  • the initial annotated dataset, the first annotated dataset, the mining difficult sample dataset and the enhanced dataset are mixed to obtain a target annotated dataset of samples to be annotated, including:
  • the number of times the mixed enhanced data set has been acquired is determined, and when the number of acquisitions of the enhanced data set is greater than the preset number of executions, the target labeled data set of the samples to be labeled is output.
  • Some embodiments of the present application also disclose a text annotation device, the device comprising:
  • an initial text data set collection module, used to collect an initial text data set;
  • An initial annotated data set acquisition module is used to obtain an initial annotated data set in response to an annotation instruction operation on an initial text data set; wherein the initial annotated data set includes samples to be annotated;
  • a first annotated data set acquisition module used to input the samples to be annotated in the initial annotated data set into a pre-trained language model that has been pre-trained, and output a first annotated data set for the samples to be annotated;
  • a confidence data set division module used to input the first annotated data set into a preset screening network, and divide the first annotated data set into confidence data sets corresponding to different confidences;
  • a mining difficult sample data set acquisition module is used to obtain a mining difficult sample data set in response to a labeling instruction operation for a confidence data set with a confidence level lower than a preset confidence level;
  • an enhanced data set acquisition module, used to perform data enhancement on the mining difficult sample data set to obtain an enhanced data set;
  • the target annotation data set acquisition module is used to mix the initial annotation data set, the first annotation data set, the mining difficult sample data set and the enhanced data set to obtain the target annotation data set of the samples to be annotated.
  • the first annotated data set acquisition module is specifically used to:
  • the pre-built first prompt of the pre-trained language model is input as data into the pre-trained language model to obtain a first labeled data set of samples to be labeled.
  • the first annotated data set acquisition module is specifically used to:
  • the labeled named entities and the to-be-labeled texts are combined into data pairs, and a plurality of data pairs constitute a first labeled data set of the to-be-labeled samples.
  • the confidence data set partitioning module is specifically used to:
  • when the first labeled data set is located in the high confidence region preset by the embedding layer in the screening network, the first labeled data set is divided into a high confidence data set;
  • when the first annotated data set is located in the medium confidence region preset by the embedding layer in the screening network, the first annotated data set is divided into a medium confidence data set;
  • when the first labeled data set is located in the low confidence region preset by the embedding layer in the screening network, the first labeled data set is divided into a low confidence data set.
  • the enhanced data set acquisition module is specifically used to:
  • all data in the mining difficult sample data set are filled into the second prompt of the pre-trained language model that has been pre-trained, and the second prompt of the pre-trained language model is input into the pre-trained language model as data to obtain an enhanced data set.
  • the enhanced data set acquisition module is further configured to:
  • the mining difficult sample data set includes text to be enhanced, and the enhanced data set acquisition module is specifically used to:
  • the target annotation dataset acquisition module is specifically used to:
  • the number of times the mixed enhanced data set has been acquired is determined, and when the number of acquisitions of the enhanced data set is greater than the preset number of executions, the target labeled data set of the samples to be labeled is output.
  • Some embodiments of the present application also disclose an electronic device, including a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other via the communication bus;
  • a memory, used to store a computer program;
  • the processor is used to implement the methods of some embodiments of the present application when executing the program stored in the memory.
  • Some embodiments of the present application also disclose a non-volatile readable storage medium having instructions stored thereon, which, when executed by one or more processors, enable the processors to execute methods as described in some embodiments of the present application.
  • an initial text data set is first collected, and then an initial annotated data set is obtained in response to an annotation instruction operation on the initial text data set, wherein the initial annotated data set includes samples to be annotated; then the samples to be annotated in the initial annotated data set are input into a pre-trained language model that has been pre-trained, and a first annotated data set for the samples to be annotated is output; the first annotated data set is input into a preset screening network and divided into confidence data sets corresponding to different confidence levels; in response to a labeling instruction operation for the confidence data set with a confidence level lower than a preset confidence level, the mining difficult sample data set is obtained, and data enhancement is performed on it to obtain the enhanced data set; finally, the initial labeling data set, the first labeling data set, the mining difficult sample data set and the enhanced data set are mixed to obtain the target labeling data set of the samples to be labeled.
  • FIG. 1 is a flowchart of a text annotation method provided in some embodiments of the present application.
  • FIG. 2 is a schematic diagram of the model structure of a pre-trained language model provided in some embodiments of the present application.
  • FIG. 3 is a schematic diagram of a screening network structure provided in some embodiments of the present application.
  • FIG. 4 is a flow chart of a medical text annotation method provided in some embodiments of the present application.
  • FIG. 5 is a structural block diagram of a text annotation device provided in some embodiments of the present application.
  • FIG. 6 is a schematic diagram of the structure of a non-volatile readable storage medium provided in some embodiments of the present application.
  • FIG. 7 is a schematic diagram of the hardware structure of an electronic device implementing various embodiments of the present application.
  • large-scale pre-trained language models have a wide range of general basic knowledge and show obvious advantages in natural language dialogue, small talk, open-domain question answering, reading comprehension, etc.
  • large-scale pre-trained language models also have shortcomings.
  • due to the huge number of parameters of large-scale pre-trained language models, the inference time is relatively long and cannot meet high-frequency concurrency requirements.
  • one of the core inventions of the present application is that, in the text annotation process, an initial text data set is first collected; an initial annotation data set containing samples to be annotated is then obtained in response to an annotation instruction operation on the initial text data set; the samples to be annotated are input into a pre-trained language model that has been pre-trained, which outputs a first annotation data set for the samples to be annotated; the first annotation data set is input into a preset screening network and divided into confidence data sets corresponding to different confidence levels; in response to an annotation instruction operation for the confidence data set whose confidence is lower than a preset confidence, a mining difficult sample data set is obtained, and data enhancement is performed on it to obtain an enhanced data set; finally, the initial annotation data set, the first annotation data set, the mining difficult sample data set and the enhanced data set are mixed to obtain the target annotation data set for the samples to be annotated.
  • the present application automatically annotates text by means of a pre-trained language model, so that labeled data can be obtained quickly and the cost of manual annotation is reduced.
  • Step 101: collecting an initial text data set;
  • the collected initial text data set is a medical text data set
  • some embodiments of the present application will be described with respect to the medical text data set.
  • those skilled in the art can set text annotation objects according to actual needs, and the embodiments of the present application are not limited to this.
  • Step 102: in response to the annotation instruction operation on the initial text data set, an initial annotated data set is obtained; wherein the initial annotated data set contains samples to be annotated;
  • For the labeling instruction operation, it is the action of manual labeling; for the initial labeled data set, it is the result of manually labeling the medical text data set.
  • manual annotation refers to the manual identification and annotation of named entities in medical texts.
  • Some embodiments of the present application use ten categories of named entities for corresponding annotations, among which named entities can include symptom subjects, modifications of symptom subjects, symptom descriptions, modifications of symptom descriptions, examination item names, and examination results.
  • the modifications of symptom descriptions at least include nature, frequency, time, conditions, and degree.
  • each category of named entities and their annotated labels are used in the following manner, see Table 1:
  • named entities do not cross, overlap, or nest, and do not contain punctuation marks.
  • the left and right sides of the named entities to be labeled are labeled using corresponding labeling symbols, and the labels on the left and right sides of the named entities need to be symmetrical.
  • Named entities can be names of people, institutions, places, and all other entities identified by names. More general entities also include numbers, dates, currencies, addresses, etc.
  • biomedical named entities are involved.
  • important named entities include: gene names, protein names, protein structure attribute names, compound names, drug names, and disease names, among which the most important are gene names and protein names.
  • ten types of named entities will be labeled.
  • Named entities include at least symptom subjects, modifications of symptom subjects, symptom descriptions, modifications of symptom descriptions, examination item names, and examination results. The modifications of symptom descriptions include nature, frequency, time, conditions, and degree.
  • annotation specifications for named entities in some embodiments of the present application are as follows:
  • Symptom subject includes: human body parts, secretions, excretions or normal physiological activities of the human body, such as: head, chest, abdomen, arms, legs, stool, urine, sputum, etc., which are marked with “[SUB]”. For example:
  • Modifiers are marked with “[DS]” and are divided into three categories:
  • the symptom subject can be modified by locative words, nouns, adjectives, adverbs, numerals and quantifiers, and the modifier can appear before or after the symptom subject. For example:
  • when a body part is modified by another body part, the modifier is marked as the modification of the symptom subject, and the subject is marked as the symptom subject.
  • Monosyllabic words, such as: front, back, inside, outside, north, east, side, bottom, between, end, beside, etc.;
  • Disyllabic words such as: between, to the north, etc.;
  • Two monosyllabic words are used together, for example: front and back, left and right, up and down, northeast, etc.;
  • Symptom description refers to the patient's self-reported symptoms and abnormal signs, including phrases modified by adjectives, adverbs, verbs, nouns, etc. Some symptoms do not specify the site and can exist independently.
  • the default site is the whole body, which can be directly marked with "[DES]". For example:
  • the modifying elements include nature, frequency, time, condition and degree.
  • time modification is expressed as a time, such as early morning, evening, afternoon, etc.; other mixed or unclear time conditions can be unified as conditional modification, such as after dinner, before breakfast, and lying down at night, and are marked with "[DDC]". If conditional modification and time modification appear at the same time, they should be marked separately. For example:
  • The child's [DES] hernia [DES] [DS] is unilateral [DS] and [DDE] is quite large [DDE];
  • ultrasound examination revealed abnormal gastric bubble morphology with double bubble sign, and ultrasound examination indicated duodenal stenosis or atresia, dark area of amniotic fluid was 76 mm, and amniotic fluid index was 241 mm.
  • the marking specifications are as follows:
  • [EXM] Ultrasound [EXM] findings: [RES] Two-dimensional M-mode of heart [RES]. [EXM] Spectral Doppler blood flow measurement [EXM]: [RES] Aorta [RES]: [RES] Sinus 16MM [RES], [RES] Trunk 13MM [RES], [RES] Aortic arch 9MM [RES], [RES] Left atrium [RES]: [RES] 16MM [RES]. [RES] Left ventricle 20MM [RES] ([RES] Long axis [RES]), [RES] Ventricular septum [RES]: [RES] 6MM [RES];
  • the above are the annotation specifications for named entities in some embodiments of the present application; the annotation of medical texts in some embodiments of the present application follows these specifications. It should be noted that those skilled in the art can adjust the annotation specifications for named entities according to actual needs, and the present application does not impose any restrictions on this.
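  • To make the symmetric-tag convention above concrete, here is a minimal Python sketch (illustrative, not part of the patent) that extracts entities wrapped in symmetric marks; the tag list covers only the labels visible in the examples above, since the full ten-category table (Table 1) is not reproduced in this text:

    import re

    # Only the labels visible in the examples above; the full ten-category
    # table (Table 1) is not reproduced here, so this list is partial.
    TAGS = ["SUB", "DS", "DES", "DDC", "DDE", "EXM", "RES"]

    def extract_entities(annotated_text):
        """Return (tag, entity) pairs from symmetric [TAG]...[TAG] marks."""
        pairs = []
        for tag in TAGS:
            # The same label must open and close the entity span (symmetry rule).
            for match in re.finditer(rf"\[{tag}\](.*?)\[{tag}\]", annotated_text):
                pairs.append((tag, match.group(1).strip()))
        return pairs

    print(extract_entities("The child's [DES]hernia[DES] [DS]is unilateral[DS]."))
    # -> [('DS', 'is unilateral'), ('DES', 'hernia')]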
  • an initial medical text dataset is collected, and in response to a labeling instruction operation on the initial medical text dataset, manual labeling is performed to obtain an initial labeled dataset, wherein the initial labeled dataset includes an initial labeled sample and an initial labeled label.
  • Step 103: inputting the samples to be labeled in the initial labeled data set into a pre-trained language model that has been pre-trained, and outputting a first labeled data set for the samples to be labeled;
  • the sample to be labeled includes the text to be labeled and the labeling task;
  • the text to be labeled is the input text information, for example, "I have had intermittent abdominal pain since yesterday morning until now", and the text to be labeled contains named entities, such as "abdominal pain” and "intermittent" in the text information.
  • For the labeling task, it is to describe the labeling task to be performed in natural language, which can be to label symptoms, frequency, etc., for example, "point out the symptoms in the input sentence"; for the first labeled data set, it is the data set output by the pre-trained language model when labeling the samples to be labeled for the first time.
  • the samples to be labeled in the initial labeled data set are input into a pre-trained language model that has been pre-trained, and the pre-trained language model then outputs a first labeled data set for the samples to be labeled.
  • Step 104: inputting the first annotated data set into a preset screening network, and dividing the first annotated data set into confidence data sets corresponding to different confidence levels;
  • the screening network includes InputLayer, Embedding, LSTM, Attention Layer, CRF and Softmax.
  • the screening network is a BiLSTM+Attention mechanism+CRF neural network architecture.
  • this structure can extract deep features of tensors in the input layer, reduce the number of neurons, increase recognition accuracy and reduce training time.
  • it has the advantage of fast response that massive models lack.
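  • As a point of reference only, the layer stack named above (input layer, embedding, BiLSTM, attention, CRF, Softmax) could be sketched in PyTorch as follows; all sizes are assumptions, and the CRF layer comes from the third-party pytorch-crf package rather than from the patent itself:

    import torch
    import torch.nn as nn
    from torchcrf import CRF  # third-party "pytorch-crf" package (an assumption)

    class ScreeningNetwork(nn.Module):
        """BiLSTM + attention + CRF stack as named in the text; sizes are illustrative."""
        def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256, num_tags=21):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)        # embedding layer
            self.bilstm = nn.LSTM(embed_dim, hidden_dim // 2, bidirectional=True,
                                  batch_first=True)                     # BiLSTM layer
            self.attention = nn.MultiheadAttention(hidden_dim, num_heads=4,
                                                   batch_first=True)    # attention layer
            self.hidden2tag = nn.Linear(hidden_dim, num_tags)
            self.crf = CRF(num_tags, batch_first=True)                  # CRF layer

        def forward(self, token_ids, tags=None, mask=None):
            x = self.embedding(token_ids)
            x, _ = self.bilstm(x)
            x, _ = self.attention(x, x, x)                # self-attention over tokens
            emissions = self.hidden2tag(x)
            if tags is not None:                          # training: CRF negative log-likelihood
                return -self.crf(emissions, tags, mask=mask)
            return self.crf.decode(emissions, mask=mask)  # inference: best tag paths

    # One reading of the Softmax classification head mentioned in the text:
    # torch.softmax(emissions, dim=-1) yields per-token tag probabilities whose
    # maxima can serve as the entity confidence scores used below.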
  • the confidence interval of a probability sample is an interval estimate of a population parameter of the sample.
  • the confidence interval shows the degree to which the true value of the parameter has a certain probability of falling around the measurement result.
  • the confidence interval gives the credible range of the measured value of the measured parameter under the "certain probability" required above; this probability is called the confidence level.
  • each named entity has a corresponding confidence.
  • the first labeled data set can be input into a preset screening network, and the confidence of the named entity can be identified and inferred according to the screening network. Then, the first labeled data set can be batched according to the confidence of the embedding layer result of the screening network. Among them, the first labeled data set can be divided into a high-confidence data set, a medium-confidence data set and a low-confidence data set according to the confidence interval of the confidence.
  • the high-confidence credible range area is (0.8, 1]
  • the medium-confidence credible range area is (0.5, 0.8]
  • the low-confidence credible range area is (0, 0.5].
  • the confidence of the named entity is 0.4
  • the named entity is divided into data in the low-confidence credible range area. Since the first labeled data set contains named entities, the confidence batches are divided according to the total confidence results of the corresponding named entities in the first labeled data set. If the first labeled data set is in the high-confidence credible range area (0.8, 1], it is divided into a high-confidence data set. Similarly, if the first labeled data set is in the medium-confidence credible range area (0.5, 0.8], it is divided into a medium-confidence data set. If the first labeled data set is in the low-confidence credible range area (0, 0.5], it is divided into a low-confidence data set.
  • the first annotated data set is input into a preset screening network, and the confidence of the embedding layer results in the screening network divides the first annotated data set into confidence data sets corresponding to different confidences according to the corresponding credibility range, wherein the confidence data set includes a high confidence data set, a medium confidence data set and a low confidence data set.
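  • The batching rule just described amounts to a simple threshold split; a minimal sketch, assuming each record already carries the aggregate confidence of its named entities:

    def partition_by_confidence(scored_records):
        """Split (record, confidence) pairs into high/medium/low confidence sets.

        Intervals follow the text: high (0.8, 1], medium (0.5, 0.8], low (0, 0.5].
        """
        high, medium, low = [], [], []
        for record, conf in scored_records:
            if conf > 0.8:
                high.append(record)
            elif conf > 0.5:
                medium.append(record)
            else:
                low.append(record)
        return high, medium, low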
  • Step 105: in response to the labeling instruction operation for the confidence data set with a confidence level lower than a preset confidence level, obtaining a mining difficult sample data set;
  • For the labeling instruction operation, it is a manual labeling process; for the mining difficult sample data set, it is obtained by selecting the data in the first labeled data set corresponding to low confidence as samples.
  • the low-confidence credible range area is (0, 0.5]
  • the low-confidence data set falling in this credible range area is manually labeled, and the labeled data set is used as the mining difficult sample data set.
  • after the sample to be labeled is input into the pre-trained language model and the first labeled data set is obtained, the first labeled data set is input into the preset screening network, the confidence of the embedding layer results in the screening network divides the first labeled data set into confidence data sets corresponding to different confidence levels according to the corresponding credibility ranges, and the mining difficult sample data set is then obtained in response to the labeling instruction operation for the confidence data set whose confidence is lower than the preset confidence level.
  • Step 106: performing data enhancement on the mining difficult sample data set to obtain an enhanced data set;
  • For data enhancement, it is a preset data enhancement method; for the enhanced data set, it is the data set obtained by performing data enhancement on the mining difficult sample data set.
  • after the sample to be labeled is input into the pre-trained language model and the first labeled data set for the sample to be labeled is obtained, the first labeled data set is input into the preset screening network, and the confidence of the embedding layer results in the screening network divides the first labeled data set into confidence data sets corresponding to different confidence levels according to the corresponding credibility ranges; then, in response to the labeling instruction operation for the confidence data set with a confidence level lower than the preset confidence level, the mining difficult sample data set is obtained, and data enhancement is performed on all the data in the mining difficult sample data set to obtain the enhanced data set.
  • Step 107: the initial annotated data set, the first annotated data set, the mining difficult sample data set and the enhanced data set are mixed to obtain a target annotated data set of the samples to be annotated.
  • For the target annotation data set obtained for the samples to be annotated, it is the result of mixing the initial annotation data set, the first annotation data sets obtained multiple times, the mining difficult sample data set and the enhanced data set.
  • the initial annotated data set, the first annotated data set obtained each time, the mining difficult sample data set and the enhanced data set are mixed to obtain a target annotated data set of samples to be annotated.
  • an initial text data set is first collected; an initial annotation data set containing samples to be annotated is then obtained in response to an annotation instruction operation on the initial text data set; the samples to be annotated are input into a pre-trained language model that has been pre-trained, which outputs a first annotation data set for the samples to be annotated; the first annotation data set is input into a preset screening network and divided into confidence data sets corresponding to different confidence levels; in response to an annotation instruction operation for the confidence data set whose confidence is lower than a preset confidence, a mining difficult sample data set is obtained, and data enhancement is performed on it to obtain an enhanced data set; finally, the initial annotation data set, the first annotation data set, the mining difficult sample data set and the enhanced data set are mixed to obtain the target annotation data set for the samples to be annotated.
  • the present application automatically annotates text using a pre-trained language model, so that labeled data can be obtained quickly and the cost of manual annotation is reduced.
  • step 103 inputting the samples to be labeled in the initial labeled data set into a pre-trained language model that has been pre-trained, and outputting a first labeled data set for the samples to be labeled, includes:
  • the first prompt of the pre-built pre-trained language model is input as data into the pre-trained language model to obtain a first labeled data set of samples to be labeled.
  • the first Prompt of the pre-trained language model may include samples to be labeled.
  • the first Prompt of the pre-trained language model is pre-built and may be input into the pre-trained language model as data; the pre-trained language model executes the first Prompt, thereby outputting the labeling result for the sample to be labeled contained in the first Prompt.
  • the first Prompt of the pre-trained language model is input as data into the pre-trained language model, and the pre-trained language model executes the first Prompt of the pre-trained language model, and then outputs the label result of the sample to be labeled in the first Prompt of the pre-trained language model, that is, obtains the first labeled data set of the sample to be labeled.
  • step 103 inputting the samples to be labeled in the initial labeled data set into a pre-trained language model that has been pre-trained, and outputting a first labeled data set for the samples to be labeled, includes:
  • the labeled named entities and the to-be-labeled texts are combined into data pairs, and a plurality of data pairs constitute a first labeled data set of the to-be-labeled samples.
  • For the first prompt of the pre-trained language model, it is composed of the labeling task, the first case and the sample to be labeled.
  • For the first case, it is a case of performing the labeling task, for example, “Input: I have had intermittent abdominal pain since yesterday morning. Symptoms: abdominal pain.”
  • the first case consists of labeled text, labeling tasks, and named entities, and the labeled text contains named entities.
  • For the labeling task, it is to describe the labeling task to be performed in natural language, which can be to label symptoms, frequency, etc., for example, "point out the symptoms in the input sentence"; for the first labeled data set, it is the data set output by the pre-trained language model when labeling the samples to be labeled for the first time.
  • For the sample to be labeled, it is composed of the text to be labeled and the labeling task.
  • the text to be labeled contains named entities.
  • the text to be labeled is the input text information, for example, "I have had intermittent abdominal pain from yesterday morning to now", and the text to be labeled contains named entities, such as "abdominal pain” and "intermittent” in the text information.
  • Named entities can be names of people, institutions, places, and all other entities identified by names. More general entities also include numbers, dates, currencies, addresses, etc.
  • biomedical named entities are involved.
  • important named entities include: gene names, protein names, protein structure attribute names, compound names, drug names, and disease names, among which the most important are gene names and protein names.
  • ten types of named entities will be labeled.
  • Named entities include at least symptom subjects, modifications of symptom subjects, symptom descriptions, modifications of symptom descriptions, examination item names, and examination results. The modifications of symptom descriptions include nature, frequency, time, conditions, and degree.
  • the execution structure of the first prompt of the pre-trained language model for labeling named entities is as follows:
  • Input: I have had intermittent abdominal pain since yesterday morning. Symptoms: abdominal pain.
  • Input: I have been experiencing chest tightness from time to time in recent months. Symptoms: chest tightness.
  • Input: The child has a cold and keeps coughing at night. Symptoms: coughing.
  • Input: The child has a large hernia on one side. Symptoms:
  • the named entities in the case expression form can be considered as the output labeling results, that is, the named entities in the input text are obtained together with their corresponding labels, so that the named entities of the medical text are labeled with corresponding labels and the labeling result of this round is obtained; for the sample to be labeled, it is the "Input: The child has a large hernia on one side. Symptoms:" in the example, and its expression form is "Input: + text to be labeled + labeling task", in which the "Symptoms:" part of the sample to be labeled needs to be completed by the pre-trained language model to obtain the labeling result.
  • the first prompt of a pretrained language model is input as data into the pretrained language model, and the pretrained language model executes the first prompt of the pretrained language model.
  • the named entity in the text to be annotated is output, and then the label corresponding to the named entity is obtained, the obtained label is annotated for the named entity, and the labeled named entity and the text to be annotated are formed into data pairs, and the multiple data pairs constitute the first annotated data set of the samples to be annotated.
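  • Putting the pieces together, one plausible way to assemble the first prompt and collect the resulting data pairs is sketched below; "call_plm" is a hypothetical stand-in for whatever pre-trained language model interface is actually used:

    LABELING_TASK = "Point out the symptoms in the input sentence."

    FIRST_CASES = [  # labeled text + named entity, as in the examples above
        ("I have had intermittent abdominal pain since yesterday morning.", "abdominal pain"),
        ("I have been experiencing chest tightness from time to time in recent months.", "chest tightness"),
        ("The child has a cold and keeps coughing at night.", "coughing"),
    ]

    def build_first_prompt(text_to_label):
        """Assemble: labeling task + first cases + sample to be labeled."""
        lines = [LABELING_TASK]
        for text, entity in FIRST_CASES:
            lines.append(f"Input: {text} Symptoms: {entity}.")
        lines.append(f"Input: {text_to_label} Symptoms:")  # left for the model to complete
        return "\n".join(lines)

    def label_sample(text_to_label, call_plm):
        """Return one (named entity, text) data pair for the first labeled data set."""
        entity = call_plm(build_first_prompt(text_to_label)).strip().rstrip(".")
        return (entity, text_to_label)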
  • step 104 inputting the first labeled data set into a preset screening network, and dividing the first labeled data set into confidence data sets corresponding to different confidences, includes:
  • when the first labeled data set is located in the high confidence region preset by the embedding layer in the screening network, the first labeled data set is divided into a high confidence data set;
  • when the first annotated data set is located in the medium confidence region preset by the embedding layer in the screening network, the first annotated data set is divided into a medium confidence data set;
  • when the first labeled data set is located in the low confidence region preset by the embedding layer in the screening network, the first labeled data set is divided into a low confidence data set.
  • the screening network includes InputLayer (input layer), Embedding (embedding layer), LSTM (long short-term memory layer), Attention Layer (attention layer), CRF (conditional random field layer) and Softmax (classification network layer).
  • the screening network is a BiLSTM+Attention mechanism+CRF neural network architecture.
  • this structure can extract deep-level features of tensors in the input layer, reduce the number of neurons, increase recognition accuracy and reduce training time. When used on an inference platform, it has the advantage of fast response that massive models lack.
  • the confidence interval of a probability sample is an interval estimate of a population parameter of the sample.
  • the confidence interval shows the degree to which the true value of the parameter has a certain probability of falling around the measurement result.
  • the confidence interval gives the range of credibility of the measured value of the measured parameter, that is, the "certain probability" required above. This probability is called the confidence level.
  • each named entity has a corresponding confidence.
  • the first labeled data set can be input into a preset screening network, and the confidence of the named entity can be identified and inferred according to the screening network. Then, the first labeled data set can be batched according to the confidence of the embedding layer result of the screening network. Among them, the first labeled data set can be divided into a high-confidence data set, a medium-confidence data set and a low-confidence data set according to the confidence interval of the confidence.
  • the high confidence credible range area is (0.8, 1]
  • the medium confidence credible range area is (0.5, 0.8]
  • the low confidence credible range area is (0, 0.5].
  • the confidence of the named entity is 0.4
  • the named entity is divided into data in the low confidence area. Since the first labeled data set contains named entities, the confidence batches are divided according to the total confidence results of the corresponding named entities in the first labeled data set. If the first labeled data set is in the high confidence credible range area (0.8, 1], it is divided into a high confidence data set. Similarly, if the first labeled data set is in the medium confidence credible range area (0.5, 0.8], it is divided into a medium confidence data set. If the first labeled data set is in the low confidence credible range area (0, 0.5], it is divided into a low confidence data set.
  • if the first labeled data set is located in the high confidence area preset in the embedding layer of the screening network, the first labeled data set is divided into a high confidence data set; if the first labeled data set is located in the medium confidence area preset in the embedding layer of the screening network, it is divided into a medium confidence data set; if the first labeled data set is located in the low confidence area preset in the embedding layer of the screening network, it is divided into a low confidence data set.
  • step 106, performing data enhancement on the mining difficult sample data set to obtain an enhanced data set, includes:
  • all data in the mining difficult sample data set are filled into the second prompt of the pre-trained language model that has been pre-trained, and the second prompt of the pre-trained language model is input into the pre-trained language model as data to obtain an enhanced data set.
  • For the second prompt of the pre-trained language model, it can be used to perform data enhancement on the mining difficult sample data set.
  • all data in the mining difficult sample data set are filled into the second prompt of the pre-trained language model that has been pre-trained, and the second prompt of the pre-trained language model is input as data into the pre-trained language model.
  • the pre-trained language model executes the second prompt of the pre-trained language model to obtain an enhanced data set.
  • the mining difficult sample data set includes text to be enhanced
  • step 106 performing data enhancement on the mining difficult sample data set to obtain an enhanced data set
  • the method further includes:
  • the second prompt of the pre-trained language model can be used to perform data enhancement on the mining difficult sample data set; it is mainly used to generate similar sentences for all the data in the mining difficult sample data set.
  • the second prompt of the pre-trained language model consists of an enhancement task, a second case, and a sample to be enhanced.
  • the second case is a case for performing the enhancement task.
  • the second case consists of annotated text, an enhancement task, and similar sentences. For example, "Input: I have had intermittent abdominal pain from yesterday morning to now. Similar sentence: I have had abdominal pain every once in a while from yesterday morning to now.”
  • all data in the mining difficult sample data set is input into the second prompt of a pre-trained language model that has been pre-trained, and the second prompt of the pre-trained language model is input as data into the pre-trained language model, which is executed by the pre-trained language model to output similar sentences for all data in the mining difficult sample data set, and the similar sentences are mixed with the mining difficult sample data set to obtain an enhanced data set.
  • the mining difficult sample data set includes text to be enhanced
  • step 106, performing data enhancement on the mining difficult sample data set to obtain an enhanced data set includes:
  • the second prompt of the pre-trained language model is used to perform data enhancement on the mining difficult sample data set.
  • the second prompt of the pre-trained language model is composed of an enhancement task, a second case and a sample to be enhanced; wherein the second case is a case for performing the enhancement task, and the second case is composed of annotated text, an enhancement task and similar sentences, for example, "Input: I have had intermittent abdominal pain from yesterday morning to now. Similar sentence: I have had abdominal pain every once in a while from yesterday morning to now.”.
  • For the enhancement task, it is to describe the enhancement task to be performed in natural language, for example, "Give similar sentences to the following sentences"; for the sample to be enhanced, it is composed of the text to be enhanced and the enhancement task, for example, "Input: The child has a large hernia on one side. Similar sentence:".
  • the execution structure of the second prompt of the pre-trained language model for generating data augmentation samples is as follows:
  • Input: I have been experiencing chest tightness from time to time in recent months.
  • Input: The child has a cold and keeps coughing at night.
  • Input: The child has a large hernia on one side. Similar sentence:
  • the text to be enhanced in the mining difficult sample data set is input into the second prompt of the pre-trained language model
  • the second prompt of the pre-trained language model is input into the pre-trained language model as data
  • the pre-trained language model executes the second prompt of the pre-trained language model, and outputs a similar sentence corresponding to the text to be enhanced according to the second case in the second prompt of the pre-trained language model, and the similar sentence is mixed with the mining difficult sample data set to obtain an enhanced data set.
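  • The second prompt can be assembled in the same way; again a sketch, with "call_plm" a hypothetical stand-in, in which the generated similar sentences are mixed back into the mining difficult sample data set:

    ENHANCEMENT_TASK = "Give similar sentences to the following sentences."

    SECOND_CASES = [  # annotated text + similar sentence, as in the second case above
        ("I have had intermittent abdominal pain from yesterday morning to now.",
         "I have had abdominal pain every once in a while from yesterday morning to now."),
    ]

    def build_second_prompt(text_to_enhance):
        """Assemble: enhancement task + second cases + sample to be enhanced."""
        lines = [ENHANCEMENT_TASK]
        for text, similar in SECOND_CASES:
            lines.append(f"Input: {text} Similar sentence: {similar}")
        lines.append(f"Input: {text_to_enhance} Similar sentence:")
        return "\n".join(lines)

    def enhance(hard_texts, call_plm):
        """Generate a similar sentence per hard text and mix the two together."""
        similar = [call_plm(build_second_prompt(t)).strip() for t in hard_texts]
        return hard_texts + similar  # the enhanced data set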
  • step 107 mixing the initial annotated data set, the first annotated data set, the mining difficult sample data set and the enhanced data set to obtain a target annotated data set of samples to be annotated, includes:
  • the number of times the mixed enhanced data set has been acquired is determined, and when the number of acquisitions of the enhanced data set is greater than the preset number of executions, the target labeled data set of the samples to be labeled is output.
  • if the number of acquisitions is less than or equal to the preset number of executions, the method returns to execute the step of inputting the samples to be annotated in the initial annotated data set into the pre-trained language model that has been pre-trained and outputting the first annotated data set for the samples to be annotated, so as to obtain multiple enhanced data sets; wherein the preset number of executions is three.
  • since it is necessary to return to the above step to obtain multiple enhanced data sets, multiple first labeled data sets, mining difficult sample data sets and enhanced data sets will be obtained.
  • the initial labeled dataset is mixed with the first labeled dataset acquired multiple times, the difficult sample dataset and the enhanced dataset, and the number of times the enhanced dataset is acquired is determined.
  • the target labeled dataset of the samples to be labeled is output.
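  • Reusing the helpers sketched earlier, the outer loop could look roughly as follows; "score" (the screening-network confidence) and "relabel" (the manual labeling of the low-confidence set) are hypothetical stand-ins:

    def annotate_pipeline(texts, initial_labeled, call_plm, score, relabel,
                          preset_executions=3):
        """Steps 103-107 repeated until the enhanced data set has been
        acquired more than preset_executions times (three in the text)."""
        target = list(initial_labeled)                # start from the initial labeled set
        acquisitions = 0
        while acquisitions <= preset_executions:
            first_labeled = [label_sample(t, call_plm) for t in texts]   # step 103
            _, _, low = partition_by_confidence(score(first_labeled))    # step 104
            hard = relabel(low)                                          # step 105 (manual)
            enhanced = enhance([text for _, text in hard], call_plm)     # step 106
            acquisitions += 1
            target += first_labeled + hard                               # step 107: mix
            texts = enhanced              # enhanced texts are labeled on the next pass
        return target                     # target labeled data set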
  • an initial text data set is first collected; an initial annotated data set containing samples to be annotated is then obtained in response to an annotation instruction operation on the initial text data set; the samples to be annotated are input into a pre-trained language model that has been pre-trained, and a first annotated data set for the samples to be annotated is output; the first annotated data set is input into a preset screening network and divided into confidence data sets corresponding to different confidence levels; in response to an annotation instruction operation for the confidence data set lower than a preset confidence level, a mining difficult sample data set is obtained, and data enhancement is performed on it to obtain an enhanced data set; finally, the initial annotated data set, the first annotated data set, the mining difficult sample data set and the enhanced data set are mixed to obtain the target annotated data set of the samples to be annotated.
  • FIG. 4 shows a flow chart of a medical text annotation method provided in some embodiments of the present application. As can be seen from the figure:
  • the first prompt of the pre-trained language model includes samples to be labeled, the first prompt of the pre-trained language model is input as data into the pre-trained language model, and the pre-trained language model labels the samples to be labeled in the first prompt of the pre-trained language model.
  • in step S7, for the enhanced data set, the S2-S5 process is repeated a fixed number of times, namely three times: it is determined whether the number of annotations N is less than or equal to three; if N is less than or equal to three, the method returns to execute step S2; if N is greater than three, the number of acquisitions of the enhanced data set also satisfies the three-times condition, and the target labeled data set is output.
  • a structural block diagram of a text annotation device provided in some embodiments of the present application is shown, which may specifically include the following modules:
  • an initial text data set collecting module 501, used to collect an initial text data set;
  • the initial annotated data set acquisition module 502 is used to obtain an initial annotated data set in response to an annotation instruction operation on the initial text data set; wherein the initial annotated data set includes samples to be annotated;
  • a first annotated data set acquisition module 503 is used to input the samples to be annotated in the initial annotated data set into a pre-trained language model that has been pre-trained, and output a first annotated data set for the samples to be annotated;
  • a confidence data set division module 504 is used to input the first annotated data set into a preset screening network, and divide the first annotated data set into confidence data sets corresponding to different confidences;
  • the mining difficult sample data set acquisition module 505 is used to obtain the mining difficult sample data set in response to the labeling instruction operation for the confidence data set with a confidence level lower than a preset confidence level;
  • an enhanced data set acquisition module 506, used to perform data enhancement on the mining difficult sample data set to obtain an enhanced data set;
  • the target annotated data set acquisition module 507 is used to mix the initial annotated data set, the first annotated data set, the mining difficult sample data set and the enhanced data set to obtain the target labeled data set of the samples to be labeled.
  • the first annotated data set acquisition module 503 is specifically used to:
  • the pre-built first prompt of the pre-trained language model is input as data into the pre-trained language model to obtain a first labeled data set of samples to be labeled.
  • the first annotated data set acquisition module 503 is specifically used to:
  • the labeled named entities and the to-be-labeled texts are combined into data pairs, and a plurality of data pairs constitute a first labeled data set of the to-be-labeled samples.
  • the confidence data set partitioning module 504 is specifically used for:
  • when the first labeled data set is located in the high confidence region preset by the embedding layer in the screening network, the first labeled data set is divided into a high confidence data set;
  • when the first annotated data set is located in the medium confidence region preset by the embedding layer in the screening network, the first annotated data set is divided into a medium confidence data set;
  • when the first labeled data set is located in the low confidence region preset by the embedding layer in the screening network, the first labeled data set is divided into a low confidence data set.
  • the enhanced data set acquisition module 506 is specifically used to:
  • all data in the mining difficult sample data set are filled into the second prompt of the pre-trained language model that has been pre-trained, and the second prompt of the pre-trained language model is input into the pre-trained language model as data to obtain an enhanced data set.
  • the enhanced data set acquisition module 506 is further configured to:
  • the mining difficult sample data set includes text to be enhanced, and the enhanced data set acquisition module 506 is specifically used to:
  • the target annotation data set acquisition module 507 is specifically used to:
  • the number of times the mixed enhanced data set has been acquired is determined, and when the number of acquisitions of the enhanced data set is greater than the preset number of executions, the target labeled data set of the samples to be labeled is output.
  • since the device embodiments are basically similar to the method embodiments, the description is relatively simple, and for relevant parts reference may be made to the partial description of the method embodiments.
  • some embodiments of the present application further provide an electronic device, including: a processor, a memory, and a computer program stored in the memory and executable on the processor.
  • FIG6 is a schematic diagram of the structure of a non-volatile readable storage medium provided in some embodiments of the present application.
  • Some embodiments of the present application also provide a non-volatile readable storage medium 601, on which a computer program is stored.
  • a non-volatile readable storage medium 601 is, for example, a read-only memory (ROM), a random access memory (RAM), a disk or an optical disk, etc.
  • FIG. 7 is a schematic diagram of the hardware structure of an electronic device implementing various embodiments of the present application.
  • the electronic device 700 includes but is not limited to: a radio frequency unit 701, a network module 702, an audio output unit 703, an input unit 704, a sensor 705, a display unit 706, a user input unit 707, an interface unit 708, a memory 709, a processor 710, and a power supply 711. It can be understood by those skilled in the art that the electronic device structure shown in FIG. 7 does not constitute a limitation on the electronic device, and the electronic device may include more or fewer components than shown, or combine certain components, or arrange the components differently. In some embodiments of the present application, the electronic device includes but is not limited to a mobile phone, a tablet computer, a laptop computer, a PDA, a vehicle-mounted terminal, a wearable device, and a pedometer.
  • the radio frequency unit 701 can be used to receive and send signals during the process of sending and receiving information or making a call; specifically, after receiving downlink data from a base station, the radio frequency unit sends it to the processor 710 for processing, and it also sends uplink data to the base station.
  • the radio frequency unit 701 includes but is not limited to an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, etc.
  • the radio frequency unit 701 can also communicate with the network and other devices through a wireless communication system.
  • the electronic device provides users with wireless broadband Internet access through the network module 702, such as helping users to send and receive emails, browse web pages, and access streaming media.
  • the audio output unit 703 can convert the audio data received by the RF unit 701 or the network module 702 or stored in the memory 709 into an audio signal and output it as sound. Moreover, the audio output unit 703 can also provide audio output related to a specific function performed by the electronic device 700 (for example, a call signal reception sound, a message reception sound, etc.).
  • the audio output unit 703 includes a speaker, a buzzer, a receiver, etc.
  • the input unit 704 is used to receive audio or video signals.
  • the input unit 704 may include a graphics processor (GPU) 7041 and a microphone 7042, and the graphics processor 7041 processes the image data of a static picture or video obtained by an image capture device (such as a camera) in a video capture mode or an image capture mode.
  • the processed image frame can be displayed on the display unit 706.
  • the image frame processed by the graphics processor 7041 can be stored in the memory 709 (or other storage medium) or sent via the radio frequency unit 701 or the network module 702.
  • the microphone 7042 can receive sound and can process such sound into audio data.
  • in the case of a telephone call mode, the processed audio data can be converted into a format that can be output and sent to a mobile communication base station via the radio frequency unit 701.
  • the electronic device 700 also includes at least one sensor 705, such as a light sensor, a motion sensor, and other sensors.
  • the light sensor includes an ambient light sensor and a proximity sensor.
  • the ambient light sensor can adjust the brightness of the display panel 7061 according to the brightness of the ambient light
  • the proximity sensor can turn off the display panel 7061 and/or the backlight when the electronic device 700 is moved to the ear.
  • the accelerometer sensor can detect the magnitude of acceleration in all directions (generally three axes), and can detect the magnitude and direction of gravity when stationary, which can be used to identify the posture of the electronic device (such as horizontal/vertical screen switching, related games, magnetometer posture calibration) and vibration-recognition-related functions (such as pedometer, tapping); the sensor 705 can also include a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, etc., which will not be repeated here.
  • the display unit 706 is used to display information input by the user or information provided to the user.
  • the display unit 706 may include a display panel 7061, which may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like.
  • the user input unit 707 can be used to receive input digital or character information, and to generate key signal input related to user settings and function control of the electronic device.
  • the user input unit 707 includes a touch panel 7071 and other input devices 7072.
  • the touch panel 7071 also known as a touch screen, can collect the user's touch operation on or near it (such as the user's operation on the touch panel 7071 or near the touch panel 7071 using any suitable object or accessory such as a finger, stylus, etc.).
  • the touch panel 7071 may include two parts: a touch detection device and a touch controller.
  • the touch detection device detects the user's touch orientation, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into the contact point coordinates, and then sends it to the processor 710, receives the command sent by the processor 710 and executes it.
  • the touch panel 7071 can be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave types.
  • the user input unit 707 may also include other input devices 7072.
  • other input devices 7072 may include but are not limited to a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail here.
  • the touch panel 7071 may be covered on the display panel 7061.
  • when the touch panel 7071 detects a touch operation on or near it, the operation is transmitted to the processor 710 to determine the type of the touch event, and the processor 710 then provides a corresponding visual output on the display panel 7061 according to the type of the touch event.
  • although the touch panel 7071 and the display panel 7061 are described as two independent components implementing the input and output functions of the electronic device, in some embodiments the touch panel 7071 and the display panel 7061 may be integrated to implement the input and output functions of the electronic device, which is not limited herein.
  • the interface unit 708 is an interface for connecting an external device to the electronic device 700.
  • the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device with an identification module, an audio input/output (I/O) port, a video I/O port, a headphone port, etc.
  • the interface unit 708 may be used to receive input (e.g., data information, power, etc.) from an external device and transmit the received input to one or more elements within the electronic device 700 or may be used to transmit data between the electronic device 700 and an external device.
  • the memory 709 can be used to store software programs and various data.
  • the memory 709 can mainly include a program storage area and a data storage area, wherein the program storage area can store an operating system, an application required for at least one function (such as a sound playback function, an image playback function, etc.), etc.; the data storage area can store data created according to the use of the mobile phone (such as audio data, a phone book, etc.), etc.
  • the memory 709 can include a high-speed random access memory, and can also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage device.
  • the processor 710 is the control center of the electronic device. It uses various interfaces and lines to connect various parts of the entire electronic device. It executes various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 709 and calling data stored in the memory 709, thereby monitoring the electronic device as a whole.
  • the processor 710 may include one or more processing units; preferably, the processor 710 may integrate an application processor and a modem processor, wherein the application processor mainly processes the operating system, user interface, and application programs, and the modem processor mainly processes wireless communication. It is understandable that the above-mentioned modem processor may not be integrated into the processor 710.
  • the electronic device 700 may also include a power supply 711 (such as a battery) for supplying power to each component.
  • the power supply 711 may be logically connected to the processor 710 through a power management system, thereby implementing functions such as charging, discharging, and power consumption management through the power management system.
  • the electronic device 700 includes some functional modules not shown, which will not be described in detail here.
  • the technical solution of the present application can be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disk), and includes a number of instructions for a terminal (which can be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the methods of each embodiment of the present application.
  • the disclosed devices and methods can be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of units is only a division by logical function; there may be other division methods in actual implementation, for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed.
  • in addition, the mutual coupling, direct coupling, or communication connections shown or discussed may be implemented through interfaces, and the indirect coupling or communication connections between devices or units may be electrical, mechanical, or in other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • if the functions are implemented in the form of a software functional unit and sold or used as an independent product, they can be stored in a computer-readable storage medium.
  • the computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which can be a personal computer, server, or network device, etc.) to perform all or part of the steps of the various embodiments of the present application.
  • the aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.

Landscapes

  • Machine Translation (AREA)

Abstract

A text annotation method and apparatus, an electronic device, and a readable storage medium. The method includes: collecting an initial text data set; obtaining an initial labeled data set in response to an annotation instruction operation on the initial text data set; inputting the samples to be labeled in the initial labeled data set into a pre-trained language model that has been trained in advance, and outputting a first labeled data set; inputting the first labeled data set into a preset screening network, which partitions the first labeled data set into confidence data sets corresponding to different confidence levels; obtaining a mining difficult sample data set in response to an annotation instruction operation on the confidence data set below a preset confidence level; performing data enhancement on the mining difficult sample data set to obtain an enhanced data set; and mixing the initial labeled data set, the first labeled data set, the mining difficult sample data set, and the enhanced data set to obtain a target labeled data set.

Description

Text annotation method and apparatus, electronic device, and readable storage medium

Cross-Reference to Related Applications

This application claims priority to Chinese patent application No. 202211549045.5, entitled "Text annotation method and apparatus, electronic device, and readable storage medium" and filed with the China National Intellectual Property Administration on December 5, 2022, the entire contents of which are incorporated herein by reference.

Technical Field

Some embodiments of the present application relate to the field of artificial intelligence technology, and in particular to a text annotation method, a text annotation apparatus, an electronic device, and a non-volatile readable storage medium.

Background

In the prior art, large-scale pre-trained language models possess broad general knowledge and have obvious advantages in natural language dialogue, chit-chat, open-domain question answering, reading comprehension, and the like. However, such models also have shortcomings. On the one hand, they lack professional knowledge of different industry fields, which causes certain difficulties in actual projects; on the other hand, because the number of parameters of a large-scale pre-trained language model is huge, its inference time is relatively long and cannot meet high-frequency concurrency requirements.

Summary

Some embodiments of the present application provide a text annotation method, apparatus, electronic device, and non-volatile readable storage medium, to solve or partially solve the problems that current methods of annotating text files with pre-trained language models are technically deficient in the professional knowledge of different industry fields, inefficient, costly and time-consuming, and unable to meet high-frequency concurrency requirements.

Some embodiments of the present application disclose a text annotation method, the method including:

collecting an initial text data set;

obtaining an initial labeled data set in response to an annotation instruction operation on the initial text data set, where the initial labeled data set contains samples to be labeled;

inputting the samples to be labeled in the initial labeled data set into a pre-trained language model that has been trained in advance, and outputting a first labeled data set for the samples to be labeled;

inputting the first labeled data set into a preset screening network, and partitioning the first labeled data set into confidence data sets corresponding to different confidence levels;

obtaining a mining difficult sample data set in response to an annotation instruction operation on the confidence data set below a preset confidence level;

performing data enhancement on the mining difficult sample data set to obtain an enhanced data set;

mixing the initial labeled data set, the first labeled data set, the mining difficult sample data set, and the enhanced data set to obtain a target labeled data set of the samples to be labeled.
In some embodiments, inputting the samples to be labeled in the initial labeled data set into the pre-trained language model that has been trained in advance, and outputting the first labeled data set for the samples to be labeled, includes:

inputting a pre-constructed first Prompt of the pre-trained language model as data into the pre-trained language model to obtain the first labeled data set of the samples to be labeled.

In some embodiments, the first Prompt of the pre-trained language model consists of a labeling task, a first case, and a sample to be labeled.

In some embodiments, the first case is a case of performing the labeling task; the first case consists of a labeled text, a labeling task, and a named entity, and the labeled text contains the named entity; the sample to be labeled consists of a text to be labeled and a labeling task, and the text to be labeled contains a named entity.

In some embodiments, inputting the samples to be labeled in the initial labeled data set into the pre-trained language model that has been trained in advance, and outputting the first labeled data set for the samples to be labeled, includes:

inputting the first Prompt of the pre-trained language model as data into the pre-trained language model, and outputting, according to the first case, the named entity corresponding to the text to be labeled of the sample to be labeled in the first Prompt;

obtaining the label corresponding to the named entity, and annotating the named entity with the label;

forming data pairs from the labeled named entities and the text to be labeled, where a plurality of the data pairs constitute the first labeled data set of the samples to be labeled.

In some embodiments, the named entities include at least a symptom subject, a modifier of the symptom subject, a symptom description, a modifier of the symptom description, an examination item name, and an examination result, and the modifier of the symptom description includes at least property, frequency, time, condition, and degree.
In some embodiments, the preset screening network includes an input layer, an embedding layer, a long short-term memory layer, an attention layer, a conditional random field layer, and a classification network layer, where the embedding layer carries the confidence used to partition the first labeled data set into the confidence data sets corresponding to different confidence levels.

In some embodiments, the confidence data sets include a high-confidence data set, a medium-confidence data set, and a low-confidence data set.

In some embodiments, inputting the first labeled data set into the preset screening network and partitioning the first labeled data set into confidence data sets corresponding to different confidence levels includes:

if the first labeled data set falls in a high-confidence region preset by the embedding layer of the screening network, partitioning the first labeled data set into a high-confidence data set;

if the first labeled data set falls in a medium-confidence region preset by the embedding layer of the screening network, partitioning the first labeled data set into a medium-confidence data set;

if the first labeled data set falls in a low-confidence region preset by the embedding layer of the screening network, partitioning the first labeled data set into a low-confidence data set.
In some embodiments, performing data enhancement on the mining difficult sample data set to obtain the enhanced data set includes:

inputting all data in the mining difficult sample data set into a second Prompt of the pre-trained language model that has been trained in advance, and inputting the second Prompt of the pre-trained language model as data into the pre-trained language model to obtain the enhanced data set.

In some embodiments, in performing data enhancement on the mining difficult sample data set to obtain the enhanced data set, the method further includes:

inputting all data in the mining difficult sample data set into the second Prompt of the pre-trained language model that has been trained in advance, and inputting the second Prompt of the pre-trained language model as data into the pre-trained language model;

outputting similar sentences for all data in the mining difficult sample data set;

mixing the similar sentences with the mining difficult sample data set to obtain the enhanced data set.

In some embodiments, the second Prompt of the pre-trained language model consists of an enhancement task, a second case, and a sample to be enhanced.

In some embodiments, the second case is a case of performing the enhancement task, and the second case consists of a labeled text, an enhancement task, and a similar sentence.

In some embodiments, the mining difficult sample data set contains text to be enhanced, and performing data enhancement on the mining difficult sample data set to obtain the enhanced data set includes:

inputting the text to be enhanced in the mining difficult sample data set into the second Prompt of the pre-trained language model, and inputting the second Prompt of the pre-trained language model as data into the pre-trained language model;

outputting, according to the second case, the similar sentence corresponding to the text to be enhanced;

mixing the similar sentence with the mining difficult sample data set to obtain the enhanced data set.
In some embodiments, after performing data enhancement on the mining difficult sample data set to obtain the enhanced data set, the method further includes:

acquiring the enhanced data set multiple times to obtain multiple enhanced data sets.

In some embodiments, after mixing the initial labeled data set, the first labeled data set, the mining difficult sample data set, and the enhanced data set, the method further includes:

when the number of times the enhanced data set has been acquired is less than or equal to a preset number of executions, returning to the step of inputting the samples to be labeled in the initial labeled data set into the pre-trained language model that has been trained in advance and outputting the first labeled data set for the samples to be labeled.

In some embodiments, mixing the initial labeled data set, the first labeled data set, the mining difficult sample data set, and the enhanced data set to obtain the target labeled data set of the samples to be labeled includes:

mixing the initial labeled data set with the first labeled data sets, mining difficult sample data sets, and enhanced data sets acquired multiple times;

determining the number of times the enhanced data sets being mixed have been acquired, and outputting the target labeled data set of the samples to be labeled when the number of times the enhanced data set has been acquired is greater than the preset number of executions.
Some embodiments of the present application further disclose a text annotation apparatus, the apparatus including:

an initial text data set collection module, configured to collect an initial text data set;

an initial labeled data set acquisition module, configured to obtain an initial labeled data set in response to an annotation instruction operation on the initial text data set, where the initial labeled data set contains samples to be labeled;

a first labeled data set acquisition module, configured to input the samples to be labeled in the initial labeled data set into a pre-trained language model that has been trained in advance and output a first labeled data set for the samples to be labeled;

a confidence data set partitioning module, configured to input the first labeled data set into a preset screening network and partition the first labeled data set into confidence data sets corresponding to different confidence levels;

a mining difficult sample data set acquisition module, configured to obtain a mining difficult sample data set in response to an annotation instruction operation on the confidence data set below a preset confidence level;

an enhanced data set acquisition module, configured to perform data enhancement on the mining difficult sample data set to obtain an enhanced data set;

a target labeled data set acquisition module, configured to mix the initial labeled data set, the first labeled data set, the mining difficult sample data set, and the enhanced data set to obtain a target labeled data set of the samples to be labeled.
In some embodiments, the first labeled data set acquisition module is specifically configured to:

input the pre-constructed first Prompt of the pre-trained language model as data into the pre-trained language model to obtain the first labeled data set of the samples to be labeled.

In some embodiments, the first labeled data set acquisition module is specifically configured to:

input the first Prompt of the pre-trained language model as data into the pre-trained language model, and output, according to the first case, the named entity corresponding to the text to be labeled of the sample to be labeled in the first Prompt;

obtain the label corresponding to the named entity, and annotate the named entity with the label;

form data pairs from the labeled named entities and the text to be labeled, where a plurality of the data pairs constitute the first labeled data set of the samples to be labeled.

In some embodiments, the confidence data set partitioning module is specifically configured to:

partition the first labeled data set into a high-confidence data set if it falls in the high-confidence region preset by the embedding layer of the screening network;

partition the first labeled data set into a medium-confidence data set if it falls in the medium-confidence region preset by the embedding layer of the screening network;

partition the first labeled data set into a low-confidence data set if it falls in the low-confidence region preset by the embedding layer of the screening network.

In some embodiments, the enhanced data set acquisition module is specifically configured to:

input all data in the mining difficult sample data set into the second Prompt of the pre-trained language model that has been trained in advance, and input the second Prompt of the pre-trained language model as data into the pre-trained language model to obtain the enhanced data set.

In some embodiments, the enhanced data set acquisition module is further specifically configured to:

input all data in the mining difficult sample data set into the second Prompt of the pre-trained language model that has been trained in advance, and input the second Prompt of the pre-trained language model as data into the pre-trained language model;

output similar sentences for all data in the mining difficult sample data set;

mix the similar sentences with the mining difficult sample data set to obtain the enhanced data set.

In some embodiments, the mining difficult sample data set contains text to be enhanced, and the enhanced data set acquisition module is specifically configured to:

input the text to be enhanced in the mining difficult sample data set into the second Prompt of the pre-trained language model, and input the second Prompt of the pre-trained language model as data into the pre-trained language model;

output, according to the second case, the similar sentence corresponding to the text to be enhanced;

mix the similar sentence with the mining difficult sample data set to obtain the enhanced data set.

In some embodiments, the target labeled data set acquisition module is specifically configured to:

mix the initial labeled data set with the first labeled data sets, mining difficult sample data sets, and enhanced data sets acquired multiple times;

determine the number of times the enhanced data sets being mixed have been acquired, and output the target labeled data set of the samples to be labeled when the number of times the enhanced data set has been acquired is greater than the preset number of executions.
Some embodiments of the present application further disclose an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with each other via the communication bus;

the memory is configured to store a computer program;

the processor is configured to implement the method of some embodiments of the present application when executing the program stored in the memory.

Some embodiments of the present application further disclose a non-volatile readable storage medium having instructions stored thereon which, when executed by one or more processors, cause the processors to perform the method of some embodiments of the present application.
Some embodiments of the present application include the following advantages:

In some embodiments of the present application, an initial text data set is first collected; an initial labeled data set is then obtained in response to an annotation instruction operation on the initial text data set, where the initial labeled data set contains samples to be labeled; the samples to be labeled in the initial labeled data set are input into a pre-trained language model that has been trained in advance, and a first labeled data set for the samples to be labeled is output; the first labeled data set is input into a preset screening network and partitioned into confidence data sets corresponding to different confidence levels; a mining difficult sample data set is obtained in response to an annotation instruction operation on the confidence data set below a preset confidence level; data enhancement is performed on the mining difficult sample data set to obtain an enhanced data set; finally, the initial labeled data set, the first labeled data set, the mining difficult sample data set, and the enhanced data set are mixed to obtain the target labeled data set of the samples to be labeled. By using a pre-trained language model to automatically annotate text, the present application quickly produces annotation labels and greatly improves annotation efficiency, and the difficult samples in the data set are enhanced, which helps to improve the accuracy of downstream tasks.
Brief Description of the Drawings

FIG. 1 is a flowchart of the steps of a text annotation method provided in some embodiments of the present application;

FIG. 2 is a schematic diagram of the model structure of the pre-trained language model provided in some embodiments of the present application;

FIG. 3 is a schematic diagram of the screening network structure provided in some embodiments of the present application;

FIG. 4 is a flowchart of a medical text annotation method provided in some embodiments of the present application;

FIG. 5 is a structural block diagram of a text annotation apparatus provided in some embodiments of the present application;

FIG. 6 is a schematic structural diagram of a non-volatile readable storage medium provided in some embodiments of the present application;

FIG. 7 is a schematic diagram of the hardware structure of an electronic device implementing various embodiments of the present application.
Detailed Description of the Embodiments

To make the above objects, features, and advantages of the present application more obvious and understandable, the present application is described in further detail below with reference to the accompanying drawings and specific embodiments.

As an example, in the prior art, large-scale pre-trained language models possess broad general knowledge and have obvious advantages in natural language dialogue, chit-chat, open-domain question answering, reading comprehension, and the like. However, such models also have shortcomings. On the one hand, they lack professional knowledge of different industry fields, which causes certain difficulties in actual projects; on the other hand, because the number of parameters of a large-scale pre-trained language model is huge, its inference time is relatively long and cannot meet high-frequency concurrency requirements.

In view of this, one of the core inventive points of the present application is that, in the text annotation process, an initial text data set is first collected; an initial labeled data set is then obtained in response to an annotation instruction operation on the initial text data set, where the initial labeled data set contains samples to be labeled; the samples to be labeled in the initial labeled data set are input into a pre-trained language model that has been trained in advance, and a first labeled data set for the samples to be labeled is output; the first labeled data set is input into a preset screening network and partitioned into confidence data sets corresponding to different confidence levels; a mining difficult sample data set is obtained in response to an annotation instruction operation on the confidence data set below a preset confidence level; data enhancement is performed on the mining difficult sample data set to obtain an enhanced data set; finally, the initial labeled data set, the first labeled data set, the mining difficult sample data set, and the enhanced data set are mixed to obtain the target labeled data set of the samples to be labeled. By using a pre-trained language model to automatically annotate text, the present application quickly produces annotation labels and greatly improves annotation efficiency, and the difficult samples in the data set are enhanced, which helps to improve the accuracy of downstream tasks.
Referring to FIG. 1, which shows a flowchart of the steps of a text annotation method provided in some embodiments of the present application, the method may specifically include the following steps:

Step 101: collecting an initial text data set;

In some embodiments of the present application, the collected initial text data set is a medical text data set, and some embodiments of the present application are described with respect to a medical text data set. In some embodiments, those skilled in the art may set the text annotation object according to actual needs, which is not limited in the embodiments of the present application.

Step 102: obtaining an initial labeled data set in response to an annotation instruction operation on the initial text data set, where the initial labeled data set contains samples to be labeled;

The annotation instruction operation is an action of manual annotation; the initial labeled data set is the result of manually annotating the medical text data set.
Manual annotation means that the named entities in the medical text are manually identified and labeled. Some embodiments of the present application use ten classes of named entities for annotation, where the named entities may include the symptom subject, modifiers of the symptom subject, the symptom description, modifiers of the symptom description, examination item names, and examination results, and the modifiers of the symptom description include at least property, frequency, time, condition, and degree. The named entity classes and the labels used to annotate them are as follows, see Table 1:

Table 1

Named entity class                        Label
Symptom subject                           [SUB]
Modifier of the symptom subject           [DS]
Symptom description                       [DES]
Property modifier of the description      [DDP]
Frequency modifier of the description     [DDF]
Time modifier of the description          [DDT]
Condition modifier of the description     [DDC]
Degree modifier of the description        [DDE]
Examination item name                     [EXM]
Examination result                        [RES]
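Purely as an illustration of the tagging convention in Table 1 (the Python identifiers below are hypothetical and not part of the patented embodiment), the ten entity classes and their symmetric left/right tags can be captured as:

    # Hypothetical mapping from the ten entity classes of Table 1 to their tags.
    ENTITY_LABELS = {
        "symptom_subject": "SUB",
        "symptom_subject_modifier": "DS",
        "symptom_description": "DES",
        "property_modifier": "DDP",
        "frequency_modifier": "DDF",
        "time_modifier": "DDT",
        "condition_modifier": "DDC",
        "degree_modifier": "DDE",
        "examination_item": "EXM",
        "examination_result": "RES",
    }

    def tag(entity_text, label):
        """Wrap an entity with symmetric left/right tags, per the specification."""
        return f"[{label}]{entity_text}[{label}]"

    # e.g. tag("abdominal pain", ENTITY_LABELS["symptom_description"])
    # -> "[DES]abdominal pain[DES]"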
It should be noted that the principles for manually annotating named entities are: named entities do not cross, overlap, or nest with each other and do not contain punctuation; the corresponding annotation symbol is placed as a tag on both the left and right sides of the named entity to be annotated, and the tags on the left and right sides must be symmetric.

A named entity can be a person name, an organization name, a place name, or any other entity identified by a name; broader entities also include numbers, dates, currencies, addresses, and so on. Some embodiments of the present application involve biomedical named entities. In the biomedical field, important named entities include gene names, protein names, protein structure attribute names, compound names, drug names, disease names, etc., among which gene names and protein names are the most important. In some embodiments of the present application, ten classes of named entities are annotated; the named entities include at least the symptom subject, modifiers of the symptom subject, the symptom description, modifiers of the symptom description, examination item names, and examination results, where the modifiers of the symptom description include property, frequency, time, condition, and degree.
The annotation specification for named entities in some embodiments of the present application (which applies both to manual annotation and to automatic annotation by the pre-trained language model) is as follows:

(1) Symptom subject

A "symptom subject" includes a body part, a secretion, an excretion, or a normal physiological activity of the human body, such as the head, chest, abdomen, arms, legs, stool, urine, sputum, etc., and is annotated with "[SUB]". For example:

[SUB]thigh[SUB]; [SUB]calf[SUB]; [SUB]forearm[SUB]; [SUB]back of the head[SUB]; [SUB]both hands[SUB]; [SUB]both legs[SUB]; [SUB]both eyes[SUB]; [SUB]one lung[SUB]; poor [SUB]sleep[SUB]; unable to [SUB]conceive[SUB].
(2) Modifiers of the symptom subject

Modifying components are marked with "[DS]" and fall into three classes:

1) The symptom subject can be modified by locative words, nouns, adjectives, adverbs, numerals, measure words, and particles such as 的, 地, and 得, and the modifier can appear before or after the symptom subject. For example:

the [DS]left[DS] [SUB]hand[SUB]; the [DS]inside[DS] of the [SUB]chest cavity[SUB]; the [DS]right[DS] [SUB]lung[SUB]; the [DS]right[DS] [SUB]chest cavity[SUB]; the place [DS]slightly below[DS] the [SUB]neck[SUB]; the [SUB]chest cavity[SUB] [DS]toward the left side[DS].

2) When a body part is modified by another body part, the modifying word is annotated as a modifier of the symptom subject and the head word is annotated as the symptom subject. For example:

the [DS]metacarpal and phalangeal[DS] [SUB]epiphyseal line[SUB]; the [DS]right-hand[DS] [SUB]fingernail[SUB]; the [SUB]skin[SUB] of the [DS]hand[DS]; the [DS]intraoral[DS] [SUB]tonsils[SUB].

3) Locative words, which indicate direction, position, and the like:

a. monosyllabic words, for example: 前 (front), 后 (back), 里 (inside), 外 (outside), 内 (within), 北 (north), 东 (east), 边 (edge), 侧 (side), 底 (bottom), 间 (between), 末 (end), 旁 (beside), etc.;

b. disyllabic words, for example: 之间 (between), 以北 (to the north of), etc.;

c. two monosyllabic words used together, for example: 前后 (front and back), 左右 (left and right), 上下 (up and down), 东北 (northeast), etc.;

d. the special case 上 among locative words, which can act as a locative word and can also mean "in" or "among". For example:

[SUB]chest[SUB][DS]inside[DS]; [SUB]chest cavity[SUB][DS]outside[DS]; [SUB]abdomen[SUB][DS]inside[DS]; [SUB]arm[SUB][DS]under[DS]; [SUB]throat[SUB][DS]inside[DS].
(3) Symptom description

A symptom description refers to symptoms and abnormal signs stated by the patient, including phrases modified by adjectives, adverbs, verbs, nouns, etc. Some symptoms do not specify a body part and can stand alone; the default body part is then the whole body, and they can be annotated directly, marked with "[DES]". For example:

[DES]fever[DES]; [DES]cough[DES]; [DES]fever[DES]; [DES]shock[DES]; [DES]jaundice[DES]; [DES]anemia[DES]; [DES]nausea[DES]; [DES]vomiting[DES].
(4) Modifiers of the symptom description

The modifying components include five items: property, frequency, time, condition, and degree.

1) Property modifiers, for example: dull pain, sharp pain, stabbing pain, drilling pain, etc. When a property modifier modifies a symptom description, it is annotated with "[DDP]"; when the property modifier and the symptom description are fused together, the property modifier is annotated separately.

2) Frequency modifiers, for example: occasionally, intermittently, from time to time, repeatedly, often, etc. When modifying a symptom description, they are annotated with "[DDF]". For example:

[DDF]intermittent[DDF] [DES]abdominal pain[DES]; [DDF]occasional[DDF] [DES]dizziness[DES]; [DES]chest tightness[DES] [DDF]from time to time[DDF]; [DDF]frequent[DDF] [DES]constipation[DES].

3) Time modifiers, annotated with "[DDT]". For example:

[DDT]nighttime[DDT] [DES]cough[DES]; [DES]joint stiffness[DES] [DDT]in the morning[DDT]; [DES]frequent urination[DES] [DDT]at night[DDT]; during [DDT]last year[DDT] [DDT]pregnancy[DDT].

4) Condition modifiers. Pure time expressions, such as early morning, evening, and afternoon, are time modifiers; expressions in which time and condition are mixed, such as after dinner in the evening, before breakfast, or when lying flat at night, are uniformly treated as condition modifiers and annotated with "[DDC]". If a condition modifier and a time modifier appear together, they are annotated separately. For example:

[DES]sweating profusely[DES] after only slight [DDC]exercise[DDC];

[DES]chest tightness[DES] and [DES]shortness of breath[DES] upon [DDC]activity[DDC];

[DES]difficulty breathing[DES] when [DDC]lying flat[DDC] [DDT]at night[DDT];

[DES]shortness of breath[DES] when [DDC]climbing stairs[DDC];

[DDC]lying flat[DDC] [DDT]at night[DDT];

when [DDC]getting up[DDC] [DDT]in the morning[DDT].

5) Degree modifiers, which indicate the degree to which a symptom occurs, annotated with "[DDE]". For example:

the [SUB]abdomen[SUB] is [DDE]very[DDE] [DES]painful[DES];

[DDE]a little[DDE] [DES]dizzy[DES];

[SUB]breathing[SUB] is [DDE]very[DDE] [DES]difficult[DES];

the [DES]bleeding[DES] is [DDE]relatively heavy[DDE];

[DDE]slightly[DDE] [DES]bleeding[DES];

the child's [DES]hernia[DES] is [DS]unilateral[DS] and [DDE]quite large[DDE];

the [SUB]lump[SUB] is [DES]growing[DES] [DDE]relatively fast[DDE].
(5) Examination item name

Examination item names cover laboratory tests and auxiliary examinations, such as blood pressure, body weight, heart rate, height, white blood cells, red blood cells, hemoglobin, urine test, blood test, CT, MRI, B-mode ultrasound, color ultrasound, nuclear magnetic resonance, etc. For example:

[SUB]heart[SUB] [EXM]color ultrasound[EXM];

[SUB]abdominal[SUB] [EXM]B-mode ultrasound[EXM];

[SUB]head[SUB] [EXM]CT[EXM].
(6) Examination result

The results of medical examinations are annotated with the label "[RES]". For example:

[EXM]white blood cells[EXM] [RES]high[RES];

after a [EXM]color ultrasound[EXM], [RES]right kidney size 33mm*15mm[RES] and [RES]punctate blood-flow signals seen inside[RES] were found;

at [DDT]27W[DDT] of [DES]pregnancy[DES], a [EXM]B-mode ultrasound[EXM] found [RES]an abnormally shaped gastric bubble presenting a double-bubble sign[RES]; the [EXM]B-mode ultrasound[EXM] suggested attention to [RES]duodenal stenosis[RES] or [RES]atresia[RES], [RES]amniotic fluid dark area 76MM[RES], [RES]amniotic fluid index 241mm[RES];

if the description of the condition in a congenital heart disease record (onset time, main symptoms, hospital visited, etc.) is undetermined, the annotation specification is as follows:

[EXM]ultrasound[EXM] findings: [RES]two-dimensional M-mode of the heart[RES]. [EXM]spectral Doppler flow measurements[EXM]: [RES]aorta[RES]: [RES]sinus 16MM[RES], [RES]trunk 13MM[RES], [RES]aortic arch 9MM[RES]; [RES]left atrium[RES]: [RES]16MM[RES]; [RES]left ventricle 20MM[RES] ([RES]long axis[RES]); [RES]interventricular septum[RES]: [RES]6MM[RES];

[RES]thickening of the cortical bone of the left tibia with periosteal reaction for more than three years[RES]; re-examination showed that [RES]the cortical thickness is markedly thinner than before[RES], [RES]the medullary cavity is roughly the same as on the right side[RES], [RES]the middle of the shaft is slightly bent and deformed forward[RES], and [RES]the remaining findings are unremarkable[RES];

[EXM]urine test[EXM] [RES]occult blood 1+[RES];

[EXM]protein[EXM] in the [SUB]urine[SUB];

[EXM]blood pressure[EXM]: [RES]170/100mmHg[RES];

[EXM]body temperature[EXM]: [RES]38.5°C[RES];

[EXM]respiration[EXM]: [RES]40 breaths/min[RES];

[EXM]body weight[EXM]: [RES]150kg[RES].
The above is the annotation specification for named entities in some embodiments of the present application, and all annotation of medical text in these embodiments follows this specification. It should be noted that those skilled in the art may adjust the annotation specification for named entities according to actual needs, which is not limited in the present application.

In some embodiments of the present application, an initial medical text data set is collected, and an initial labeled data set is obtained in response to an annotation instruction operation on the initial medical text data set, that is, through manual annotation, where the initial labeled data set includes initial labeled samples and initial labels.
Step 103: inputting the samples to be labeled in the initial labeled data set into the pre-trained language model that has been trained in advance, and outputting the first labeled data set for the samples to be labeled;

The pre-trained language model can intelligently annotate text files; using a pre-trained language model for automatic text annotation saves the time, workload, and labor cost of manual text annotation.

It should be noted that some embodiments of the present application perform automatic annotation on text in the medical field, and the pre-trained language model adopted is an existing model. Referring to FIG. 2, which shows a schematic diagram of the model structure of the pre-trained language model provided in some embodiments of the present application, for an input text sequence (x1, x2, ..., xn), the probability that the language model outputs the sequence (y1, y2, ..., yn) is:

P(y1, ..., yn | x1, ..., xn) = ∏_{t=1}^{n} p(y_t | x1, ..., xn, y1, ..., y_{t-1})

Since the above pre-trained language model belongs to the existing art, the training method of the pre-trained language model is not described in detail here.

A sample to be labeled includes a text to be labeled and a labeling task. The text to be labeled is the input text information, for example, "Since yesterday morning I have had intermittent abdominal pain"; the text to be labeled contains named entities, such as "abdominal pain" and "intermittent" in the text.

The labeling task describes, in natural language, the annotation task to be performed, such as labeling symptoms or frequency, for example, "Point out the symptoms in the input sentence". The first labeled data set is the data set output for the samples to be labeled after the first pass of annotation by the pre-trained language model.

In some embodiments of the present application, the samples to be labeled in the initial labeled data set are input into the pre-trained language model that has been trained in advance, and the pre-trained language model then outputs the first labeled data set for the samples to be labeled.
Step 104: inputting the first labeled data set into the preset screening network, and partitioning the first labeled data set into confidence data sets corresponding to different confidence levels;

As for the screening network, referring to FIG. 3, which shows a schematic diagram of the screening network structure provided in some embodiments of the present application, the screening network includes an InputLayer (input layer), an Embedding (embedding) layer, an LSTM (long short-term memory) layer, an Attention Layer, a CRF (conditional random field) layer, and a Softmax (classification network) layer; that is, a BiLSTM + attention mechanism + CRF neural network architecture. As a named entity recognition network that has been verified to work well in fixed scenarios, this structure can extract deep features from the input-layer tensors while reducing the number of neurons, increasing recognition accuracy, and shortening training time; when used on an inference platform, it has the fast-response advantage that huge models lack.
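As an illustrative sketch only (not the patented implementation), the BiLSTM + attention + CRF screening network described above could be assembled along these lines in PyTorch; the layer sizes, the additive self-attention form, and the confidence read-out are all assumptions, and a CRF layer (e.g., from the third-party pytorch-crf package) would normally be stacked on top of the emissions for structured decoding:

    import torch
    import torch.nn as nn

    class ScreeningNetwork(nn.Module):
        """Minimal sketch of the BiLSTM + attention screening network.

        Emits per-token tag scores; a CRF layer would be added on top
        of `emissions` for structured decoding in a full implementation."""

        def __init__(self, vocab_size, num_tags, embed_dim=128, hidden_dim=256):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
            self.bilstm = nn.LSTM(embed_dim, hidden_dim // 2, batch_first=True,
                                  bidirectional=True)
            # Simple additive self-attention over the BiLSTM states.
            self.attn_score = nn.Linear(hidden_dim, 1)
            self.emission = nn.Linear(hidden_dim, num_tags)

        def forward(self, token_ids):                 # (batch, seq_len)
            h = self.embedding(token_ids)             # (batch, seq_len, embed_dim)
            h, _ = self.bilstm(h)                     # (batch, seq_len, hidden_dim)
            weights = torch.softmax(self.attn_score(h), dim=1)
            context = (weights * h).sum(dim=1, keepdim=True)
            h = h + context                           # inject sentence-level context
            emissions = self.emission(h)              # (batch, seq_len, num_tags)
            # Per-token confidence: maximum softmax probability over tags.
            confidence = torch.softmax(emissions, dim=-1).max(dim=-1).values
            return emissions, confidence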
As for confidence, in statistics the confidence interval of a probability sample is an interval estimate of some population parameter of the sample; it expresses the degree to which the true value of the parameter has a certain probability of falling around the measured result, and gives the credible range of the measured value of the parameter, i.e., the "certain probability" mentioned above, which is called the confidence level. Specifically, each named entity has a corresponding confidence. The first labeled data set can be input into the preset screening network, which identifies and infers the confidence of the named entities; the first labeled data set is then divided into batches according to the confidence of the embedding-layer results of the screening network, and can be partitioned into a high-confidence data set, a medium-confidence data set, and a low-confidence data set according to the confidence intervals.

In one example, suppose the credible range of high confidence is (0.8, 1], that of medium confidence is (0.5, 0.8], and that of low confidence is (0, 0.5]. If the confidence of a named entity is 0.4, the named entity is assigned to the data of the low-confidence range. Since the first labeled data set contains named entities, the confidence batches are divided according to the total confidence result of the corresponding named entities in the first labeled data set: if the first labeled data set falls in the high-confidence range (0.8, 1], it is partitioned into the high-confidence data set; likewise, if it falls in the medium-confidence range (0.5, 0.8], it is partitioned into the medium-confidence data set; and if it falls in the low-confidence range (0, 0.5], it is partitioned into the low-confidence data set.
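For concreteness, the partitioning rule of this example can be written as a few lines of Python; the thresholds 0.5 and 0.8 come from the example above, while taking the mean entity confidence as the data-set-level score is an assumption:

    def partition_by_confidence(samples, low=0.5, high=0.8):
        """Split labeled samples into low/medium/high confidence batches.

        `samples` is assumed to be an iterable of (sample, entity_confidences)
        pairs; the overall score is taken as the mean entity confidence.
        """
        batches = {"low": [], "medium": [], "high": []}
        for sample, confidences in samples:
            score = sum(confidences) / len(confidences)
            if score <= low:              # (0, 0.5]
                batches["low"].append(sample)
            elif score <= high:           # (0.5, 0.8]
                batches["medium"].append(sample)
            else:                         # (0.8, 1]
                batches["high"].append(sample)
        return batches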
In some embodiments of the present application, after the first labeled data set is obtained, it is input into the preset screening network, and the confidence of the embedding-layer results of the screening network partitions the first labeled data set into confidence data sets corresponding to different confidence levels according to the corresponding credible ranges, where the confidence data sets include a high-confidence data set, a medium-confidence data set, and a low-confidence data set.
Step 105: obtaining the mining difficult sample data set in response to the annotation instruction operation on the confidence data set below the preset confidence level;

The annotation instruction operation is a manual annotation process; the mining difficult sample data set is formed by selecting, as samples, the data in the first labeled data set that correspond to low confidence.

In one example, suppose the credible range of low confidence is (0, 0.5]; the low-confidence data set assigned to this range is annotated, and the annotated data set is taken as the mining difficult sample data set.

In some embodiments of the present application, after the samples to be labeled are input into the pre-trained language model and the first labeled data set is obtained, the first labeled data set is input into the preset screening network, the confidence of the embedding-layer results of the screening network partitions the first labeled data set into confidence data sets corresponding to different confidence levels according to the corresponding credible ranges, and the mining difficult sample data set is then obtained in response to the annotation instruction operation on the confidence data set below the preset confidence level.
Step 106: performing data enhancement on the mining difficult sample data set to obtain the enhanced data set;

The data enhancement is a preset data enhancement method; the enhanced data set is the data set obtained by performing data enhancement on the mining difficult sample data set.

In some embodiments of the present application, after the samples to be labeled are input into the pre-trained language model and the first labeled data set for the samples to be labeled is obtained, the first labeled data set is input into the preset screening network, the confidence of the embedding-layer results partitions the first labeled data set into confidence data sets corresponding to different confidence levels according to the corresponding credible ranges, the mining difficult sample data set is obtained in response to the annotation instruction operation on the confidence data set below the preset confidence level, and data enhancement is then performed on all data in the mining difficult sample data set to obtain the enhanced data set.
Step 107: mixing the initial labeled data set, the first labeled data set, the mining difficult sample data set, and the enhanced data set to obtain the target labeled data set of the samples to be labeled.

The target labeled data set is the target annotation result obtained for the samples to be labeled; it is the result of mixing the initial labeled data set with the first labeled data sets, mining difficult sample data sets, and enhanced data sets obtained over multiple rounds.

In a specific implementation, the initial labeled data set is mixed with the first labeled data set, mining difficult sample data set, and enhanced data set obtained in each round, yielding the target labeled data set of the samples to be labeled.
In some embodiments of the present application, an initial text data set is first collected; an initial labeled data set is then obtained in response to an annotation instruction operation on the initial text data set, where the initial labeled data set contains samples to be labeled; the samples to be labeled in the initial labeled data set are input into a pre-trained language model that has been trained in advance, and a first labeled data set for the samples to be labeled is output; the first labeled data set is input into a preset screening network and partitioned into confidence data sets corresponding to different confidence levels; a mining difficult sample data set is obtained in response to an annotation instruction operation on the confidence data set below a preset confidence level; data enhancement is performed on the mining difficult sample data set to obtain an enhanced data set; finally, the initial labeled data set, the first labeled data set, the mining difficult sample data set, and the enhanced data set are mixed to obtain the target labeled data set of the samples to be labeled. By using a pre-trained language model to automatically annotate text, the present application quickly produces annotation labels and greatly improves annotation efficiency, and the difficult samples in the data set are enhanced, which helps to improve the accuracy of downstream tasks.
In some embodiments, step 103, inputting the samples to be labeled in the initial labeled data set into the pre-trained language model that has been trained in advance and outputting the first labeled data set for the samples to be labeled, includes:

inputting the pre-constructed first Prompt of the pre-trained language model as data into the pre-trained language model to obtain the first labeled data set of the samples to be labeled.

The first Prompt of the pre-trained language model can contain the samples to be labeled. The first Prompt is constructed in advance and can be input as data into the pre-trained language model; the pre-trained language model executes the first Prompt and then outputs the label results of the samples to be labeled in the first Prompt.

In a specific implementation, after the first Prompt of the pre-trained language model is constructed, it is input as data into the pre-trained language model; the pre-trained language model executes the first Prompt and outputs the label results of the samples to be labeled in the first Prompt, thereby obtaining the first labeled data set of the samples to be labeled.
In some embodiments, step 103, inputting the samples to be labeled in the initial labeled data set into the pre-trained language model that has been trained in advance and outputting the first labeled data set for the samples to be labeled, includes:

inputting the first Prompt of the pre-trained language model as data into the pre-trained language model, and outputting, according to the first case, the named entity corresponding to the text to be labeled of the sample to be labeled in the first Prompt;

obtaining the label corresponding to the named entity, and annotating the named entity with the label;

forming data pairs from the labeled named entities and the text to be labeled, where a plurality of the data pairs constitute the first labeled data set of the samples to be labeled.

The first Prompt of the pre-trained language model consists of a labeling task, a first case, and a sample to be labeled. The first case is a case of performing the labeling task, for example, "Input: Since yesterday morning I have had intermittent abdominal pain. Symptom: abdominal pain."; the first case consists of a labeled text, a labeling task, and a named entity, and the labeled text contains the named entity.

The labeling task describes, in natural language, the annotation task to be performed, such as labeling symptoms or frequency, for example, "Point out the symptoms in the input sentence". The first labeled data set is the data set output for the samples to be labeled after the first pass of annotation by the pre-trained language model.

A sample to be labeled consists of a text to be labeled and a labeling task, and the text to be labeled contains named entities. The text to be labeled is the input text information, for example, "Since yesterday morning I have had intermittent abdominal pain", which contains named entities such as "abdominal pain" and "intermittent".

A named entity can be a person name, an organization name, a place name, or any other entity identified by a name; broader entities also include numbers, dates, currencies, addresses, and so on. Some embodiments of the present application involve biomedical named entities; in the biomedical field, important named entities include gene names, protein names, protein structure attribute names, compound names, drug names, disease names, etc., among which gene names and protein names are the most important. In some embodiments of the present application, ten classes of named entities are annotated, including at least the symptom subject, modifiers of the symptom subject, the symptom description, modifiers of the symptom description, examination item names, and examination results, where the modifiers of the symptom description include property, frequency, time, condition, and degree.
In one example, the execution structure of the first Prompt of the pre-trained language model used to annotate named entities is as follows:

"Point out the symptoms in the input sentence.

Input: Since yesterday morning I have had intermittent abdominal pain. Symptom: abdominal pain.

Input: In the past few months I have had chest tightness from time to time. Symptom: chest tightness.

Input: The child has a cold and keeps coughing at night. Symptom: cough.

Input: The child's hernia is quite large on one side. Symptom:"

In this example, "Point out the symptoms in the input sentence" is the labeling task mentioned above, which describes in natural language the annotation task to be performed, such as labeling symptoms or frequency. The three samples "Input: Since yesterday morning I have had intermittent abdominal pain. Symptom: abdominal pain.", "Input: In the past few months I have had chest tightness from time to time. Symptom: chest tightness.", and "Input: The child has a cold and keeps coughing at night. Symptom: cough." are cases of performing the labeling task, expressed in the form "Input: + labeled text + labeling task + named entity", where the named entity in a case can be regarded as the output label result; that is, the named entity in the input text is obtained together with its corresponding label, the named entity in the medical text is tagged with the corresponding label, and the annotation result of this round is thereby obtained. The sample to be labeled is "Input: The child's hernia is quite large on one side. Symptom:" in the example, expressed in the form "Input: + text to be labeled + labeling task", where "Symptom:" in the sample to be labeled is to be completed by the pre-trained language model to obtain the annotation result.
In some embodiments of the present application, the first Prompt of the pre-trained language model is input as data into the pre-trained language model, which executes the first Prompt; according to the first case in the first Prompt, the named entity in the text to be labeled is output; the label corresponding to the named entity is then obtained and the named entity is annotated with the obtained label; the labeled named entities and the text to be labeled form data pairs, and a plurality of such data pairs constitute the first labeled data set of the samples to be labeled.
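A minimal sketch of how such a first Prompt could be assembled from labeled cases and a text to be labeled follows; the function name, field strings, and separators are illustrative assumptions, not the patented format:

    def build_first_prompt(task, cases, text_to_label):
        """Assemble the first Prompt: task description + labeled cases + query.

        `cases` is a list of (labeled_text, entity) pairs, mirroring the
        "Input: ... Symptom: ..." few-shot format of the example above.
        """
        lines = [task]
        for labeled_text, entity in cases:
            lines.append(f"Input: {labeled_text} Symptom: {entity}.")
        lines.append(f"Input: {text_to_label} Symptom:")  # the model completes this
        return "\n".join(lines)

    prompt = build_first_prompt(
        "Point out the symptoms in the input sentence.",
        [("Since yesterday morning I have had intermittent abdominal pain.",
          "abdominal pain")],
        "The child's hernia is quite large on one side.",
    )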
In some embodiments, step 104, inputting the first labeled data set into the preset screening network and partitioning the first labeled data set into confidence data sets corresponding to different confidence levels, includes:

if the first labeled data set falls in the high-confidence region preset by the embedding layer of the screening network, partitioning the first labeled data set into the high-confidence data set;

if the first labeled data set falls in the medium-confidence region preset by the embedding layer of the screening network, partitioning the first labeled data set into the medium-confidence data set;

if the first labeled data set falls in the low-confidence region preset by the embedding layer of the screening network, partitioning the first labeled data set into the low-confidence data set.

As for the screening network, referring to FIG. 3, which shows a schematic diagram of the screening network structure provided in some embodiments of the present application, the screening network includes an InputLayer (input layer), an Embedding (embedding) layer, an LSTM (long short-term memory) layer, an Attention Layer, a CRF (conditional random field) layer, and a Softmax (classification network) layer; that is, a BiLSTM + attention mechanism + CRF neural network architecture. As a named entity recognition network that has been verified to work well in fixed scenarios, this structure can extract deep features from the input-layer tensors while reducing the number of neurons, increasing recognition accuracy, and shortening training time; when used on an inference platform, it has the fast-response advantage that huge models lack.

As for confidence, in statistics the confidence interval of a probability sample is an interval estimate of some population parameter of the sample; it expresses the degree to which the true value of the parameter has a certain probability of falling around the measured result, and gives the credible range of the measured value of the parameter, i.e., the "certain probability" mentioned above, which is called the confidence level. Specifically, each named entity has a corresponding confidence; the first labeled data set can be input into the preset screening network, which identifies and infers the confidence of the named entities, and the first labeled data set is then divided into batches according to the confidence of the embedding-layer results of the screening network, where it can be partitioned into a high-confidence data set, a medium-confidence data set, and a low-confidence data set according to the confidence intervals.

In one example, suppose the credible range of high confidence is (0.8, 1], that of medium confidence is (0.5, 0.8], and that of low confidence is (0, 0.5]. If the confidence of a named entity is 0.4, the named entity is assigned to the data of the low-confidence region. Since the first labeled data set contains named entities, the confidence batches are divided according to the total confidence result of the corresponding named entities in the first labeled data set: if the first labeled data set falls in the high-confidence range (0.8, 1], it is partitioned into the high-confidence data set; likewise, if it falls in the medium-confidence range (0.5, 0.8], it is partitioned into the medium-confidence data set; and if it falls in the low-confidence range (0, 0.5], it is partitioned into the low-confidence data set.

In some embodiments of the present application, if the first labeled data set falls in the high-confidence region preset by the embedding layer of the screening network, it is partitioned into the high-confidence data set; if it falls in the medium-confidence region preset by the embedding layer, it is partitioned into the medium-confidence data set; and if it falls in the low-confidence region preset by the embedding layer, it is partitioned into the low-confidence data set.
In some embodiments, step 106, performing data enhancement on the mining difficult sample data set to obtain the enhanced data set, includes:

inputting all data in the mining difficult sample data set into the second Prompt of the pre-trained language model that has been trained in advance, and inputting the second Prompt of the pre-trained language model as data into the pre-trained language model to obtain the enhanced data set.

The second Prompt of the pre-trained language model can be used to perform data enhancement on the mining difficult sample data set.

In a specific implementation, all data in the mining difficult sample data set are input into the second Prompt of the pre-trained language model that has been trained in advance, the second Prompt is input as data into the pre-trained language model, and the pre-trained language model executes the second Prompt to obtain the enhanced data set.
In some embodiments, the mining difficult sample data set contains text to be enhanced, and step 106, performing data enhancement on the mining difficult sample data set to obtain the enhanced data set, further includes:

inputting all data in the mining difficult sample data set into the second Prompt of the pre-trained language model that has been trained in advance, and inputting the second Prompt of the pre-trained language model as data into the pre-trained language model;

outputting similar sentences for all data in the mining difficult sample data set;

mixing the similar sentences with the mining difficult sample data set to obtain the enhanced data set.

The second Prompt of the pre-trained language model can be used to perform data enhancement on the mining difficult sample data set; it is mainly used to generate similar sentences for all data in the data set. The second Prompt consists of an enhancement task, a second case, and a sample to be enhanced, where the second case is a case of performing the enhancement task and consists of a labeled text, an enhancement task, and a similar sentence, for example, "Input: Since yesterday morning I have had intermittent abdominal pain. Similar sentence: Since yesterday morning I have had abdominal pain every once in a while.".

In a specific implementation, all data in the mining difficult sample data set are input into the second Prompt of the pre-trained language model that has been trained in advance, the second Prompt is input as data into the pre-trained language model, the pre-trained language model executes the second Prompt and outputs similar sentences for all data in the mining difficult sample data set, and the similar sentences are mixed with the mining difficult sample data set to obtain the enhanced data set.

It should be noted that, besides the similar-sentence generation adopted in some embodiments of the present application, those skilled in the art may select other data enhancement methods for the mining difficult sample data set according to actual needs, which is not limited in the embodiments of the present application.
In some embodiments, the mining difficult sample data set contains text to be enhanced, and step 106, performing data enhancement on the mining difficult sample data set to obtain the enhanced data set, includes:

inputting the text to be enhanced in the mining difficult sample data set into the second Prompt of the pre-trained language model, and inputting the second Prompt of the pre-trained language model as data into the pre-trained language model;

outputting, according to the second case, the similar sentence corresponding to the text to be enhanced;

mixing the similar sentence with the mining difficult sample data set to obtain the enhanced data set.

The second Prompt of the pre-trained language model is a means of performing data enhancement on the mining difficult sample data set. The second Prompt consists of an enhancement task, a second case, and a sample to be enhanced, where the second case is a case of performing the enhancement task and consists of a labeled text, an enhancement task, and a similar sentence, for example, "Input: Since yesterday morning I have had intermittent abdominal pain. Similar sentence: Since yesterday morning I have had abdominal pain every once in a while.".

The enhancement task describes, in natural language, the enhancement task to be performed, for example, "Give a similar sentence for the following sentence"; the sample to be enhanced consists of a text to be enhanced and an enhancement task, for example, "Input: The child's hernia is quite large on one side. Similar sentence:".
In one example, the execution structure of the second Prompt of the pre-trained language model used to produce data enhancement samples is as follows:

"Give a similar sentence for the following sentence.

Input: Since yesterday morning I have had intermittent abdominal pain.

Similar sentence: Since yesterday morning I have had abdominal pain every once in a while.

Input: In the past few months I have had chest tightness from time to time.

Similar sentence: In the past few months I have often had intermittent chest tightness.

Input: The child has a cold and keeps coughing at night.

Similar sentence: The child has a cold and keeps coughing in the evening.

Input: The child's hernia is quite large on one side.

Similar sentence:"

In this example, "Give a similar sentence for the following sentence." is the enhancement task mentioned above, which describes in natural language the enhancement task to be performed. "Input: Since yesterday morning I have had intermittent abdominal pain." and "Similar sentence: Since yesterday morning I have had abdominal pain every once in a while." together form one case; likewise, the two samples "Input: In the past few months I have had chest tightness from time to time. Similar sentence: In the past few months I have often had intermittent chest tightness." and "Input: The child has a cold and keeps coughing at night. Similar sentence: The child has a cold and keeps coughing in the evening." are also cases of performing the enhancement task, expressed as input text information plus an output similar sentence. The sample to be enhanced is "Input: The child's hernia is quite large on one side. Similar sentence:" in the example, expressed as input text information plus the enhancement task, where "Similar sentence:" in the sample to be enhanced is to be completed by the pre-trained language model to obtain the similar sentence.
In some embodiments of the present application, the text to be enhanced in the mining difficult sample data set is input into the second Prompt of the pre-trained language model, the second Prompt is input as data into the pre-trained language model, the pre-trained language model executes the second Prompt and outputs, according to the second case in the second Prompt, the similar sentence corresponding to the text to be enhanced, and the similar sentence is mixed with the mining difficult sample data set to obtain the enhanced data set.
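Mirroring the first-Prompt sketch above, the second Prompt for similar-sentence generation could be assembled as follows; again the function name and field strings are illustrative assumptions:

    def build_second_prompt(task, cases, text_to_enhance):
        """Assemble the second Prompt for similar-sentence generation.

        `cases` is a list of (original, similar) pairs; the model is expected
        to complete the final "Similar sentence:" line.
        """
        lines = [task]
        for original, similar in cases:
            lines.append(f"Input: {original}")
            lines.append(f"Similar sentence: {similar}")
        lines.append(f"Input: {text_to_enhance}")
        lines.append("Similar sentence:")
        return "\n".join(lines)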
In some embodiments, step 107, mixing the initial labeled data set, the first labeled data set, the mining difficult sample data set, and the enhanced data set to obtain the target labeled data set of the samples to be labeled, includes:

mixing the initial labeled data set with the first labeled data sets, mining difficult sample data sets, and enhanced data sets acquired multiple times;

determining the number of times the enhanced data sets being mixed have been acquired, and outputting the target labeled data set of the samples to be labeled when the number of times the enhanced data set has been acquired is greater than the preset number of executions.

As for the multiple acquisitions: after the initial labeled data set, the first labeled data set, the mining difficult sample data set, and the enhanced data set are mixed, when the number of times the enhanced data set has been acquired is less than or equal to the preset number of executions, the process returns to the step of inputting the samples to be labeled in the initial labeled data set into the pre-trained language model that has been trained in advance and outputting the first labeled data set for the samples to be labeled, so as to obtain multiple enhanced data sets; the preset number of executions is three.

Since acquiring the enhanced data set multiple times requires returning to the execution of step 101, multiple first labeled data sets, mining difficult sample data sets, and enhanced data sets are obtained.

In a specific implementation, the initial labeled data set is mixed with the first labeled data sets, mining difficult sample data sets, and enhanced data sets acquired multiple times; the number of times the enhanced data sets being mixed have been acquired is determined, and when the number of times the enhanced data set has been acquired is greater than the preset number of executions, the target labeled data set of the samples to be labeled is output.
In some embodiments of the present application, an initial text data set is first collected; an initial labeled data set is then obtained in response to an annotation instruction operation on the initial text data set, where the initial labeled data set contains samples to be labeled; the samples to be labeled in the initial labeled data set are input into a pre-trained language model that has been trained in advance, and a first labeled data set for the samples to be labeled is output; the first labeled data set is input into a preset screening network and partitioned into confidence data sets corresponding to different confidence levels; a mining difficult sample data set is obtained in response to an annotation instruction operation on the confidence data set below a preset confidence level; data enhancement is performed on the mining difficult sample data set to obtain an enhanced data set; finally, the initial labeled data set, the first labeled data set, the mining difficult sample data set, and the enhanced data set are mixed to obtain the target labeled data set of the samples to be labeled. By using a pre-trained language model to automatically annotate text, the present application quickly produces annotation labels and greatly improves annotation efficiency, and the difficult samples in the data set are enhanced, which helps to improve the accuracy of downstream tasks.
To enable those skilled in the art to better understand the technical solutions of some embodiments of the present application, an example is given below. A code sketch of the loop appears after these steps.

Referring to FIG. 4, which shows a flowchart of a medical text annotation method provided in some embodiments of the present application:

S1: collecting an initial medical text data set, and selecting a small amount of data for manual annotation; the result of manual annotation is called the initial labeled data set, with N = 1 (N is the number of annotation rounds).

S2: taking the samples and labels of the initial labeled data set from the previous round of annotation to construct the first Prompt of the pre-trained language model, using the pre-trained language model to annotate the samples to be labeled, and obtaining the annotation result, called the first labeled data set;

where the first Prompt of the pre-trained language model contains the samples to be labeled; the first Prompt is input as data into the pre-trained language model, and the pre-trained language model annotates the samples to be labeled in the first Prompt.

S3: inputting the first labeled data set annotated by the pre-trained language model into the screening network, and dividing the data set into batches according to the confidence of the embedding-layer results of the screening network;

S4: sampling the batch with lower confidence and performing manual annotation to obtain one round of the mining difficult sample data set;

S5: generating similar sentences for the data in this round's mining difficult sample data set to form one round of the enhanced data set;

S6: mixing the data set annotated in each round with the initial labeled data set;

S7: for the enhanced data set, repeating the process S2-S5 a fixed number of times, the fixed number being three: determining whether the annotation round N is less than or equal to three; if N is less than or equal to three, returning to step S2; if N is greater than three, the fixed-number condition of three for the enhanced data set is also satisfied, and the target labeled data set is output.
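Read as pseudocode, one plausible reading of the loop S1-S7 is the following sketch; `annotate_with_llm`, `screen`, `manual_label`, and `generate_similar` are hypothetical stand-ins (supplied by the caller) for the steps above, not functions defined by the patent:

    def iterative_annotation(initial_texts, annotate_with_llm, screen,
                             manual_label, generate_similar, max_rounds=3):
        """Sketch of the S1-S7 loop: LLM annotation, confidence screening,
        hard-sample mining, similar-sentence enhancement, and mixing,
        repeated max_rounds times (three in the example above)."""
        labeled = manual_label(initial_texts)      # S1: initial labeled data set
        target = list(labeled)
        for n in range(1, max_rounds + 1):         # N = 1, 2, 3
            first = annotate_with_llm(labeled)     # S2: first labeled data set
            batches = screen(first)                # S3: confidence batches
            hard = manual_label(batches["low"])    # S4: mining difficult samples
            enhanced = generate_similar(hard)      # S5: enhanced data set
            target += first + hard + enhanced      # S6: mix with previous data
            labeled = hard + enhanced              # seed the next round
        return target                              # S7: target labeled data set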
It should be noted that, for simplicity of description, the method embodiments are expressed as a series of action combinations; however, those skilled in the art should know that some embodiments of the present application are not limited by the described order of actions, because according to some embodiments of the present application, certain steps can be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by some embodiments of the present application.
Referring to FIG. 5, which shows a structural block diagram of a text annotation apparatus provided in some embodiments of the present application, the apparatus may specifically include the following modules:

an initial text data set collection module 501, configured to collect an initial text data set;

an initial labeled data set acquisition module 502, configured to obtain an initial labeled data set in response to an annotation instruction operation on the initial text data set, where the initial labeled data set contains samples to be labeled;

a first labeled data set acquisition module 503, configured to input the samples to be labeled in the initial labeled data set into a pre-trained language model that has been trained in advance and output a first labeled data set for the samples to be labeled;

a confidence data set partitioning module 504, configured to input the first labeled data set into a preset screening network and partition the first labeled data set into confidence data sets corresponding to different confidence levels;

a mining difficult sample data set acquisition module 505, configured to obtain a mining difficult sample data set in response to an annotation instruction operation on the confidence data set below a preset confidence level;

an enhanced data set acquisition module 506, configured to perform data enhancement on the mining difficult sample data set to obtain an enhanced data set;

a target labeled data set acquisition module 507, configured to mix the initial labeled data set, the first labeled data set, the mining difficult sample data set, and the enhanced data set to obtain a target labeled data set of the samples to be labeled.
In an optional embodiment, the first labeled data set acquisition module 503 is specifically configured to:

input the pre-constructed first Prompt of the pre-trained language model as data into the pre-trained language model to obtain the first labeled data set of the samples to be labeled.

In an optional embodiment, the first labeled data set acquisition module 503 is specifically configured to:

input the first Prompt of the pre-trained language model as data into the pre-trained language model, and output, according to the first case, the named entity corresponding to the text to be labeled in the sample to be labeled in the first Prompt;

obtain the label corresponding to the named entity, and annotate the named entity with the label;

form data pairs from the labeled named entities and the text to be labeled, where a plurality of the data pairs constitute the first labeled data set of the samples to be labeled.

In an optional embodiment, the confidence data set partitioning module 504 is specifically configured to:

partition the first labeled data set into a high-confidence data set if it falls in the high-confidence region preset by the embedding layer of the screening network;

partition the first labeled data set into a medium-confidence data set if it falls in the medium-confidence region preset by the embedding layer of the screening network;

partition the first labeled data set into a low-confidence data set if it falls in the low-confidence region preset by the embedding layer of the screening network.

In an optional embodiment, the enhanced data set acquisition module 506 is specifically configured to:

input all data in the mining difficult sample data set into the second Prompt of the pre-trained language model that has been trained in advance, and input the second Prompt of the pre-trained language model as data into the pre-trained language model to obtain the enhanced data set.

In an optional embodiment, the enhanced data set acquisition module 506 is further specifically configured to:

input all data in the mining difficult sample data set into the second Prompt of the pre-trained language model that has been trained in advance, and input the second Prompt of the pre-trained language model as data into the pre-trained language model;

output similar sentences for all data in the mining difficult sample data set;

mix the similar sentences with the mining difficult sample data set to obtain the enhanced data set.

In an optional embodiment, the mining difficult sample data set contains text to be enhanced, and the enhanced data set acquisition module 506 is specifically configured to:

input the text to be enhanced in the mining difficult sample data set into the second Prompt of the pre-trained language model, and input the second Prompt of the pre-trained language model as data into the pre-trained language model;

output, according to the second case, the similar sentence corresponding to the text to be enhanced;

mix the similar sentence with the mining difficult sample data set to obtain the enhanced data set.

In an optional embodiment, the target labeled data set acquisition module 507 is specifically configured to:

mix the initial labeled data set with the first labeled data sets, mining difficult sample data sets, and enhanced data sets acquired multiple times;

determine the number of times the enhanced data sets being mixed have been acquired, and output the target labeled data set of the samples to be labeled when the number of times the enhanced data set has been acquired is greater than the preset number of executions.
As the apparatus embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, reference may be made to the corresponding parts of the method embodiments.

In addition, some embodiments of the present application further provide an electronic device, including a processor, a memory, and a computer program stored in the memory and executable on the processor, where the computer program, when executed by the processor, implements each process of the above text annotation method embodiments and can achieve the same technical effects; to avoid repetition, details are not repeated here.
FIG. 6 is a schematic structural diagram of a non-volatile readable storage medium provided in some embodiments of the present application.

Some embodiments of the present application further provide a non-volatile readable storage medium 601, on which a computer program is stored; when the computer program is executed by a processor, each process of the above text annotation method embodiments is implemented, and the same technical effects can be achieved; to avoid repetition, details are not repeated here. The non-volatile readable storage medium 601 is, for example, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

FIG. 7 is a schematic diagram of the hardware structure of an electronic device implementing various embodiments of the present application.
The electronic device 700 includes, but is not limited to, a radio frequency unit 701, a network module 702, an audio output unit 703, an input unit 704, a sensor 705, a display unit 706, a user input unit 707, an interface unit 708, a memory 709, a processor 710, and a power supply 711, among other components. Those skilled in the art will understand that the electronic device structure shown in FIG. 7 does not constitute a limitation on the electronic device; the electronic device may include more or fewer components than shown, combine certain components, or use a different component arrangement. In some embodiments of the present application, electronic devices include, but are not limited to, mobile phones, tablet computers, notebook computers, palmtop computers, vehicle-mounted terminals, wearable devices, pedometers, and the like.

It should be understood that, in some embodiments of the present application, the radio frequency unit 701 can be used for receiving and sending signals during information transmission and reception or during a call; specifically, it receives downlink data from a base station and delivers it to the processor 710 for processing, and sends uplink data to the base station. Generally, the radio frequency unit 701 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low-noise amplifier, a duplexer, and the like. In addition, the radio frequency unit 701 can also communicate with networks and other devices through a wireless communication system.

The electronic device provides the user with wireless broadband Internet access through the network module 702, for example, helping the user to send and receive e-mail, browse web pages, and access streaming media.
The audio output unit 703 can convert the audio data received by the radio frequency unit 701 or the network module 702, or stored in the memory 709, into an audio signal and output it as sound. Moreover, the audio output unit 703 can also provide audio output related to a specific function performed by the electronic device 700 (for example, a call signal reception sound, a message reception sound, etc.). The audio output unit 703 includes a speaker, a buzzer, a receiver, and the like.

The input unit 704 is used to receive audio or video signals. The input unit 704 may include a graphics processing unit (GPU) 7041 and a microphone 7042; the graphics processor 7041 processes the image data of static pictures or video obtained by an image capture device (such as a camera) in a video capture mode or an image capture mode. The processed image frames can be displayed on the display unit 706. The image frames processed by the graphics processor 7041 can be stored in the memory 709 (or other storage medium) or sent via the radio frequency unit 701 or the network module 702. The microphone 7042 can receive sound and can process such sound into audio data; in the case of a telephone call mode, the processed audio data can be converted into a format that can be sent to a mobile communication base station via the radio frequency unit 701 and output.

The electronic device 700 also includes at least one sensor 705, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor includes an ambient light sensor and a proximity sensor; the ambient light sensor can adjust the brightness of the display panel 7061 according to the brightness of the ambient light, and the proximity sensor can turn off the display panel 7061 and/or the backlight when the electronic device 700 is moved to the ear. As one kind of motion sensor, the accelerometer sensor can detect the magnitude of acceleration in all directions (generally three axes) and, when stationary, the magnitude and direction of gravity, which can be used to identify the posture of the electronic device (such as horizontal/vertical screen switching, related games, and magnetometer posture calibration) and for vibration-recognition-related functions (such as a pedometer or tapping); the sensor 705 can also include a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, etc., which will not be repeated here.

The display unit 706 is used to display information input by the user or information provided to the user. The display unit 706 may include a display panel 7061, which may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like.

The user input unit 707 can be used to receive input digital or character information and to generate key signal input related to the user settings and function control of the electronic device. Specifically, the user input unit 707 includes a touch panel 7071 and other input devices 7072. The touch panel 7071, also known as a touch screen, can collect the user's touch operations on or near it (such as operations performed on or near the touch panel 7071 with a finger, a stylus, or any other suitable object or accessory). The touch panel 7071 may include two parts: a touch detection device and a touch controller. The touch detection device detects the user's touch orientation, detects the signals generated by the touch operation, and transmits the signals to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact-point coordinates, sends them to the processor 710, and receives and executes the commands sent by the processor 710. In addition, the touch panel 7071 can be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave types. Besides the touch panel 7071, the user input unit 707 may also include other input devices 7072. Specifically, the other input devices 7072 may include, but are not limited to, a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, and a joystick, which are not described in detail here.

Further, the touch panel 7071 may cover the display panel 7061; when the touch panel 7071 detects a touch operation on or near it, the operation is transmitted to the processor 710 to determine the type of the touch event, and the processor 710 then provides a corresponding visual output on the display panel 7061 according to the type of the touch event. Although in FIG. 7 the touch panel 7071 and the display panel 7061 are implemented as two independent components to realize the input and output functions of the electronic device, in some embodiments the touch panel 7071 and the display panel 7061 may be integrated to realize the input and output functions of the electronic device, which is not limited herein.

The interface unit 708 is an interface for connecting an external device to the electronic device 700. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device with an identification module, an audio input/output (I/O) port, a video I/O port, a headphone port, and so on. The interface unit 708 may be used to receive input (for example, data information, power, etc.) from an external device and transmit the received input to one or more elements within the electronic device 700, or may be used to transmit data between the electronic device 700 and an external device.

The memory 709 can be used to store software programs and various data. The memory 709 can mainly include a program storage area and a data storage area, where the program storage area can store an operating system, an application required for at least one function (such as a sound playback function or an image playback function), and the like, and the data storage area can store data created according to the use of the mobile phone (such as audio data and a phone book), and the like. In addition, the memory 709 can include a high-speed random access memory, and can also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage device.

The processor 710 is the control center of the electronic device. It uses various interfaces and lines to connect the various parts of the entire electronic device, and executes the various functions of the electronic device and processes data by running or executing the software programs and/or modules stored in the memory 709 and calling the data stored in the memory 709, thereby monitoring the electronic device as a whole. The processor 710 may include one or more processing units; preferably, the processor 710 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, and application programs, and the modem processor mainly handles wireless communication. It is understandable that the above-mentioned modem processor may also not be integrated into the processor 710.

The electronic device 700 may also include a power supply 711 (such as a battery) for supplying power to the various components; preferably, the power supply 711 may be logically connected to the processor 710 through a power management system, thereby implementing functions such as charging, discharging, and power-consumption management through the power management system.

In addition, the electronic device 700 includes some functional modules not shown, which will not be described in detail here.
It should be noted that, in this document, the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or apparatus that includes a list of elements includes not only those elements but also other elements not expressly listed, or also includes elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the existence of other identical elements in the process, method, article, or apparatus that includes the element.

Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disk) and includes a number of instructions for enabling a terminal (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods of the various embodiments of the present application.

The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the specific implementations described above; the specific implementations described above are merely illustrative rather than restrictive. Under the inspiration of the present application, those of ordinary skill in the art can devise many further forms without departing from the spirit of the present application and the scope protected by the claims, all of which fall within the protection of the present application.

Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed in some embodiments of the present application can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application.

Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.

In the embodiments provided in the present application, it should be understood that the disclosed apparatuses and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for example, the division of units is only a division by logical function, and there may be other division methods in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connections shown or discussed may be implemented through interfaces, and the indirect coupling or communication connections between apparatuses or units may be electrical, mechanical, or in other forms.

The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.

If the functions are implemented in the form of a software functional unit and sold or used as an independent product, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, the part contributing to the prior art, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes a number of instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the various embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.

The above are only specific implementations of the present application, but the protection scope of the present application is not limited thereto; any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed in the present application, and all of them shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

  1. A text annotation method, characterized in that it comprises:
    collecting an initial text data set;
    obtaining an initial labeled data set in response to an annotation instruction operation on the initial text data set, wherein the initial labeled data set contains samples to be labeled;
    inputting the samples to be labeled in the initial labeled data set into a pre-trained language model that has been trained in advance, and outputting a first labeled data set for the samples to be labeled;
    inputting the first labeled data set into a preset screening network, and partitioning the first labeled data set into confidence data sets corresponding to different confidence levels;
    obtaining a mining difficult sample data set in response to an annotation instruction operation on the confidence data set below a preset confidence level;
    performing data enhancement on the mining difficult sample data set to obtain an enhanced data set;
    mixing the initial labeled data set, the first labeled data set, the mining difficult sample data set, and the enhanced data set to obtain a target labeled data set of the samples to be labeled.
  2. The method according to claim 1, characterized in that inputting the samples to be labeled in the initial labeled data set into the pre-trained language model that has been trained in advance and outputting the first labeled data set for the samples to be labeled comprises:
    inputting a pre-constructed first Prompt of the pre-trained language model as data into the pre-trained language model to obtain the first labeled data set of the samples to be labeled.
  3. The method according to claim 2, characterized in that the first Prompt of the pre-trained language model consists of a labeling task, a first case, and a sample to be labeled.
  4. The method according to claim 3, characterized in that the first case is a case of performing the labeling task, the first case consists of a labeled text, a labeling task, and a named entity, the labeled text contains the named entity, the sample to be labeled consists of a text to be labeled and a labeling task, and the text to be labeled contains a named entity.
  5. The method according to claim 4, characterized in that inputting the samples to be labeled in the initial labeled data set into the pre-trained language model that has been trained in advance and outputting the first labeled data set for the samples to be labeled comprises:
    inputting the first Prompt of the pre-trained language model as data into the pre-trained language model, and outputting, according to the first case, the named entity corresponding to the text to be labeled of the sample to be labeled in the first Prompt of the pre-trained language model;
    obtaining the label corresponding to the named entity, and annotating the named entity with the label;
    forming data pairs from the labeled named entities and the text to be labeled, wherein a plurality of the data pairs constitute the first labeled data set of the samples to be labeled.
  6. The method according to claim 4, characterized in that the named entities include at least a symptom subject, a modifier of the symptom subject, a symptom description, a modifier of the symptom description, an examination item name, and an examination result, and the modifier of the symptom description includes at least property, frequency, time, condition, and degree.
  7. The method according to claim 1, characterized in that the preset screening network includes an input layer, an embedding layer, a long short-term memory layer, an attention layer, a conditional random field layer, and a classification network layer, wherein the embedding layer carries a confidence used to partition the first labeled data set into the confidence data sets corresponding to different confidence levels.
  8. The method according to claim 1, characterized in that the confidence data sets include a high-confidence data set, a medium-confidence data set, and a low-confidence data set.
  9. The method according to claim 1, characterized in that inputting the first labeled data set into the preset screening network and partitioning the first labeled data set into confidence data sets corresponding to different confidence levels comprises:
    if the first labeled data set falls in a high-confidence region preset by the embedding layer of the screening network, partitioning the first labeled data set into a high-confidence data set;
    if the first labeled data set falls in a medium-confidence region preset by the embedding layer of the screening network, partitioning the first labeled data set into a medium-confidence data set;
    if the first labeled data set falls in a low-confidence region preset by the embedding layer of the screening network, partitioning the first labeled data set into a low-confidence data set.
  10. The method according to claim 1, characterized in that performing data enhancement on the mining difficult sample data set to obtain the enhanced data set comprises:
    inputting all data in the mining difficult sample data set into a second Prompt of the pre-trained language model that has been trained in advance, and inputting the second Prompt of the pre-trained language model as data into the pre-trained language model to obtain the enhanced data set.
  11. The method according to claim 10, characterized in that, in performing data enhancement on the mining difficult sample data set to obtain the enhanced data set, the method further comprises:
    inputting all data in the mining difficult sample data set into the second Prompt of the pre-trained language model that has been trained in advance, and inputting the second Prompt of the pre-trained language model as data into the pre-trained language model;
    outputting similar sentences for all data in the mining difficult sample data set;
    mixing the similar sentences with the mining difficult sample data set to obtain the enhanced data set.
  12. The method according to claim 11, characterized in that the second Prompt of the pre-trained language model consists of an enhancement task, a second case, and a sample to be enhanced.
  13. The method according to claim 12, characterized in that the second case is a case of performing the enhancement task, and the second case consists of a labeled text, an enhancement task, and a similar sentence.
  14. The method according to claim 13, characterized in that the mining difficult sample data set contains text to be enhanced, and performing data enhancement on the mining difficult sample data set to obtain the enhanced data set comprises:
    inputting the text to be enhanced in the mining difficult sample data set into the second Prompt of the pre-trained language model, and inputting the second Prompt of the pre-trained language model as data into the pre-trained language model;
    outputting, according to the second case, the similar sentence corresponding to the text to be enhanced;
    mixing the similar sentence with the mining difficult sample data set to obtain the enhanced data set.
  15. The method according to claim 1, characterized in that, after performing data enhancement on the mining difficult sample data set to obtain the enhanced data set, the method further comprises:
    acquiring the enhanced data set multiple times to obtain multiple enhanced data sets.
  16. The method according to claim 15, characterized in that, after mixing the initial labeled data set, the first labeled data set, the mining difficult sample data set, and the enhanced data set, the method further comprises:
    when the number of times the enhanced data set has been acquired is less than or equal to a preset number of executions, returning to the step of inputting the samples to be labeled in the initial labeled data set into the pre-trained language model that has been trained in advance and outputting the first labeled data set for the samples to be labeled.
  17. The method according to claim 16, characterized in that mixing the initial labeled data set, the first labeled data set, the mining difficult sample data set, and the enhanced data set to obtain the target labeled data set of the samples to be labeled comprises:
    mixing the initial labeled data set with the first labeled data sets, the mining difficult sample data sets, and the enhanced data sets acquired multiple times;
    determining the number of times the enhanced data sets being mixed have been acquired, and outputting the target labeled data set of the samples to be labeled when the number of times the enhanced data set has been acquired is greater than the preset number of executions.
  18. A text annotation apparatus, characterized in that it comprises:
    an initial text data set collection module, configured to collect an initial text data set;
    an initial labeled data set acquisition module, configured to obtain an initial labeled data set in response to an annotation instruction operation on the initial text data set, wherein the initial labeled data set contains samples to be labeled;
    a first labeled data set acquisition module, configured to input the samples to be labeled in the initial labeled data set into a pre-trained language model that has been trained in advance and output a first labeled data set for the samples to be labeled;
    a confidence data set partitioning module, configured to input the first labeled data set into a preset screening network and partition the first labeled data set into confidence data sets corresponding to different confidence levels;
    a mining difficult sample data set acquisition module, configured to obtain a mining difficult sample data set in response to an annotation instruction operation on the confidence data set below a preset confidence level;
    an enhanced data set acquisition module, configured to perform data enhancement on the mining difficult sample data set to obtain an enhanced data set;
    a target labeled data set acquisition module, configured to mix the initial labeled data set, the first labeled data set, the mining difficult sample data set, and the enhanced data set to obtain a target labeled data set of the samples to be labeled.
  19. An electronic device, characterized in that it comprises a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other via the communication bus;
    the memory is configured to store a computer program;
    the processor is configured to implement the method according to any one of claims 1-17 when executing the program stored in the memory.
  20. A non-volatile readable storage medium having instructions stored thereon which, when executed by one or more processors, cause the processors to perform the method according to any one of claims 1-17.
PCT/CN2023/101690 2022-12-05 2023-06-21 一种文本的标注方法、装置、电子设备及可读存储介质 WO2024119773A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211549045.5 2022-12-05
CN202211549045.5A CN115640808B (zh) 2022-12-05 2022-12-05 一种文本的标注方法、装置、电子设备及可读存储介质

Publications (1)

Publication Number Publication Date
WO2024119773A1 true WO2024119773A1 (zh) 2024-06-13

Family

ID=84949263

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/101690 WO2024119773A1 (zh) 2022-12-05 2023-06-21 一种文本的标注方法、装置、电子设备及可读存储介质

Country Status (2)

Country Link
CN (1) CN115640808B (zh)
WO (1) WO2024119773A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115640808B (zh) * 2022-12-05 2023-03-21 苏州浪潮智能科技有限公司 一种文本的标注方法、装置、电子设备及可读存储介质

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347769A (zh) * 2020-10-30 2021-02-09 北京百度网讯科技有限公司 实体识别模型的生成方法、装置、电子设备及存储介质
US11048979B1 (en) * 2018-11-23 2021-06-29 Amazon Technologies, Inc. Active learning loop-based data labeling service
US20210209463A1 (en) * 2020-01-08 2021-07-08 International Business Machines Corporation Dual model incremental learning
US20210287084A1 (en) * 2020-03-13 2021-09-16 International Business Machines Corporation Determining optimal augmentations for a training data set
CN113590764A (zh) * 2021-09-27 2021-11-02 智者四海(北京)技术有限公司 训练样本构建方法、装置、电子设备和存储介质
CN113901823A (zh) * 2021-10-22 2022-01-07 平安科技(深圳)有限公司 命名实体识别方法、装置、存储介质及终端设备
CN114022737A (zh) * 2021-11-16 2022-02-08 胜斗士(上海)科技技术发展有限公司 对训练数据集进行更新的方法和设备
CN115640808A (zh) * 2022-12-05 2023-01-24 苏州浪潮智能科技有限公司 一种文本的标注方法、装置、电子设备及可读存储介质

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110704633B (zh) * 2019-09-04 2023-07-21 平安科技(深圳)有限公司 命名实体识别方法、装置、计算机设备及存储介质
CN111783518A (zh) * 2020-05-14 2020-10-16 北京三快在线科技有限公司 训练样本生成方法、装置、电子设备及可读存储介质
CN111859953B (zh) * 2020-06-22 2023-08-22 北京百度网讯科技有限公司 训练数据的挖掘方法、装置、电子设备及存储介质
CN114861600B (zh) * 2022-07-07 2022-12-13 之江实验室 一种面向ner的中文临床文本数据增强方法及装置

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11048979B1 (en) * 2018-11-23 2021-06-29 Amazon Technologies, Inc. Active learning loop-based data labeling service
US20210209463A1 (en) * 2020-01-08 2021-07-08 International Business Machines Corporation Dual model incremental learning
US20210287084A1 (en) * 2020-03-13 2021-09-16 International Business Machines Corporation Determining optimal augmentations for a training data set
CN112347769A (zh) * 2020-10-30 2021-02-09 北京百度网讯科技有限公司 实体识别模型的生成方法、装置、电子设备及存储介质
CN113590764A (zh) * 2021-09-27 2021-11-02 智者四海(北京)技术有限公司 训练样本构建方法、装置、电子设备和存储介质
CN113901823A (zh) * 2021-10-22 2022-01-07 平安科技(深圳)有限公司 命名实体识别方法、装置、存储介质及终端设备
CN114022737A (zh) * 2021-11-16 2022-02-08 胜斗士(上海)科技技术发展有限公司 对训练数据集进行更新的方法和设备
CN115640808A (zh) * 2022-12-05 2023-01-24 苏州浪潮智能科技有限公司 一种文本的标注方法、装置、电子设备及可读存储介质

Also Published As

Publication number Publication date
CN115640808A (zh) 2023-01-24
CN115640808B (zh) 2023-03-21

Similar Documents

Publication Publication Date Title
US10762450B2 (en) Diagnosis-driven electronic charting
US10950332B2 (en) Targeted sensation of touch
Abu-Naser et al. Knowledge management in ESMDA: expert system for medical diagnostic assistance
WO2016168980A1 (zh) 一种生理体征信息获取方法和系统
CN110675951A (zh) 智能化的疾病诊断方法及装置、计算机设备与可读介质
US20140073882A1 (en) Clinical diagnosis objects authoring
CN110390841A (zh) 数字病人的问诊训练方法、终端与系统
CN106407666A (zh) 一种电子病历信息的生成方法、装置及系统
WO2024119773A1 (zh) 一种文本的标注方法、装置、电子设备及可读存储介质
US10698983B2 (en) Wireless earpiece with a medical engine
WO2012003397A2 (en) Diagnosis-driven electronic charting
WO2021121226A1 (zh) 一种心电信号的预测方法、装置、终端以及存储介质
CN108292306A (zh) 电子临床自由文本的阅读者驱动的释义
CN103761437A (zh) 一种基于临床数据的科研数据自动生成系统
TW201606690A (zh) 護理決策輔助系統
Beck et al. Wearable, multimodal, vitals acquisition unit for intelligent field triage
CN111653273A (zh) 一种基于智能手机的院外肺炎初步识别方法
US20200090813A1 (en) Method of Constructing Database
US20040225476A1 (en) Inspection apparatus for diagnosis
CN109310403A (zh) 女性的月经周期期间的面部特性的光学监测
CN117037571A (zh) 一种基于ai人工智能辅助沟通系统
CN111840081A (zh) 一种用药提醒方法、系统、计算机设备
CN201996534U (zh) 临床医学智能诊疗系统
US20230018077A1 (en) Medical information processing system, medical information processing method, and storage medium
CN104887189A (zh) 一种基于医嘱的体征监测提示反馈方法