WO2020215456A1 - Text labeling method and device based on teacher supervision - Google Patents

Text labeling method and device based on teacher supervision Download PDF

Info

Publication number
WO2020215456A1
Authority
WO
WIPO (PCT)
Prior art keywords
labeling
character
text
word segmentation
model
Prior art date
Application number
PCT/CN2019/090336
Other languages
English (en)
French (fr)
Inventor
蔡子健
李金锋
Original Assignee
网宿科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 网宿科技股份有限公司
Priority to EP19886050.4A (published as EP3751445A4)
Priority to US16/888,591 (published as US20200380209A1)
Publication of WO2020215456A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148: Generating training patterns; Bootstrap methods, e.g. bagging or boosting, characterised by the process organisation or structure, e.g. boosting cascade
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/251: Fusion techniques of input or preprocessed data
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/40: Processing or translation of natural language
    • G06F40/53: Processing of non-Latin text
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/40: Processing or translation of natural language
    • G06F40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00: Computing arrangements based on specific mathematical models
    • G06N7/01: Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • This application relates to the technical field of natural language processing, and in particular to a text labeling method and device based on teacher supervision.
  • Natural Language Processing (NLP) technology can efficiently perform systematic analysis, understanding, and information extraction on text data, enabling computers to understand and generate natural language, and thereby allowing humans and computers to interact effectively in natural language (for example, in applications such as automatic message replies and voice assistants).
  • Text annotation technology provides a foundation for the industrial application of natural language processing.
  • Text annotation technology for Chinese usually labels the text to be annotated with a deep learning model based on character granularity. As natural language processing technology keeps developing, however, existing character-granularity deep learning models are no longer sufficient to meet its ever-increasing accuracy requirements for text annotation. Moreover, when a well-trained deep learning model is applied to a new domain, its recall is insufficient or even zero: the model generalizes poorly, and word-boundary labeling easily solidifies.
  • The purpose of some embodiments of this application is to provide a text labeling method and device based on teacher supervision. The technical solution is as follows.
  • In a first aspect, a text labeling method based on teacher supervision includes: using a character labeling model to label the text to be labeled, generating a character labeling result containing labeled words; performing word segmentation processing on the text to be labeled through a preset word segmentation model, generating a word segmentation result containing segmented words; and, according to the similarity between each labeled word and each segmented word, re-labeling the character labeling result based on the segmented words to obtain and output a fusion labeling result.
  • In some embodiments, before the character labeling model is used to label the text to be labeled, the method further includes: training an initial character labeling model with the labeled texts in a training sample set to generate the character labeling model.
  • In some embodiments, after the fusion labeling result is obtained, the method further includes: training the character labeling model based on the fusion labeling result and the training sample set. This training includes: adding the fusion labeling result to a fusion label set; extracting a preset number of labeled texts from the fusion label set and the training sample set to generate a new training sample set; and training the character labeling model with the new training sample set. Before the new training sample set is used, the method may further include: if the word segmentation model fails to segment the text to be labeled, adding the character labeling result to a recycled label set; and extracting a preset number of character labeling results from the recycled label set and adding them to the new training sample set.
  • In some embodiments, performing word segmentation processing on the text to be labeled through the preset word segmentation model includes: if the average confidence of the character labeling result exceeds a confidence threshold, segmenting the text to be labeled through the preset word segmentation model to generate a word segmentation result containing segmented words.
  • In some embodiments, re-labeling the character labeling result based on the segmented words to obtain the fusion labeling result includes: permuting and combining each labeled word in the character labeling result with each segmented word in the word segmentation result to obtain related word pairs; computing the similarity of all related word pairs and replacing labeled words with the segmented words of the pairs whose similarity exceeds a similarity threshold; and re-labeling the replaced character labeling result to obtain the fusion labeling result.
  • In some embodiments, the method further includes: updating the confidence threshold and the similarity threshold according to the number of training iterations of the character labeling model, following a preset decreasing function.
  • In a second aspect, a text annotation device based on teacher supervision includes: a character labeling module, configured to label the text to be labeled using the character labeling model and generate a character labeling result containing labeled words; a word segmentation module, configured to perform word segmentation processing on the text to be labeled through a preset word segmentation model and generate a word segmentation result containing segmented words; and a fusion labeling module, configured to re-label the character labeling result based on the segmented words according to the similarity between each labeled word and each segmented word, and to obtain and output the fusion labeling result.
  • In some embodiments, the character labeling module is further configured to: train the initial character labeling model with the labeled texts in the training sample set to generate the character labeling model; train the character labeling model based on the fusion labeling result and the training sample set; extract a preset number of labeled texts from the fusion label set and the training sample set to generate a new training sample set; and extract a preset number of character labeling results from the recycled label set and add them to the new training sample set.
  • In some embodiments, the word segmentation module is further configured to add the character labeling result to the recycled label set if the word segmentation model fails to segment the text to be labeled, and is specifically configured to segment the text to be labeled through the preset word segmentation model only when the average confidence of the character labeling result exceeds the confidence threshold.
  • In some embodiments, the fusion labeling module is further configured to: add the fusion labeling result to the fusion label set; permute and combine each labeled word with each segmented word to obtain related word pairs, compute the similarity of all related word pairs, and replace labeled words with the segmented words of the pairs whose similarity exceeds the similarity threshold; re-label the replaced character labeling result to obtain the fusion labeling result; and update the confidence threshold and the similarity threshold according to the number of training iterations of the character labeling model, following the preset decreasing function.
  • In a third aspect, a text tagging device based on teacher supervision includes a processor and a memory. The memory stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the text annotation method based on teacher supervision described in the first aspect.
  • In a fourth aspect, a computer-readable storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the text annotation method based on teacher supervision described in the first aspect.
  • The technical solutions provided by the embodiments of this application bring the following beneficial effects. First, the word segmentation model checks and corrects the character labeling results of the character labeling model, improving the accuracy and reliability with which the character labeling model labels the text to be labeled. Second, the fusion labeling results finally obtained are used as training samples to train the character labeling model, which then labels the remaining texts; this optimizes the model parameters required by the text labeling task and makes the character labeling results more credible. Third, when the character labeling model is applied to a new domain, the text labeling device can quickly check and correct the character labeling results through the teacher supervision algorithm and use the fusion labeling results to reinforce training, improving the model's accuracy. Fourth, adding texts that contain new words the word segmentation model cannot recognize to the training sample set enhances the generalization of the character labeling model, avoids the solidification of word-boundary labeling, and thereby improves the recall of the character labeling model.
  • Figure 1 is a flowchart of a text labeling method based on teacher supervision provided by an embodiment of this application;
  • Figure 2 is a logical schematic diagram of a text labeling method based on teacher supervision provided by an embodiment of this application;
  • Figure 3 is a schematic diagram of the functional modules of a text labeling device based on teacher supervision provided by an embodiment of this application;
  • Figure 4 is a schematic structural diagram of a text labeling device based on teacher supervision provided by an embodiment of this application.
  • An embodiment of this application provides a text labeling method based on teacher supervision. The method may be executed by a text labeling device. The text labeling device may use a deep learning model based on character granularity (referred to as the character labeling model) to label the large number of texts to be labeled in a text labeling task, then use a language model based on word granularity (referred to as the word segmentation model) to perform word segmentation on the same texts, and then use the segmentation result (referred to as the word segmentation result) to check and correct the preliminary labeling result (referred to as the character labeling result), taking the fusion labeling result as the final labeling result of the text to be labeled.
  • The text labeling device may include a processor and a memory: the processor performs the text labeling processing in the following procedure, and the memory stores the data the processing needs and produces.
  • Figure 1 is the flowchart of the method; Figure 2 shows its implementation logic, where the serial numbers 1 to 11 represent the logical order of the steps the text labeling device performs while executing the method.
  • Step 101: The text labeling device uses the character labeling model to label the text to be labeled and generates a character labeling result containing labeled words.
  • In implementation, for labeling tasks whose texts are in a language without explicit word boundaries (such as Chinese), a text to be labeled usually contains one or more named words, and a word may be a single character or consist of two or more characters. The text labeling device therefore uses the character labeling model to predict the label of each character in each text to be labeled, recognizes the one or more words each text contains, and generates a character labeling result containing one or more labeled words.
  • Take a Named Entity Recognition (NER) model as the character labeling model, and suppose all labeled and unlabeled texts contain two classes of named entities, place names and organizations, so that each character corresponds to one of five labels:
  • LOC-B: first character of a place name
  • LOC-I: non-first character of a place name
  • ORG-B: first character of an organization name
  • ORG-I: non-first character of an organization name
  • O: not part of a named entity
  • For the text "日本的富士山" ("Mount Fuji of Japan"), the NER model's preliminary labeling of each character is 日/LOC-B, 本/LOC-I, 的/O, 富/ORG-B, 士/ORG-I, 山/ORG-I. From this preliminary result, the text labeling device generates a character labeling result containing the two labeled words "日本" (Japan) and "富士山" (Mount Fuji). A sketch of this decoding step follows.
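  • As a minimal illustration, not taken from the patent itself, of how labeled words can be recovered from such per-character B/I/O tags, the following Python sketch decodes a tag sequence into entity spans; the function name and data layout are assumptions of this example.

```python
def decode_bio(chars, tags):
    """Group per-character tags such as LOC-B / LOC-I / O into labeled words."""
    words, current, current_type = [], [], None
    for ch, tag in zip(chars, tags):
        if tag.endswith("-B"):          # first character of a new entity
            if current:
                words.append(("".join(current), current_type))
            current, current_type = [ch], tag[:-2]
        elif tag.endswith("-I") and current and tag[:-2] == current_type:
            current.append(ch)          # continuation of the current entity
        else:                           # "O" or an inconsistent tag closes the entity
            if current:
                words.append(("".join(current), current_type))
            current, current_type = [], None
    if current:
        words.append(("".join(current), current_type))
    return words

chars = list("日本的富士山")
tags = ["LOC-B", "LOC-I", "O", "ORG-B", "ORG-I", "ORG-I"]
print(decode_bio(chars, tags))  # [('日本', 'LOC'), ('富士山', 'ORG')]
```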
  • The above tags are preset by technicians; different text labeling tasks may use different tag sets.
  • It is worth mentioning that some texts to be labeled may mix a small amount of another language into the main one. For example, a mostly-Chinese text to be labeled may contain the bilingual named entity "IP地址" ("IP address"). In that case, the character annotation model labels the English in the text at English word granularity: the text labeling device labels the word "IP" as a first unit and labels both "地" and "址" as non-first units.
  • In some embodiments, before labeling texts with the character labeling model, the text labeling device may first train the initial character labeling model with a preset amount of labeled text. Accordingly, the processing before step 101 may be as follows: the text labeling device trains the initial character labeling model with the labeled texts in the training sample set to generate the character labeling model.
  • In implementation, technicians first manually label a small number of the task's texts, obtaining a training sample set containing multiple labeled texts, and the text labeling device trains the initial character labeling model on these manually labeled texts to generate the character labeling model. It can be understood that the features of the texts to be labeled differ between text labeling tasks, so the model parameters the character labeling model needs to predict each character's label also differ by task. For a given text labeling task, the text labeling device therefore trains the initial character labeling model with that task's training sample set, obtains the model parameters the task requires, and initially generates a character labeling model suited to the task.
  • Step 102: The text labeling device performs word segmentation processing on the text to be labeled through a preset word segmentation model and generates a word segmentation result containing segmented words.
  • In implementation, technicians can choose a word-granularity language model with the same language representation characteristics as the character labeling model (for example, a Chinese Segmentation System, a deep contextualized word vector model (Embedding from Language Model, ELMo), or a Knowledge Graph) and let the text annotation device fine-tune the pre-trained language model through transfer learning (for example, retraining it on the manually annotated texts in the training sample set). This yields a language model (the word segmentation model) suited to the current text labeling task without training one from scratch, which reduces model training time.
  • The text labeling device can then segment the text to be labeled through the word segmentation model and generate a word segmentation result containing segmented words. Taking a Chinese word segmentation system as the word segmentation model, the device can segment the text "日本的富士山" into a word segmentation result containing the three segmented words "日本", "的", and "富士山", as in the sketch below.
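  • For a concrete feel of this step, the sketch below uses the open-source jieba segmenter as a stand-in; jieba is an assumption of this example and is not named in the patent, which fine-tunes its own word segmentation model.

```python
import jieba  # pip install jieba; a stand-in, not the segmenter trained in the patent

text = "日本的富士山"
seg_words = jieba.lcut(text)  # word-granularity segmentation into a list of words
print(seg_words)              # expected: ['日本', '的', '富士山']
```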
  • In some embodiments, the labeled words in the character labeling result generated by the character labeling model may be labeled wrongly, so a confidence threshold can be preset to assess whether the character labeling result is credible. Accordingly, the specific processing of step 102 may be as follows: if the average confidence of the character labeling result exceeds the confidence threshold, the text labeling device segments the text to be labeled through the preset word segmentation model and generates a word segmentation result containing segmented words.
  • In implementation, while labeling a text with the character labeling model, the text labeling device computes the confidence of each character's preliminary labeling result and averages the confidences of all the characters of the text, obtaining the average confidence of the text's character labeling result. When the average confidence exceeds the confidence threshold, the character labeling result is credible to a certain extent; the device then segments the text through the preset word segmentation model, uses the word segmentation result to check whether the character labeling result is correct, and corrects the wrongly labeled words. Conversely, when the average confidence does not reach the threshold, the character labeling result is not credible, the character labeling model is considered to have failed on this text, and the text can be discarded.
  • The confidence of each character's preliminary labeling result can be computed as follows: the text labeling device uses the LSTM (Long Short-Term Memory) layer of the named entity recognition model to score each character of the text against each preset label, and then, from the per-character label scores, uses the CRF (Conditional Random Fields) layer of the model to generate the character labeling result and the confidence of each character's preliminary labeling result. The confidence is an output of the CRF layer; this application does not detail its computation.
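  • A minimal sketch of the resulting gate, assuming per-character confidences have already been produced by such an LSTM-CRF tagger (the helper names and example numbers are hypothetical):

```python
def average_confidence(char_confidences):
    """Mean of the per-character confidences emitted by the CRF layer."""
    return sum(char_confidences) / len(char_confidences)

def should_segment(char_confidences, confidence_threshold):
    """Pass only credible character labeling results to the word segmentation
    model; texts below the threshold are treated as failed and discarded."""
    return average_confidence(char_confidences) > confidence_threshold

print(should_segment([0.95, 0.91, 0.88, 0.97, 0.93, 0.90], 0.9))  # True (mean is about 0.923)
```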
  • Step 103: According to the similarity between each labeled word and each segmented word, the text labeling device re-labels the character labeling result based on the segmented words, obtains the fusion labeling result, and outputs it.
  • In implementation, the text labeling device uses the word segmentation result generated by the word segmentation model to check whether the character labeling result generated by the character labeling model is correct. Specifically, the device can compute the similarity between the character labeling result and the word segmentation result with mainly statistics-based machine learning algorithms (for example, TF-IDF (Term Frequency-Inverse Document Frequency) combined with cosine similarity, Hamming distance, or SimHash). The larger the similarity between a labeled word and a segmented word, the closer the attributes and functions of the two; when the similarity reaches a set standard, the device re-labels the character labeling result at character granularity based on the segmented words, obtains the fusion labeling result, and outputs it as the labeling result.
  • In some embodiments, for the same text to be labeled, the text labeling device separately computes the similarities between all labeled words in the character labeling result and all segmented words in the word segmentation result. Accordingly, the specific processing of step 103 may be as follows: the device permutes and combines each labeled word in the character labeling result with each segmented word in the word segmentation result to obtain related word pairs; it computes the similarity of all related word pairs and replaces labeled words with the segmented words of the pairs whose similarity exceeds the similarity threshold; and it re-labels the replaced character labeling result to obtain the fusion labeling result.
  • In implementation, take the text "日本的富士山" again. The preliminary labeling produced by the named entity recognition model might be: 日/LOC-B, 本/LOC-I, 的/LOC-I, 富/O, 士/ORG-B, 山/ORG-I, so the character labeling result generated by the text labeling device contains the labeled words "日本的" and "士山", while the word segmentation result generated by the Chinese word segmentation system is "日本", "的", and "富士山". Permuting and combining the two results yields the related word pairs (日本的, 日本), (日本的, 的), (日本的, 富士山), (士山, 日本), (士山, 的), and (士山, 富士山). Using the statistics-based machine learning algorithms, the device computes that the pairs whose similarity exceeds the similarity threshold are (日本的, 日本) and (士山, 富士山), and it replaces the labeled words "日本的" and "士山" in the character labeling result with the segmented words "日本" and "富士山", respectively. Because segmented words carry no character-granularity labels, the device re-labels the replaced character labeling result and obtains the fusion labeling result: 日/LOC-B, 本/LOC-I, 的/O, 富/ORG-B, 士/ORG-I, 山/ORG-I.
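  • The sketch below runs this check end to end with character-level TF-IDF vectors and cosine similarity; scikit-learn, the best-match selection, and the 0.5 threshold are assumptions of this example, and the patent equally allows Hamming distance or SimHash.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tagged_words = ["日本的", "士山"]    # labeled words from the character labeling model
seg_words = ["日本", "的", "富士山"]  # segmented words from the word segmentation model

# Character-level TF-IDF vectors over both word lists.
vec = TfidfVectorizer(analyzer="char")
matrix = vec.fit_transform(tagged_words + seg_words)
sim = cosine_similarity(matrix[: len(tagged_words)], matrix[len(tagged_words):])

SIMILARITY_THRESHOLD = 0.5  # assumed value; the patent decays this threshold over training
fused = []
for i, word in enumerate(tagged_words):
    j = sim[i].argmax()     # best-matching segmented word for this labeled word
    fused.append(seg_words[j] if sim[i, j] > SIMILARITY_THRESHOLD else word)

print(fused)  # expected: ['日本', '富士山'], which is then re-tagged at character level
```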
  • In some embodiments, the fusion labeling result can be used as a training sample for reinforced training of the character labeling model. Accordingly, the processing after step 103 may be as follows: the text labeling device trains the character labeling model based on the fusion labeling result and the training sample set.
  • In implementation, to obtain a large number of training samples to optimize the character labeling model while keeping manual effort to a minimum, the fusion labeling results are used as training samples to train the character labeling model (which may be called iterative training), while the labeled texts in the training sample set are used to train it at the same time, enhancing the weight of correctly labeled words.
  • In some embodiments, as the number of training iterations of the character labeling model grows, the confidence threshold can be lowered appropriately. The corresponding processing may be as follows: the text labeling device updates the confidence threshold and the similarity threshold according to the number of training iterations of the character labeling model, following a preset decreasing function.
  • In implementation, as the character labeling model is trained repeatedly, on the one hand its character labeling results become more credible, so the confidence threshold can be lowered to let the word segmentation model check more character labeling results; on the other hand, the word segmentation model encounters fewer and fewer new words and its segmentation results become more credible, so the similarity threshold can be lowered to prevent correctly labeled fusion labeling results from failing to be recalled because the threshold is too high. The text labeling device therefore updates both thresholds according to the training count, following the preset decreasing function.
  • It is worth mentioning that the decreasing functions can be: Confidence threshold = a - 1×10⁻⁴ × time_step, and Similarity threshold = b - 1×10⁻⁴ × time_step. The constant a is the maximum value of the confidence threshold, with value range (0, 1); the constant b is the maximum value of the similarity threshold, with value range (0, 1); time_step is the training step of the character annotation model, and the more iterative training rounds the model has had, the larger its value. It can be understood that technicians can set the constants a and b based on experience; this application places no limit on them.
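  • A direct transcription of these two decreasing functions (the floor at zero is an added safeguard, not stated in the patent):

```python
def confidence_threshold(a, time_step):
    """Confidence threshold = a - 1e-4 * time_step, with a in (0, 1)."""
    return max(0.0, a - 1e-4 * time_step)

def similarity_threshold(b, time_step):
    """Similarity threshold = b - 1e-4 * time_step, with b in (0, 1)."""
    return max(0.0, b - 1e-4 * time_step)

print(confidence_threshold(0.9, 1000))  # 0.8
print(similarity_threshold(0.8, 1000))  # approximately 0.7
```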
  • In some embodiments, the total number of training samples used per training round can be kept basically constant. The specific processing by which the text labeling device trains the character labeling model based on the fusion labeling result and the training sample set may be as follows: the device adds the fusion labeling result to the fusion label set; it extracts a preset number of labeled texts from the fusion label set and the training sample set to generate a new training sample set; and it trains the character labeling model with the new training sample set.
  • In implementation, the text labeling device usually needs several rounds of iterative training to obtain a character labeling model that performs well enough to label more texts accurately. After the labeling and segmentation of a text succeed, the device generates a fusion labeling result and adds it to the fusion label set. It then extracts a preset number of labeled texts from the fusion label set and the original training sample set to form a new training sample set, and trains the character labeling model on it, optimizing the model parameters. For example, with the total number of training samples held at roughly 1000, the device can randomly draw 600 labeled texts from the fusion label set and 400 from the original training sample set, merge them into a new 1000-sample training set, and retrain the character labeling model on it. That is, while keeping the total roughly constant, the device can randomly sample labeled texts from the fusion label set and the training sample set at a fixed ratio (for example, 3:2) to form the new training sample set, as sketched below.
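  • A minimal sketch of this resampling step, assuming the two pools are plain Python lists (random.sample draws without replacement):

```python
import random

def build_training_set(fusion_set, base_set, n_fusion=600, n_base=400):
    """Draw a fixed-ratio random mix (here 3:2) of fusion labeling results and
    original labeled texts, keeping the total number of samples roughly constant."""
    picked = random.sample(fusion_set, min(n_fusion, len(fusion_set))) \
           + random.sample(base_set, min(n_base, len(base_set)))
    random.shuffle(picked)
    return picked
```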
  • It is worth mentioning that under the supervision of the word segmentation model, the performance of the character labeling model improves to a certain extent and quickly approaches or reaches that of the word segmentation model, so the number of fusion labeling results in the fusion label set keeps growing as iterative training proceeds. It can be understood that when that number stops changing, the most recent training round probably did not improve the model; the character labeling model can be considered to have reached its best performance, and the text labeling device can suspend iterative training.
  • In some embodiments, the texts of a labeling task may contain new words the word segmentation model cannot recognize; the text labeling device can add texts containing new words that the character labeling model has already labeled to the new training sample set to improve recall. Accordingly, the processing before the device trains the character labeling model with the new training sample set may be as follows: if the word segmentation model fails to segment a text, the device adds the character labeling result to the recycled label set; it then extracts a preset number of character labeling results from the recycled label set and adds them to the new training sample set.
  • In implementation, as society develops, words that are commonly accepted (for example, Chinese and foreign personal names, place names, organization names, abbreviations, and derived words) but not included in the segmentation dictionary used by the word segmentation model (which may be called new words) keep emerging. For a text containing new words, the text labeling device can still generate a character labeling result with the character labeling model, but the word segmentation model cannot recognize the new words, cannot produce a word segmentation result to supervise the character labeling result, and therefore cannot produce a correctly labeled fusion labeling result.
  • In this case, the text labeling device adds the texts that the word segmentation model cannot accurately recognize but the character labeling model has labeled correctly to the recycled label set, randomly draws a preset number of character labeling results from that set into the new training sample set, and trains the character labeling model again, which improves its recall. It can be understood that if some character labeling results in the recycled label set are wrong, random sampling prevents a large number of wrong results from flowing in; judging from the labeled texts the model has already learned, a wrongly labeled character labeling result has a small probability of appearing again, so its effect on model performance is minor, and over repeated iterative training the weight of wrongly labeled results is weakened further until it is negligible. The feedback loop is sketched below.
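  • A sketch of how the recycled set feeds back into training; the segment callable is a placeholder for whatever word segmentation model is in use, and the names and the default of 50 are hypothetical:

```python
import random

recycled_set = []

def route_result(text, char_result, segment):
    """Keep the character labeling result for recycling when segmentation fails."""
    seg_words = segment(text)  # placeholder segmenter; assumed to return [] on failure
    if not seg_words:
        recycled_set.append(char_result)
        return None
    return seg_words

def top_up_training_set(new_training_set, n_recycled=50):
    """Randomly draw a preset number of recycled results into the new training set."""
    picked = random.sample(recycled_set, min(n_recycled, len(recycled_set)))
    new_training_set.extend(picked)
```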
  • In the embodiments of this application, iteratively training the character labeling model through teacher supervision (Teacher Forcing) brings the following beneficial effects. First, the word segmentation model checks and corrects the character labeling results of the character labeling model, improving the accuracy and reliability with which it labels texts. Second, the fusion labeling results finally obtained are used as training samples to train the character labeling model, which then labels the remaining texts; this optimizes the model parameters required by the text labeling task and makes the character labeling results more credible. Third, when the character labeling model is applied to a new domain, the text labeling device can quickly check and correct the character labeling results through the teacher supervision algorithm and use the fusion labeling results to reinforce training, improving the model's accuracy. Fourth, adding texts that contain new words the word segmentation model cannot recognize to the training sample set enhances the model's generalization, avoids the solidification of word-boundary labeling, and thereby improves its recall.
  • Based on the same technical concept, an embodiment of the present application also provides a text tagging device based on teacher supervision. As shown in Figure 3, the device includes:
  • a character labeling module, configured to label the text to be labeled using the character labeling model and generate a character labeling result containing labeled words;
  • a word segmentation module, configured to perform word segmentation processing on the text to be labeled through a preset word segmentation model and generate a word segmentation result containing segmented words;
  • a fusion labeling module, configured to re-label the character labeling result based on the segmented words according to the similarity between each labeled word and each segmented word, and to obtain and output the fusion labeling result.
  • The character labeling module is further configured to: train the initial character labeling model with the labeled texts in the training sample set to generate the character labeling model; train the character labeling model based on the fusion labeling result and the training sample set; extract a preset number of labeled texts from the fusion label set and the training sample set to generate a new training sample set; and extract a preset number of character labeling results from the recycled label set and add them to the new training sample set.
  • The word segmentation module is further configured to add the character labeling result to the recycled label set if the word segmentation model fails to segment the text to be labeled, and is specifically configured to segment the text through the preset word segmentation model only when the average confidence of the character labeling result exceeds the confidence threshold.
  • The fusion labeling module is further configured to: add the fusion labeling result to the fusion label set; permute and combine each labeled word with each segmented word to obtain related word pairs, compute the similarity of all related word pairs, and replace labeled words with the segmented words of the pairs whose similarity exceeds the similarity threshold; re-label the replaced character labeling result to obtain the fusion labeling result; and update the confidence threshold and the similarity threshold according to the number of training iterations of the character labeling model, following the preset decreasing function.
  • Iteratively training the character labeling model under teacher supervision in this device brings the same beneficial effects listed above for the method: more accurate and reliable labeling, more credible results through retraining on fusion labeling results, fast checking and correction when the model is applied to a new domain, and better generalization and recall through recycled texts containing new words.
  • It should be noted that when the text labeling device based on teacher supervision provided by the above embodiment performs text labeling, the division into the functional modules above is only an example. In practical applications, the functions can be assigned to different functional modules as needed; that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. In addition, the device embodiment belongs to the same concept as the embodiment of the text labeling method based on teacher supervision; for the specific implementation process, refer to the method embodiment, which is not repeated here.
  • Figure 4 is a schematic structural diagram of a text labeling device based on teacher supervision provided by an embodiment of the present application. The text annotation device 400 based on teacher supervision may vary considerably by configuration or performance, and may include one or more central processing units 422 (for example, one or more processors), memory 432, and one or more storage media 430 (for example, one or more mass storage devices) storing application programs 442 or data 444. The memory 432 and the storage medium 430 may be transient or persistent storage. The program stored in the storage medium 430 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the text annotation device 400. The central processing unit 422 may be configured to communicate with the storage medium 430 and execute, on the device 400, the series of instruction operations in the storage medium 430.
  • The text annotation device 400 based on teacher supervision may also include one or more power supplies 429, one or more wired or wireless network interfaces 450, one or more input/output interfaces 458, one or more keyboards 456, and/or one or more operating systems 441, such as Windows Server, Mac OS X, Unix, Linux, or FreeBSD.
  • The text annotation device 400 based on teacher supervision may include a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors; the one or more programs contain the instructions for performing the teacher-supervised text labeling described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

Some embodiments of this application provide a text labeling method and device based on teacher supervision, belonging to the technical field of natural language processing. The method includes: labeling a text to be labeled using a character labeling model, and generating a character labeling result containing labeled words (101); performing word segmentation processing on the text to be labeled through a preset word segmentation model, and generating a word segmentation result containing segmented words (102); and, according to the similarity between each labeled word and each segmented word, re-labeling the character labeling result based on the segmented words to obtain and output a fusion labeling result (103). This application can improve the accuracy and recall of text labeling.

Description

Text labeling method and device based on teacher supervision
Cross-reference
This application claims priority to Chinese patent application No. 201910342499.7, filed on April 26, 2019 and entitled "Text labeling method and device based on teacher supervision", which is incorporated into this application by reference in its entirety.
Technical field
This application relates to the technical field of natural language processing, and in particular to a text labeling method and device based on teacher supervision.
Background
Natural Language Processing (NLP) technology can efficiently perform systematic analysis, understanding, and information extraction on text data, enabling computers to understand and generate natural language, and thereby allowing humans and computers to interact effectively in natural language (for example, in applications such as automatic message replies and voice assistants). Text annotation technology provides a foundation for the industrial application of natural language processing.
Traditional Machine Learning (ML) can learn from a certain amount of text data and, with the help of keywords (seed words), mine the association features between texts to obtain a traditional machine learning model, which is then used to classify and label other texts automatically. Most traditional machine learning models are highly dependent on the text: they mainly attend to lexical and syntactic features while ignoring semantic features, which hinders performance improvement, and most of them generalize poorly. The prior art therefore adopts Deep Learning (DL), which generalizes better, using neural networks to mine the lexical, syntactic, and semantic features of text, training a deep learning model through continual iteration, and using that model to label text automatically.
In the process of implementing this application, the inventors found that the prior art has at least the following problems:
Because Chinese vocabulary is rich and varied, a computer can hardly cover all the words obtainable by permuting and combining Chinese characters. To improve the generalization of deep learning models and prevent overfitting, text annotation technology for Chinese usually labels the text to be annotated with a deep learning model based on character granularity. With the continual development of natural language processing technology, existing character-granularity deep learning models are no longer sufficient to meet its ever-increasing accuracy requirements for text annotation. Moreover, when a well-trained deep learning model is applied to a new domain, its recall is insufficient or even zero, so the deep learning model generalizes poorly and word-boundary labeling easily solidifies.
Summary
The purpose of some embodiments of this application is to provide a text labeling method and device based on teacher supervision. The technical solution is as follows:
In a first aspect, a text labeling method based on teacher supervision is provided. The method includes:
labeling a text to be labeled using a character labeling model, generating a character labeling result containing labeled words;
performing word segmentation processing on the text to be labeled through a preset word segmentation model, generating a word segmentation result containing segmented words;
according to the similarity between each labeled word and each segmented word, re-labeling the character labeling result based on the segmented words to obtain and output a fusion labeling result.
For example, before labeling the text to be labeled using the character labeling model and generating a character labeling result containing labeled words, the method further includes:
training an initial character labeling model with the labeled texts in a training sample set to generate the character labeling model.
For example, after re-labeling the character labeling result based on the segmented words according to the similarity between each labeled word and each segmented word to obtain the fusion labeling result, the method further includes:
training the character labeling model based on the fusion labeling result and the training sample set.
For example, training the character labeling model based on the fusion labeling result and the training sample set includes:
adding the fusion labeling result to a fusion label set;
extracting a preset number of labeled texts from the fusion label set and the training sample set to generate a new training sample set;
training the character labeling model with the new training sample set.
For example, before training the character labeling model with the new training sample set, the method further includes:
if the word segmentation model fails to segment the text to be labeled, adding the character labeling result to a recycled label set;
extracting a preset number of character labeling results from the recycled label set and adding them to the new training sample set.
For example, performing word segmentation processing on the text to be labeled through the preset word segmentation model to generate a word segmentation result containing segmented words includes:
if the average confidence of the character labeling result exceeds a confidence threshold, performing word segmentation processing on the text to be labeled through the preset word segmentation model to generate a word segmentation result containing segmented words.
For example, re-labeling the character labeling result based on the segmented words according to the similarity between each labeled word and each segmented word to obtain the fusion labeling result includes:
permuting and combining each labeled word in the character labeling result with each segmented word in the word segmentation result to obtain related word pairs;
computing the similarity of all the related word pairs, and replacing labeled words with the segmented words of the related word pairs whose similarity exceeds a similarity threshold;
re-labeling the replaced character labeling result to obtain the fusion labeling result.
For example, the method further includes:
updating the confidence threshold and the similarity threshold according to the number of training iterations of the character labeling model, following a preset decreasing function.
In a second aspect, a text labeling device based on teacher supervision is provided. The device includes:
a character labeling module, configured to label a text to be labeled using a character labeling model and generate a character labeling result containing labeled words;
a word segmentation module, configured to perform word segmentation processing on the text to be labeled through a preset word segmentation model and generate a word segmentation result containing segmented words;
a fusion labeling module, configured to re-label the character labeling result based on the segmented words according to the similarity between each labeled word and each segmented word, and to obtain and output the fusion labeling result.
For example, the character labeling module is further configured to:
train an initial character labeling model with the labeled texts in a training sample set to generate the character labeling model.
For example, the character labeling module is further configured to:
train the character labeling model based on the fusion labeling result and the training sample set.
For example, the fusion labeling module is further configured to:
add the fusion labeling result to a fusion label set;
and the character labeling module is further configured to:
extract a preset number of labeled texts from the fusion label set and the training sample set to generate a new training sample set;
train the character labeling model with the new training sample set.
For example, the word segmentation module is further configured to:
add the character labeling result to a recycled label set if the word segmentation model fails to segment the text to be labeled;
and the character labeling module is further configured to:
extract a preset number of character labeling results from the recycled label set and add them to the new training sample set.
For example, the word segmentation module is specifically configured to:
perform word segmentation processing on the text to be labeled through the preset word segmentation model and generate a word segmentation result containing segmented words if the average confidence of the character labeling result exceeds a confidence threshold.
For example, the fusion labeling module is specifically configured to:
permute and combine each labeled word in the character labeling result with each segmented word in the word segmentation result to obtain related word pairs;
compute the similarity of all the related word pairs, and replace labeled words with the segmented words of the related word pairs whose similarity exceeds a similarity threshold;
re-label the replaced character labeling result to obtain the fusion labeling result.
For example, the fusion labeling module is further configured to:
update the confidence threshold and the similarity threshold according to the number of training iterations of the character labeling model, following a preset decreasing function.
In a third aspect, a text labeling device based on teacher supervision is provided. The device includes a processor and a memory; the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the text labeling method based on teacher supervision described in the first aspect.
In a fourth aspect, a computer-readable storage medium is provided. The storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the text labeling method based on teacher supervision described in the first aspect.
The beneficial effects brought by the technical solutions provided by the embodiments of this application are as follows:
First, the word segmentation model is used to check and correct the character labeling results of the character labeling model, which improves the accuracy and reliability with which the character labeling model labels the text to be labeled. Second, the fusion labeling results finally obtained are used as training samples to train the character labeling model, which then labels the remaining texts to be labeled; this optimizes the model parameters required by the text labeling task and makes the character labeling results more credible. Third, when the character labeling model is applied to a new domain, the text labeling device can quickly check and correct the character labeling results through the teacher supervision algorithm and use the fusion labeling results to reinforce the training of the character labeling model, improving its accuracy. Fourth, adding texts to be labeled that contain new words the word segmentation model cannot recognize to the training sample set enhances the generalization of the character labeling model, avoids the solidification of word-boundary labeling, and thereby improves the recall of the character labeling model.
Brief description of the drawings
To explain the technical solutions in the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of this application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is a flowchart of a text labeling method based on teacher supervision provided by an embodiment of this application;
Figure 2 is a logical schematic diagram of a text labeling method based on teacher supervision provided by an embodiment of this application;
Figure 3 is a schematic diagram of the functional modules of a text labeling device based on teacher supervision provided by an embodiment of this application;
Figure 4 is a schematic structural diagram of a text labeling device based on teacher supervision provided by an embodiment of this application.
Detailed description
To make the purpose, technical solutions, and advantages of this application clearer, the embodiments of this application are described in detail below with reference to the drawings.
An embodiment of this application provides a text labeling method based on teacher supervision. The method may be executed by a text labeling device. The text labeling device may use a deep learning model based on character granularity (which may be called the character labeling model) to label the large number of texts to be labeled in a text labeling task, then use a language model based on word granularity (which may be called the word segmentation model) to perform word segmentation (which may be called segmentation processing) on the same texts to be labeled, and then use the result of word segmentation (which may be called the word segmentation result) to check and correct the preliminary labeling result (which may be called the character labeling result), taking the fusion labeling result as the final labeling result of the text to be labeled. The text labeling device may include a processor and a memory: the processor may be used to perform the text labeling processing in the following procedure, and the memory may be used to store the data needed and produced in the following processing.
The text labeling method based on teacher supervision provided by the embodiments of this application is explained in detail below with reference to specific embodiments. For ease of understanding, refer to Figures 1 and 2 together: Figure 1 is a flowchart of the method provided by an embodiment of this application; Figure 2 shows its implementation logic, where the serial numbers 1 to 11 represent the logical order of the steps the text labeling device performs while executing the text labeling method.
Step 101: The text labeling device labels the text to be labeled using the character labeling model and generates a character labeling result containing labeled words.
In implementation, for text labeling tasks whose texts are in a language without obvious boundaries between words (for example, Chinese), a text to be labeled usually contains one or more named words, and a word may be a single character or consist of two or more characters. The text labeling device can therefore use the character labeling model to predict the label of each character in each text to be labeled, recognize the one or more words each text contains, and generate a character labeling result containing one or more labeled words. Take a Named Entity Recognition (NER) model as the character labeling model, and suppose all labeled and unlabeled texts contain two classes of named entities, place names and organizations; correspondingly, each character in all labeled and unlabeled texts corresponds to one of the following five labels: LOC-B (first character of a place name), LOC-I (non-first character of a place name), ORG-B (first character of an organization name), ORG-I (non-first character of an organization name), O (not a named entity). For the text to be labeled "日本的富士山", the preliminary labeling of each character by the named entity recognition model is: 日/LOC-B, 本/LOC-I, 的/O, 富/ORG-B, 士/ORG-I, 山/ORG-I. From this preliminary result, the text labeling device can generate a character labeling result containing the two labeled words "日本" and "富士山". The above labels are preset by technicians, and different text labeling tasks may have different labels.
It is worth mentioning that some texts to be labeled may mix in a small amount of other languages besides the main one. For example, a mostly-Chinese text to be labeled may contain the bilingual named entity "IP地址"; in this case, the character labeling model can label the English in the text at English word granularity. The text labeling device can label the English word "IP" as a first unit and label both "地" and "址" as non-first units.
For example, before using the character labeling model to label texts, the text labeling device may first train the initial character labeling model with a preset amount of labeled text. Accordingly, the processing before step 101 may be as follows: the text labeling device trains the initial character labeling model with the labeled texts in the training sample set to generate the character labeling model.
In implementation, before the text labeling device labels texts with the character labeling model, technicians can manually label a small number of the task's texts in advance, obtaining a training sample set containing multiple labeled texts. The text labeling device trains the initial character labeling model with the manually labeled texts in the training sample set and generates the character labeling model. It can be understood that the features of the texts to be labeled differ somewhat between text labeling tasks; therefore, for different text labeling tasks, the model parameters the character labeling model needs to predict the label of each character in each text will also differ. For a given text labeling task, the text labeling device needs to train the initial character labeling model with that task's training sample set, so as to obtain the model parameters the task requires and initially generate a character labeling model suited to the task.
Step 102: The text labeling device performs word segmentation processing on the text to be labeled through a preset word segmentation model and generates a word segmentation result containing segmented words.
In implementation, technicians can choose a word-granularity language model with the same language representation characteristics as the character labeling model (for example, a Chinese Segmentation System, a deep contextualized word vector model (Embedding from Language Model, ELMo), or a Knowledge Graph), and let the text labeling device make fine adjustments to the pre-trained language model in advance through transfer learning (for example, retraining the pre-trained language model on the manually labeled texts in the training sample set) to obtain a language model (the word segmentation model) suited to the current text labeling task, without training a word segmentation model from scratch, which reduces model training time. The text labeling device can segment the text to be labeled through this word segmentation model and generate a word segmentation result containing segmented words. Taking a Chinese word segmentation system as the word segmentation model, the text labeling device can segment the text "日本的富士山" and generate a word segmentation result containing the three segmented words "日本", "的", and "富士山".
For example, the labeled words in the character labeling result generated by the character labeling model may be labeled wrongly, so a confidence threshold can be preset to assess whether the character labeling result is credible. Accordingly, the specific processing of step 102 may be as follows: if the average confidence of the character labeling result exceeds the confidence threshold, the text labeling device performs word segmentation processing on the text to be labeled through the preset word segmentation model and generates a word segmentation result containing segmented words.
In implementation, when the text labeling device labels a text with the character labeling model, it can compute the confidence of each character's preliminary labeling result and compute the mean of the confidences of all the characters of the text, obtaining the average confidence of the text's character labeling result. When the average confidence of the character labeling result exceeds the confidence threshold, the character labeling result is credible to a certain extent; the text labeling device can then segment the text through the preset word segmentation model, use the word segmentation result to check whether the character labeling result is labeled correctly, and correct the wrongly labeled words in the character labeling result. It can be understood that when the average confidence of the character labeling result does not reach the confidence threshold, the character labeling result is to some extent not credible and the character labeling model has failed on this text; in this case, the text corresponding to this character labeling result can be discarded. The confidence of each character's preliminary labeling result can be computed as follows: the text labeling device uses the LSTM (Long Short-Term Memory) layer of the named entity recognition model to first compute the score of each character of the text for each preset label, and then, from the per-character label scores, uses the CRF (Conditional Random Fields) layer of the named entity recognition model to generate the character labeling result and the confidence of each character's preliminary labeling result. The confidence is an output of the CRF layer; the specific computation is not detailed in this application.
Step 103: According to the similarity between each labeled word and each segmented word, the text labeling device re-labels the character labeling result based on the segmented words, obtains the fusion labeling result, and outputs it.
In implementation, the text labeling device can use the word segmentation result generated by the word segmentation model to check whether the character labeling result generated by the character labeling model is labeled correctly. Specifically, the text labeling device can compute the similarity between the character labeling result and the word segmentation result with mainly statistics-based machine learning algorithms (for example, TF-IDF (Term Frequency-Inverse Document Frequency) combined with cosine similarity, Hamming distance, or SimHash). The larger the similarity between a labeled word and a segmented word, the closer the attributes and functions of the two; therefore, when the similarity reaches a certain standard, the text labeling device can re-label the character labeling result at character granularity based on the segmented words, obtain the fusion labeling result, and output the fusion labeling result as the labeling result.
For example, for the same text to be labeled, the text labeling device can separately compute the similarities between all labeled words in the character labeling result and all segmented words in the word segmentation result. Accordingly, the specific processing of step 103 may be as follows: the text labeling device permutes and combines each labeled word in the character labeling result with each segmented word in the word segmentation result to obtain related word pairs; the text labeling device computes the similarity of all related word pairs and replaces labeled words with the segmented words of the related word pairs whose similarity exceeds the similarity threshold; the text labeling device re-labels the replaced character labeling result to obtain the fusion labeling result.
In implementation, taking the text "日本的富士山" as an example, the preliminary labeling of "日本的富士山" by the named entity recognition model might be: 日/LOC-B, 本/LOC-I, 的/LOC-I, 富/O, 士/ORG-B, 山/ORG-I. From this preliminary result, the character labeling result generated by the text labeling device contains "日本的" and "士山", while the word segmentation result generated by the Chinese word segmentation system is "日本", "的", and "富士山". The related word pairs obtained by permuting and combining the character labeling result and the word segmentation result are: (日本的, 日本), (日本的, 的), (日本的, 富士山), (士山, 日本), (士山, 的), and (士山, 富士山). The text labeling device can then use the mainly statistics-based machine learning algorithms to compute that the related word pairs whose similarity exceeds the similarity threshold are (日本的, 日本) and (士山, 富士山), and it replaces the corresponding labeled words "日本的" and "士山" in the character labeling result with the segmented words "日本" and "富士山", respectively. Because segmented words carry no character-granularity labels, the text labeling device can re-label the replaced character labeling result and obtain the fusion labeling result: 日/LOC-B, 本/LOC-I, 的/O, 富/ORG-B, 士/ORG-I, 山/ORG-I.
For example, the fusion labeling result can be used as a training sample for reinforced training of the character labeling model. Accordingly, the processing after step 103 may be as follows: the text labeling device trains the character labeling model based on the fusion labeling result and the training sample set.
In implementation, to obtain a large number of training samples to optimize the performance of the character labeling model while minimizing manual effort, the fusion labeling results can be used as training samples to train the character labeling model (which may be called iterative training), while the labeled texts in the training sample set are used to train the character labeling model at the same time, enhancing the weight of correctly labeled words.
For example, as the number of training rounds of the character labeling model increases, the confidence threshold can be lowered appropriately. The corresponding processing may be as follows: the text labeling device updates the confidence threshold and the similarity threshold according to the number of training rounds of the character labeling model, following a preset decreasing function.
In implementation, as the character labeling model is trained repeatedly, on the one hand its character labeling results become more credible, so the confidence threshold can be lowered to let the text labeling device use the word segmentation model to check more character labeling results; on the other hand, the new words encountered by the word segmentation model keep decreasing and its segmentation results become more credible, so the similarity threshold can be lowered to prevent correctly labeled fusion labeling results from failing to be recalled because the similarity threshold is too high. The text labeling device can therefore update the confidence threshold and the similarity threshold according to the training count of the character labeling model, following the preset decreasing function.
It is worth mentioning that the decreasing function for the confidence threshold can be: Confidence threshold = a - 1×10⁻⁴ × time_step; the decreasing function for the similarity threshold can be: Similarity threshold = b - 1×10⁻⁴ × time_step. The constant a is the maximum value of the confidence threshold, with value range (0, 1); the constant b is the maximum value of the similarity threshold, with value range (0, 1); time_step is the training step of the character labeling model, and the more iterative training rounds the character labeling model has had, the larger its value. It can be understood that technicians can set the constants a and b based on experience; this application places no limit on them.
For example, the total number of training samples used in training the character labeling model can be kept basically the same. The specific processing by which the text labeling device trains the character labeling model based on the fusion labeling result and the training sample set can be as follows: the text labeling device adds the fusion labeling result to the fusion label set; the text labeling device extracts a preset number of labeled texts from the fusion label set and the training sample set to generate a new training sample set; the text labeling device trains the character labeling model with the new training sample set.
In implementation, the text labeling device usually needs several rounds of iterative training of the character labeling model to obtain one that performs well, so as to label more texts accurately. After successfully completing the labeling and segmentation of a text, the text labeling device can generate a fusion labeling result and add it to the fusion label set. Then the text labeling device can extract a preset number of labeled texts from the fusion label set and the original training sample set to form a new training sample set, and train the character labeling model with the new training sample set, optimizing the model parameters of the character labeling model. For example, with the total number of training samples held at roughly 1000, the text labeling device can randomly draw 600 labeled texts from the fusion label set and 400 from the original training sample set, merge them into a new training sample set of 1000 training samples in total, and then retrain the character labeling model with the new training sample set. It can be understood that the text labeling device can, while keeping the total number of training samples basically unchanged, randomly draw labeled texts from the fusion label set and the training sample set at a certain ratio (for example, 3:2) to form the new training sample set.
It is worth mentioning that the performance of the character labeling model can improve to a certain extent under the supervision of the word segmentation model and quickly approach or reach the performance of the word segmentation model; therefore, the number of fusion labeling results in the fusion label set can keep growing as the iterative training of the character labeling model proceeds. It can be understood that when the number of fusion labeling results in the fusion label set no longer changes, the performance of the character labeling model may not have been improved by the latest round of iterative training; the character labeling model can be considered to have reached its best performance, and the text labeling device can suspend its iterative training.
For example, the texts of a text labeling task may contain new words the word segmentation model cannot recognize; the text labeling device can add texts containing new words that have been labeled by the character labeling model to the new training sample set to improve recall. Accordingly, the processing before the text labeling device trains the character labeling model with the new training sample set can also be as follows: if the word segmentation model fails to segment the text to be labeled, the text labeling device adds the character labeling result to the recycled label set; the text labeling device extracts a preset number of character labeling results from the recycled label set and adds them to the new training sample set.
In implementation, as society develops, words that are generally accepted (for example, Chinese and foreign personal names, place names, organization names, abbreviations, and derived words) but not included in the segmentation dictionary used by the word segmentation model (which may be called new words) keep emerging. For a text containing new words, the text labeling device can label it through the character labeling model and generate a character labeling result, but the word segmentation model cannot recognize new words not included in the segmentation dictionary; it therefore cannot generate a word segmentation result to supervise the character labeling result generated by the character labeling model, and cannot generate a correctly labeled fusion labeling result. In this case, the text labeling device can add the texts that the word segmentation model cannot accurately recognize but the character labeling model has labeled correctly to the recycled label set, and randomly draw a preset number of character labeling results from the recycled label set into the new training sample set to train the character labeling model again, improving the recall of the character labeling model. It can be understood that if some character labeling results in the recycled label set are wrong, random sampling avoids a large inflow of wrong character labeling results; judging from the labeled texts the character labeling model has already learned, a wrongly labeled character labeling result has a small probability of appearing again and little effect on the performance of the character labeling model, and over repeated iterative training the weight of wrongly labeled character labeling results used for training is weakened further, so the effect on performance is negligible.
It is worth mentioning that when the text labeling device has run only a few rounds of iterative training of the character labeling model, the performance of the character labeling model is unstable and the character labeling results in the recycled label set are more likely to be wrong. In this case, the preset number of character labeling results randomly drawn from the recycled label set can be checked and corrected manually, and the correctly labeled ones added to the new training sample set; on the one hand this prevents wrongly labeled character labeling results from affecting the performance of the character labeling model, and on the other hand it enhances the weight of correctly labeled words.
In the embodiments of this application, iteratively training the character labeling model through teacher supervision (Teacher Forcing) can bring the following beneficial effects. First, the word segmentation model is used to check and correct the character labeling results of the character labeling model, improving the accuracy and reliability with which the character labeling model labels texts. Second, the fusion labeling results finally obtained are used as training samples to train the character labeling model, which then labels the remaining texts to be labeled; this optimizes the model parameters required by the text labeling task and makes the character labeling results more credible. Third, when the character labeling model is applied to a new domain, the text labeling device can quickly check and correct the character labeling results through the teacher supervision algorithm and use the fusion labeling results to reinforce the training of the character labeling model, improving its accuracy. Fourth, adding texts containing new words the word segmentation model cannot recognize to the training sample set enhances the generalization of the character labeling model, avoids the solidification of word-boundary labeling, and thereby improves the recall of the character labeling model.
Based on the same technical concept, an embodiment of this application also provides a text labeling device based on teacher supervision. As shown in Figure 3, the device includes:
a character labeling module, configured to label a text to be labeled using a character labeling model and generate a character labeling result containing labeled words;
a word segmentation module, configured to perform word segmentation processing on the text to be labeled through a preset word segmentation model and generate a word segmentation result containing segmented words;
a fusion labeling module, configured to re-label the character labeling result based on the segmented words according to the similarity between each labeled word and each segmented word, and to obtain and output the fusion labeling result.
For example, the character labeling module is further configured to:
train an initial character labeling model with the labeled texts in a training sample set to generate the character labeling model.
For example, the character labeling module is further configured to:
train the character labeling model based on the fusion labeling result and the training sample set.
For example, the fusion labeling module is further configured to:
add the fusion labeling result to a fusion label set;
and the character labeling module is further configured to:
extract a preset number of labeled texts from the fusion label set and the training sample set to generate a new training sample set;
train the character labeling model with the new training sample set.
For example, the word segmentation module is further configured to:
add the character labeling result to a recycled label set if the word segmentation model fails to segment the text to be labeled;
and the character labeling module is further configured to:
extract a preset number of character labeling results from the recycled label set and add them to the new training sample set.
For example, the word segmentation module is specifically configured to:
perform word segmentation processing on the text to be labeled through the preset word segmentation model and generate a word segmentation result containing segmented words if the average confidence of the character labeling result exceeds a confidence threshold.
For example, the fusion labeling module is specifically configured to:
permute and combine each labeled word in the character labeling result with each segmented word in the word segmentation result to obtain related word pairs;
compute the similarity of all the related word pairs, and replace labeled words with the segmented words of the related word pairs whose similarity exceeds a similarity threshold;
re-label the replaced character labeling result to obtain the fusion labeling result.
For example, the fusion labeling module is further configured to:
update the confidence threshold and the similarity threshold according to the number of training iterations of the character labeling model, following a preset decreasing function.
In the embodiments of this application, iteratively training the character labeling model through teacher supervision (Teacher Forcing) can bring the beneficial effects already described for the method embodiment: the word segmentation model checks and corrects the character labeling results, improving labeling accuracy and reliability; the fusion labeling results retrain the character labeling model and make its results more credible; the teacher supervision algorithm enables fast checking and correction when the model is applied to a new domain; and recycling texts containing new words enhances generalization, avoids the solidification of word-boundary labeling, and improves recall.
It should be noted that when the text labeling device based on teacher supervision provided by the above embodiment performs text labeling, the division into the functional modules above is only an example; in practical applications, the functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. In addition, the text labeling device based on teacher supervision provided by the above embodiment belongs to the same concept as the embodiment of the text labeling method based on teacher supervision; for its specific implementation process, see the method embodiment, which is not repeated here.
Figure 4 is a schematic structural diagram of a text labeling device based on teacher supervision provided by an embodiment of this application. The text labeling device 400 based on teacher supervision may vary considerably by configuration or performance, and may include one or more central processing units 422 (for example, one or more processors), memory 432, and one or more storage media 430 (for example, one or more mass storage devices) storing application programs 442 or data 444. The memory 432 and the storage medium 430 may be transient storage or persistent storage. The program stored in the storage medium 430 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the text labeling device 400. For example, the central processing unit 422 may be configured to communicate with the storage medium 430 and execute, on the text labeling device 400 based on teacher supervision, the series of instruction operations in the storage medium 430.
The text labeling device 400 based on teacher supervision may also include one or more power supplies 429, one or more wired or wireless network interfaces 450, one or more input/output interfaces 458, one or more keyboards 456, and/or one or more operating systems 441, such as Windows Server, Mac OS X, Unix, Linux, or FreeBSD.
The text labeling device 400 based on teacher supervision may include a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors; the one or more programs contain the instructions for performing the text labeling based on teacher supervision described above.
Those of ordinary skill in the art can understand that all or part of the steps of the above embodiments can be implemented by hardware, or by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium, and the storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disc, or the like.
The above are only embodiments of this application and are not intended to limit this application. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of this application shall be included in the protection scope of this application.

Claims (15)

  1. A text labeling method based on teacher supervision, the method comprising:
    labeling a text to be labeled using a character labeling model, generating a character labeling result containing labeled words;
    performing word segmentation processing on the text to be labeled through a preset word segmentation model, generating a word segmentation result containing segmented words;
    according to the similarity between each labeled word and each segmented word, re-labeling the character labeling result based on the segmented words, and obtaining and outputting a fusion labeling result.
  2. The method according to claim 1, wherein before the labeling a text to be labeled using a character labeling model and generating a character labeling result containing labeled words, the method further comprises:
    training an initial character labeling model with the labeled texts in a training sample set to generate the character labeling model.
  3. The method according to claim 2, wherein after the re-labeling the character labeling result based on the segmented words according to the similarity between each labeled word and each segmented word to obtain the fusion labeling result, the method further comprises:
    training the character labeling model based on the fusion labeling result and the training sample set.
  4. The method according to claim 3, wherein the training the character labeling model based on the fusion labeling result and the training sample set comprises:
    adding the fusion labeling result to a fusion label set;
    extracting a preset number of labeled texts from the fusion label set and the training sample set to generate a new training sample set;
    training the character labeling model with the new training sample set.
  5. The method according to claim 4, wherein before the training the character labeling model with the new training sample set, the method further comprises:
    if the word segmentation model fails to segment the text to be labeled, adding the character labeling result to a recycled label set;
    extracting a preset number of the character labeling results from the recycled label set and adding them to the new training sample set.
  6. The method according to claim 1, wherein the performing word segmentation processing on the text to be labeled through a preset word segmentation model to generate a word segmentation result containing segmented words comprises:
    if the average confidence of the character labeling result exceeds a confidence threshold, performing word segmentation processing on the text to be labeled through the preset word segmentation model to generate a word segmentation result containing segmented words.
  7. The method according to claim 1, wherein the re-labeling the character labeling result based on the segmented words according to the similarity between each labeled word and each segmented word to obtain the fusion labeling result comprises:
    permuting and combining each labeled word in the character labeling result with each segmented word in the word segmentation result to obtain related word pairs;
    computing the similarity of all the related word pairs, and replacing labeled words with the segmented words of the related word pairs whose similarity exceeds a similarity threshold;
    re-labeling the replaced character labeling result to obtain the fusion labeling result.
  8. The method according to claim 6 or 7, wherein the method further comprises:
    updating the confidence threshold and the similarity threshold according to the number of training iterations of the character labeling model, following a preset decreasing function.
  9. A text labeling device based on teacher supervision, the device comprising:
    a character labeling module, configured to label a text to be labeled using a character labeling model and generate a character labeling result containing labeled words;
    a word segmentation module, configured to perform word segmentation processing on the text to be labeled through a preset word segmentation model and generate a word segmentation result containing segmented words;
    a fusion labeling module, configured to re-label the character labeling result based on the segmented words according to the similarity between each labeled word and each segmented word, and to obtain and output the fusion labeling result.
  10. The device according to claim 9, wherein the character labeling module is further configured to:
    train an initial character labeling model with the labeled texts in a training sample set to generate the character labeling model.
  11. The device according to claim 10, wherein
    the fusion labeling module is further configured to:
    add the fusion labeling result to a fusion label set;
    and the character labeling module is further configured to:
    extract a preset number of labeled texts from the fusion label set and the training sample set to generate a new training sample set;
    train the character labeling model with the new training sample set.
  12. The device according to claim 9, wherein the word segmentation module is specifically configured to:
    perform word segmentation processing on the text to be labeled through the preset word segmentation model and generate a word segmentation result containing segmented words if the average confidence of the character labeling result exceeds a confidence threshold.
  13. The device according to claim 9, wherein the fusion labeling module is specifically configured to:
    permute and combine each labeled word in the character labeling result with each segmented word in the word segmentation result to obtain related word pairs;
    compute the similarity of all the related word pairs, and replace labeled words with the segmented words of the related word pairs whose similarity exceeds a similarity threshold;
    re-label the replaced character labeling result to obtain the fusion labeling result.
  14. A text labeling device based on teacher supervision, the device comprising a processor and a memory, wherein the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the text labeling method based on teacher supervision according to any one of claims 1 to 8.
  15. A computer-readable storage medium, wherein the storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the text labeling method based on teacher supervision according to any one of claims 1 to 8.
PCT/CN2019/090336 2019-04-26 2019-06-06 Text labeling method and device based on teacher supervision WO2020215456A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP19886050.4A EP3751445A4 (en) 2019-04-26 2019-06-06 METHOD AND DEVICE FOR MARKING TEXT ON THE BASIS OF TEACHER FORCE
US16/888,591 US20200380209A1 (en) 2019-04-26 2020-05-29 Method and apparatus for tagging text based on teacher forcing

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910342499.7 2019-04-26
CN201910342499.7A CN110134949B (zh) 2019-04-26 2019-04-26 Text labeling method and device based on teacher supervision

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/888,591 Continuation US20200380209A1 (en) 2019-04-26 2020-05-29 Method and apparatus for tagging text based on teacher forcing

Publications (1)

Publication Number Publication Date
WO2020215456A1 true WO2020215456A1 (zh) 2020-10-29

Family

ID=67575137

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/090336 WO2020215456A1 (zh) 2019-04-26 2019-06-06 Text labeling method and device based on teacher supervision

Country Status (4)

Country Link
US (1) US20200380209A1 (zh)
EP (1) EP3751445A4 (zh)
CN (1) CN110134949B (zh)
WO (1) WO2020215456A1 (zh)


Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079406B (zh) * 2019-12-13 2022-01-11 华中科技大学 Natural language processing model training method, task execution method, device, and system
CN111783518A (zh) * 2020-05-14 2020-10-16 北京三快在线科技有限公司 Training sample generation method and apparatus, electronic device, and readable storage medium
CN111859951B (zh) * 2020-06-19 2024-03-26 北京百度网讯科技有限公司 Language model training method and apparatus, electronic device, and readable storage medium
CN111738024B (zh) * 2020-07-29 2023-10-27 腾讯科技(深圳)有限公司 Entity noun labeling method and apparatus, computing device, and readable storage medium
CN112667779B (zh) * 2020-12-30 2023-09-05 北京奇艺世纪科技有限公司 Information query method and apparatus, electronic device, and storage medium
CN113656579B (zh) * 2021-07-23 2024-01-26 北京亿欧网盟科技有限公司 Text classification method, apparatus, device, and medium


Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104679885B (zh) * 2015-03-17 2018-03-30 北京理工大学 Method for recognizing organization names in user search strings based on a semantic feature model
CN106372060B (zh) * 2016-08-31 2019-05-03 北京百度网讯科技有限公司 Method and apparatus for labeling search text
EP3642733A4 (en) * 2017-07-31 2020-07-22 Beijing Didi Infinity Technology and Development Co., Ltd. SYSTEM AND PROCESS FOR SEGMENTING A SENTENCE
CN109241520B (zh) * 2018-07-18 2023-05-23 五邑大学 Sentence trunk analysis method and system using a multi-layer error feedback neural network based on word segmentation and named entity recognition
CN109255119B (zh) * 2018-07-18 2023-04-25 五邑大学 Sentence trunk analysis method and system using a multi-task deep neural network based on word segmentation and named entity recognition
CN109190110B (zh) * 2018-08-02 2023-08-22 厦门快商通信息技术有限公司 Training method and system for a named entity recognition model, and electronic device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929870A (zh) * 2011-08-05 2013-02-13 北京百度网讯科技有限公司 Method for building a word segmentation model, word segmentation method, and apparatus therefor
CN103077164A (zh) * 2012-12-27 2013-05-01 新浪网技术(中国)有限公司 Text analysis method and text analyzer
CN106503192A (zh) * 2016-10-31 2017-03-15 北京百度网讯科技有限公司 Named entity recognition method and apparatus based on artificial intelligence
CN107330011A (zh) * 2017-06-14 2017-11-07 北京神州泰岳软件股份有限公司 Named entity recognition method and apparatus with multi-strategy fusion
JP2019021096A (ja) * 2017-07-19 2019-02-07 日本電信電話株式会社 Language analysis apparatus, method, and program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3751445A4 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114943229A (zh) * 2022-04-15 2022-08-26 西北工业大学 Software defect named entity recognition method based on multi-level feature fusion
CN114943229B (zh) * 2022-04-15 2024-03-12 西北工业大学 Software defect named entity recognition method based on multi-level feature fusion
CN115620722A (zh) * 2022-12-15 2023-01-17 广州小鹏汽车科技有限公司 Voice interaction method, server, and computer-readable storage medium
CN117422061A (zh) * 2023-12-19 2024-01-19 中南大学 Method and device for merging and labeling multiple segmentation results of text terms
CN117422061B (zh) * 2023-12-19 2024-03-08 中南大学 Method and device for merging and labeling multiple segmentation results of text terms

Also Published As

Publication number Publication date
US20200380209A1 (en) 2020-12-03
CN110134949A (zh) 2019-08-16
EP3751445A4 (en) 2021-03-10
EP3751445A1 (en) 2020-12-16
CN110134949B (zh) 2022-10-28


Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2019886050

Country of ref document: EP

Effective date: 20200527

NENP Non-entry into the national phase

Ref country code: DE