CN110287482B - Semi-automatic participle corpus labeling training device - Google Patents

Publication number
CN110287482B
Authority
CN
China
Prior art keywords
model
labeling
corpus
participle
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910455093.XA
Other languages
Chinese (zh)
Other versions
CN110287482A (en
Inventor
代翔
崔莹
黄细凤
孙涛
李强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Electronic Technology Institute No 10 Institute of Cetc
Original Assignee
Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority to CN201910455093.XA
Publication of CN110287482A
Application granted
Publication of CN110287482B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a semi-automatic word-segmentation corpus labeling training device that addresses shortcomings in corpus labeling and model training. It is realized by the following technical scheme. A text corpus labeling preparation module manages the corpora to be labeled and the word-segmentation corpora and, through multiple word-segmentation algorithms built on an integrated dictionary (bidirectional maximum matching, CRF, JIEBA and others), submits the raw-corpus labeling work to a semi-automatic corpus word-segmentation labeling module. That module creates a word-segmentation labeling task, selects a suitable algorithm model and carries out automatic labeling. Once the automatic labeling results have been fused, the training corpora and labeling models generated by the preparation module are fed back to a feedback-type model learning and training module, which selects and trains a model, calls a unified training-model interface to generate a core dictionary, and updates the word-segmentation training model table. A comprehensive evaluation model for labeling algorithms is established to evaluate the labeling effect of the model, and the new word-segmentation labeling task is completed.

Description

Semi-automatic participle corpus labeling training device
Technical Field
The invention relates to the technical field of text mining, and in particular to a semi-automatic labeling training device for word-segmentation corpora.
Background
Words are the smallest meaningful language units that can be used independently, but written Chinese has no explicit mark between words: a Chinese sentence is a continuous string of characters with no spaces separating the words. Chinese word analysis is therefore the basis and key of Chinese information processing. Chinese word segmentation is the first step of Chinese information processing and plays an extremely important role in many application fields (text word segmentation, event extraction, text summarization, information retrieval, etc.). Part-of-speech tagging is the process of determining an appropriate part of speech for each word in a sentence. It is the step immediately after word segmentation in the processing flow, and the algorithms it uses are similar in principle to those used for word segmentation. Because the accuracy of word segmentation and the accuracy of part-of-speech tagging are closely related, many systems integrate the two processes, which helps eliminate ambiguity and improves overall efficiency. Word segmentation and part-of-speech tagging are the basic processing of a corpus and are collectively called corpus word-segmentation tagging. However, labeled word-segmentation corpora are scarce. Improvements in word segmentation benefit the larger downstream task only indirectly, and in an actual system different segmentation errors have very different impacts. In addition, word-segmentation corpora are very expensive to obtain, and it is hard for annotators to label them proficiently and consistently according to a given standard. As a result, even now that large data volumes and large computing capacity are available, the scale of word-segmentation corpora remains quite limited.
At present, word-segmentation corpora in this field are relatively scarce, and labeling is done mainly by hand. Fully manual word-by-word labeling is tedious and very time-consuming, and it suffers from poor labeling quality, a complicated labeling process, low efficiency and high labor cost. Existing word-segmentation labeling tools also have shortcomings: they offer only a single labeling method and cannot automatically update the model behind it. A semi-automatic word-segmentation labeling and training platform that assists manual labeling is therefore urgently needed. A semi-automatic labeling method, and a device designed on the basis of that method, that can quickly and fully automatically provide a pre-labeling result for the corpus to be processed would be highly desirable.
In recent years, with the rapid development of large-scale data acquisition, extracting the maximum value from data has become particularly urgent, which places new demands on the intelligent analysis of big data. Against this background, machine learning and deep learning have developed rapidly and achieved great success in big-data applications, and the underlying model algorithms rely on large amounts of labeled corpora as basic training support. Labeling massive corpora has an important influence on the training of algorithm models. As basic work in the big-data analysis process, it mainly supports daily research and development, algorithm tuning, and demonstration and verification, and it is a core foundation of big-data mining and analysis. Chinese word segmentation is the segmentation of a sequence of Chinese characters into individual words, that is, the process of recombining a continuous character sequence into a word sequence according to a given specification. The key to word segmentation is the dictionary, which is never fully complete but is sufficient for general applications. The jieba tool offers three segmentation modes for Chinese text and can therefore adapt to different requirements.
Existing word segmentation algorithms can be divided into three major categories: methods based on character-string matching, methods based on understanding, and methods based on statistics. String-matching word segmentation, also called mechanical word segmentation, matches the Chinese character string to be analyzed against the entries of a sufficiently large machine dictionary according to a given strategy; if a string is found in the dictionary, the match succeeds and a word is recognized. The common strategies are:
1) Forward maximum matching (scanning from left to right)
2) Reverse maximum matching (scanning from right to left)
3) Minimum segmentation (minimizing the number of words cut out of each sentence)
4) Bidirectional maximum matching (scanning both from left to right and from right to left)
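The matching strategies above can be sketched as follows. This is a minimal illustration, not the device's implementation: the function names, the `max_len` window and the tie-breaking rule (prefer fewer words, then fewer single-character words, then the reverse result) are assumptions chosen because they are common heuristics.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def bidirectional_max_match(text, dictionary, max_len=4):
    """Run both scans and keep the result the heuristics prefer (assumed rule)."""
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd

    def singles(words):
        return sum(1 for w in words if len(w) == 1)

    return fwd if singles(fwd) < singles(bwd) else bwd


if __name__ == "__main__":
    vocab = {"研究", "研究生", "生命", "的", "起源"}
    print(bidirectional_max_match("研究生命的起源", vocab))
```

On the sample sentence the forward scan greedily takes 研究生 and strands 命 as a single character, while the reverse scan recovers 研究 / 生命; the tie-break on single-character words selects the latter.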
Understanding-based word segmentation: this method recognizes words by having the computer simulate human understanding of a sentence. The basic idea is to perform syntactic and semantic analysis while segmenting, and to resolve ambiguity using syntactic and semantic information. Such a system generally comprises three parts: a word segmentation subsystem, a syntactic-semantic subsystem, and a master control part. Under the coordination of the master control part, the word segmentation subsystem obtains syntactic and semantic information about the relevant words and sentences in order to resolve segmentation ambiguity; in other words, it simulates the process by which people understand sentences. This method requires a large amount of linguistic knowledge and information. Because Chinese linguistic knowledge is broad and complex, it is difficult to organize it into a form that a machine can read directly, so existing understanding-based word segmentation systems are still at an experimental stage.
Statistics-based word segmentation: given a large amount of text that has already been segmented, a statistical machine learning model learns the rules of word segmentation (this is called training) and is then used to segment unseen text. Examples are the maximum probability word segmentation method and the maximum entropy word segmentation method. With the establishment of large-scale corpora and the research and development of statistical machine learning methods, statistics-based Chinese word segmentation has gradually become the mainstream approach.
The main statistical models are the N-gram model, the Hidden Markov Model (HMM), the Maximum Entropy model (ME) and the Conditional Random Field model (CRF). Lexical analysis is an important basic technology of NLP and comprises word segmentation, part-of-speech tagging, entity recognition and so on; its main algorithmic structure is based on Bi-LSTM-CRF. The CRF is used to obtain a globally optimal output sequence, which amounts to reusing the information produced by the LSTM. In terms of network structure, Bi-LSTM-CRF places the LSTM inside the CRF framework: the LSTM output for the i-th tag at each time step t is treated as a point function of the CRF (a feature function that depends only on the current position), while the CRF itself supplies the edge function (a feature function that depends on the previous and current positions). The linear feature function of the original linear-chain CRF is thus replaced by the nonlinear LSTM output f1, which introduces nonlinearity into the CRF and allows a better fit to the data. Bi-LSTM, the bidirectional LSTM, captures context information better than a unidirectional LSTM. It is in fact two long short-term memory (LSTM) networks: the reverse LSTM first reverses the input sequence end to end, runs an ordinary LSTM over it, and then reverses the output once more so that each output corresponds to the same position as in the forward LSTM.
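The reverse, run, reverse-again step of a bidirectional network can be shown with a toy recurrence. The sketch below is only an illustration of the alignment trick: `run_recurrent` is a hypothetical stand-in for an LSTM (a decayed running sum), not an LSTM itself.

```python
def run_recurrent(inputs, decay=0.5):
    """Toy stand-in for an LSTM: each state mixes the previous state with the input."""
    state, outputs = 0.0, []
    for x in inputs:
        state = decay * state + x
        outputs.append(state)
    return outputs


def bidirectional(inputs, decay=0.5):
    """Pair a left-to-right pass with a right-to-left pass, aligned by position."""
    forward = run_recurrent(inputs, decay)
    # Reverse the input, run the same recurrence, then reverse the output
    # so that backward[i] describes the same position as forward[i].
    backward = run_recurrent(inputs[::-1], decay)[::-1]
    return list(zip(forward, backward))


if __name__ == "__main__":
    print(bidirectional([1.0, 2.0, 4.0]))
```

Each position ends up with one value summarizing everything to its left and one summarizing everything to its right, which is the context a tagger needs to decide word boundaries.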
Disclosure of Invention
The invention aims to overcome the defects of the prior art, in particular the shortcomings in how corpora are used during word-segmentation corpus labeling and training, and provides a semi-automatic word-segmentation corpus labeling training device.
The above object of the invention is achieved by a semi-automatic word-segmentation corpus labeling training device comprising a text corpus labeling preparation module, a semi-automatic corpus word-segmentation labeling module, a feedback-type model learning and training module, and a word-segmentation labeling model effect evaluation module, characterized in that: the text corpus labeling preparation module prepares for the labeling task; by distinguishing and selecting corpora from different sources, it performs single-method word-segmentation pre-labeling on the corpus data to be labeled according to source or topic, manages the corpora to be labeled and the word-segmentation corpora, and submits the raw-corpus labeling work to the semi-automatic corpus word-segmentation labeling module through multiple word-segmentation algorithms built on an integrated dictionary, such as bidirectional maximum matching, CRF, JIEBA and BI-LSTM; the semi-automatic corpus word-segmentation labeling module creates word-segmentation labeling tasks according to the labeling requirements and corpus characteristics, selects a suitable algorithm model, manages the labeling business rules and carries out automatic labeling, completes the automatic labeling of each class of labeling task using the business rules and one algorithm model selected from the integrated-dictionary algorithms (bidirectional maximum matching, CRF, JIEBA, BI-LSTM and the like), and fuses the automatic labeling result of the algorithm model with the automatic labeling result of the business rules; on the basis of the fused result, manual intervention and judgment are carried out according to the labeling business standard and the labeling results are stored; the training corpora and labeling models generated by the text corpus labeling preparation module are fed back to the feedback-type model learning and training module, which sets model parameters, selects model corpora and carries out model learning and training on existing models or on externally loaded deep enhancement models, and returns to model parameter setting after the model has been improved and updated; the unified training-model interface Train is called to generate a core dictionary and an N-gram core dictionary, an external algorithm model is then imported through the unified model access interface, the model is updated or exported, the word-segmentation model file comprising the core dictionary and the N-gram dictionary file is saved, the word-segmentation training model table is updated, and a comprehensive evaluation model for labeling algorithms is established to evaluate the labeling effect of the model; through continuous iteration between model updating and corpus labeling, the trained model replaces the model used for word-segmentation labeling in the platform, and the new word-segmentation labeling task is completed; the word-segmentation labeling model effect evaluation module constructs single-index algorithms according to the index standard, quantifies the indexes according to the index calculation rules, builds the comprehensive evaluation model by organizing the corresponding indexes for each labeling task, computes the comprehensive index value and feeds back the effect of the labeling model.
Compared with the prior art, the invention has the following beneficial effects:
The invention evaluates the labeling effect of the model by establishing a comprehensive evaluation model for labeling algorithms, and feeds the result back into word-segmentation model learning and training so that the model reaches its best effect and can be used for subsequent new labeling tasks; through continuous iteration between model updating and corpus labeling, both the quality of the word-segmentation labels and the effect of the algorithm model improve. For different labeling requirements and corpus characteristics, the system offers automatic labeling either with a user-selected suitable algorithm or with multi-algorithm fusion. Multi-algorithm fusion uses a voting method to combine the results of several algorithms; if the correlation between the methods is ignored, the ensemble outperforms any single method, and pre-labeling performed in this way reduces the complexity of manual labeling and lowers labor cost;
the invention manages the word-segmentation corpora by distinguishing data from different sources. A manual judgment step is introduced. By integrating algorithms such as dictionary-based bidirectional maximum matching, CRF word segmentation, CRF + Bi-LSTM word segmentation and JIEBA word segmentation, a suitable labeling algorithm can be selected during labeling for each kind of corpus, and the corpus data to be labeled is pre-labeled either with a single segmentation method or with a fusion of several methods, the fusion being performed by voting. Real-time automatic feedback adjustment of the background word-segmentation algorithm model is supported, which greatly improves the efficiency and accuracy of corpus labeling;
after the labeling task is finished, the word-segmentation model is retrained with the labeled corpus. The labeled corpus is confirmed and submitted through a manual confirmation step, which completes the word-segmentation labeling of the corpus;
by building a Bi-LSTM network the invention realizes sequence labeling and achieves word segmentation with an accuracy of about 95 percent; the labeled corpus is modified, confirmed and submitted through the manual confirmation step. The system provides a friendly human-computer interactive labeling interface and simplifies the user's labeling operations;
the invention provides a unified word-segmentation model access standard and supports the import, training and use of external models. It can be applied to various electronic devices.
Drawings
FIG. 1 is a schematic diagram of the operation principle of the semi-automatic segmentation corpus labeling and training device of the present invention.
FIG. 2 is a flow diagram of the segmentation model training process of FIG. 1.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the embodiments and the accompanying drawings.
Detailed Description
Refer to FIG. 1. In the preferred embodiment described below, a semi-automatic word-segmentation corpus labeling training device comprises a text corpus labeling preparation module, a semi-automatic corpus word-segmentation labeling module, a feedback-type model learning and training module, and a word-segmentation labeling model effect evaluation module. The text corpus labeling preparation module prepares for the labeling task; by distinguishing and selecting corpora from different sources, it performs single-method word-segmentation pre-labeling on the corpus data to be labeled according to source or topic, manages the corpora to be labeled and the word-segmentation corpora, and submits the raw-corpus labeling work to the semi-automatic corpus word-segmentation labeling module through multiple word-segmentation algorithms, such as bidirectional maximum matching based on an integrated dictionary, the conditional random field (CRF), JIEBA and the bidirectional LSTM network (BI-LSTM). The semi-automatic corpus word-segmentation labeling module creates word-segmentation labeling tasks according to the labeling requirements and corpus characteristics, selects a suitable algorithm model, manages the labeling business rules and carries out automatic labeling. It completes the automatic labeling of each type of labeling task using the business rules and one algorithm model selected from the integrated-dictionary algorithms (bidirectional maximum matching, CRF, JIEBA, BI-LSTM and the like), and fuses the automatic labeling result of the algorithm model with the automatic labeling result of the business rules. On the basis of the fused result, manual intervention and judgment are carried out according to the labeling business standard and the labeling results are stored. The training corpora and labeling models generated by the text corpus labeling preparation module are fed back to the feedback-type model learning and training module, which sets model parameters, selects model corpora and carries out model learning and training on existing models or on externally loaded deep enhancement models, and returns to model parameter setting after the model has been improved and updated. The unified training-model interface Train is called to generate a core dictionary and an N-gram core dictionary; an external algorithm model is imported through the unified model access interface; the model is updated or exported; the word-segmentation model file comprising the core dictionary and the N-gram dictionary file is saved; the word-segmentation training model table is updated; and a comprehensive evaluation model for labeling algorithms is established to evaluate the labeling effect of the model. Through continuous iteration between model updating and corpus labeling, the trained model replaces the model used for word-segmentation labeling in the platform, and the new word-segmentation labeling task is completed. The word-segmentation labeling model effect evaluation module constructs single-index algorithms according to the index standard, quantifies the indexes according to the index calculation rules, builds the comprehensive evaluation model by organizing the corresponding indexes for each labeling task, computes the comprehensive index value and feeds back the effect of the labeling model.
The text corpus labeling preparation module manages the corpora to be labeled according to source or topic and thereby prepares for the labeling tasks. The semi-automatic corpus word-segmentation labeling module selects a suitable algorithm for the labeling requirements and corpus characteristics, carries out automatic labeling, and allows the labeling result to be corrected through a manual judgment step. Specifically:
The text corpus labeling preparation module creates a word-segmentation labeling task for each source of corpus. The semi-automatic corpus word-segmentation labeling module selects, for each type of labeling task, the algorithm model that performs well on it: in a word-segmentation labeling task, one of the conditional random field (CRF), JIEBA and bidirectional LSTM (BI-LSTM) algorithms is selected, according to how each is configured to perform on automatic labeling of the corpus, to complete the automatic labeling. To build a conditional random field, a set of feature functions is first defined; each feature function takes as input the entire sentence s, the current position i, and the labels at positions i and i-1. A weight is then assigned to each feature function, and for each label sequence l the weighted feature functions are summed over all positions; if necessary, the summed score is converted into a probability. In the Bi-LSTM-CRF network, the CRF transition matrix A is learned by the CRF layer of the neural network, and the P matrix, that is, the emission matrix, is produced by the Bi-LSTM.
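Given an emission matrix P and a transition matrix A, the best label sequence is usually found with Viterbi decoding. The sketch below assumes a B/M/E/S (begin, middle, end, single) tag set and additive log-style scores; the tag set, the data layout and the function names are illustrative assumptions, and the scores are hand-written rather than produced by a trained network.

```python
import math

TAGS = ("B", "M", "E", "S")  # assumed tag set: begin, middle, end, single


def viterbi(emissions, transitions):
    """Return the highest-scoring tag path.

    emissions:   list of {tag: score}, one dict per character (the P matrix).
    transitions: {(prev_tag, tag): score} (the A matrix); missing pairs are forbidden.
    """
    best = [{t: emissions[0].get(t, -math.inf) for t in TAGS}]
    back = []
    for scores in emissions[1:]:
        row, ptr = {}, {}
        for cur in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p] + transitions.get((p, cur), -math.inf))
            row[cur] = (best[-1][prev]
                        + transitions.get((prev, cur), -math.inf)
                        + scores.get(cur, -math.inf))
            ptr[cur] = prev
        best.append(row)
        back.append(ptr)
    # Trace the back-pointers from the best final tag.
    path = [max(TAGS, key=lambda t: best[-1][t])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


def tags_to_words(text, tags):
    """Cut the text after every E or S tag."""
    words, start = [], 0
    for i, tag in enumerate(tags):
        if tag in ("E", "S"):
            words.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        words.append(text[start:])
    return words


if __name__ == "__main__":
    allowed = [("B", "M"), ("B", "E"), ("M", "M"), ("M", "E"),
               ("E", "B"), ("E", "S"), ("S", "B"), ("S", "S")]
    A = {pair: 0.0 for pair in allowed}
    P = [{"S": 2, "B": 1}, {"S": 2, "B": 1}, {"B": 2, "S": 1}, {"E": 2, "S": 1}]
    print(tags_to_words("我爱北京", viterbi(P, A)))
```

The transition table is what lets the CRF layer rule out impossible sequences such as B followed by S, which a per-character classifier could otherwise emit.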
The model learning and training module creates business labeling rules for special labeling tasks and manages them; the labeling business rules mainly comprise a business dictionary and regular expressions.
The feedback-type model learning and training module provides model learning, training and feedback-update capabilities for both internal and external labeling model algorithms, and the corpus is automatically labeled using the labeling business rules.
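A rule-based pre-labeling pass built from a business dictionary and regular expressions might look like the sketch below. The function names, the sample patterns and the conflict rule (longest span wins, then leftmost) are assumptions made for illustration.

```python
import re


def rule_spans(text, business_dict, patterns):
    """Return sorted, non-overlapping (start, end) spans that the rules treat as one word."""
    candidates = []
    for pattern in patterns:
        candidates += [m.span() for m in re.finditer(pattern, text)]
    for term in business_dict:
        candidates += [m.span() for m in re.finditer(re.escape(term), text)]
    # Assumed conflict rule: prefer longer spans, then earlier ones.
    candidates.sort(key=lambda s: (-(s[1] - s[0]), s[0]))
    chosen = []
    for start, end in candidates:
        if start < end and all(end <= s or start >= e for s, e in chosen):
            chosen.append((start, end))
    return sorted(chosen)


def apply_rules(text, spans):
    """Cut the text so each rule span is one token; the rest stays as raw chunks."""
    pieces, pos = [], 0
    for start, end in spans:
        if pos < start:
            pieces.append(text[pos:start])
        pieces.append(text[start:end])
        pos = end
    if pos < len(text):
        pieces.append(text[pos:])
    return pieces


if __name__ == "__main__":
    text = "订单AB-1024于2019年5月28日发货"
    spans = rule_spans(text, {"订单"}, [r"\d{4}年\d{1,2}月\d{1,2}日", r"[A-Z]{2}-\d+"])
    print(apply_rules(text, spans))
```

The chunks left outside the rule spans would still be passed to the selected segmentation algorithm; the rules only guarantee that dates, identifiers and domain terms are not split.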
The word-segmentation labeling model effect evaluation module fuses the automatic labeling result of the algorithm model with the automatic labeling result of the business rules; on the basis of the fused result, the labels are manually modified, confirmed and saved according to the labeling business standard. Annotators select and manage corpora from different sources through the text corpus labeling preparation module and, for each labeling task, save them as the text corpus to be labeled, that is, the raw corpus. In the semi-automatic corpus word-segmentation labeling module, a corresponding word-segmentation labeling task is created, a suitable labeling algorithm model is selected, and the task corpus is automatically pre-labeled with the selected model. At the same time, to account for the particularities of the domain the data comes from, business-related corpus is automatically pre-labeled with the business rules, and a voting method fuses the two kinds of labeling results. Based on the labeling business standard, the fused result is modified and adjusted in a manual intervention and judgment step and finally saved as the labeled ("cooked") word-segmentation corpus, which supplies the corpus required for feedback-type model learning and training.
The models used in the semi-automatic corpus word-segmentation labeling module are trained and updated through the feedback-type model learning and training module. Specifically, feedback-type learning and training can be carried out either on an existing model already used for labeling or on an external deep enhancement model; the parameters of the word-segmentation labeling model are set; and the labeled corpus required for training the word-segmentation model is selected to carry out model learning and training.
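The voting step can be sketched at the level of word boundaries. The text does not spell out the voting scheme, so the version below is an assumption: each candidate segmentation (from an algorithm model or from the business rules) votes for the character offsets where it places a cut, and a cut is kept when a strict majority agrees.

```python
def boundaries(words):
    """Character offsets at which a word ends (the end of the text is implied)."""
    cuts, pos = set(), 0
    for w in words[:-1]:
        pos += len(w)
        cuts.add(pos)
    return cuts


def vote_fuse(text, segmentations):
    """Keep every cut proposed by a strict majority of the segmentations."""
    votes = {}
    for words in segmentations:
        if "".join(words) != text:
            raise ValueError("every segmentation must cover the same text")
        for cut in boundaries(words):
            votes[cut] = votes.get(cut, 0) + 1
    keep = sorted(c for c, n in votes.items() if 2 * n > len(segmentations))
    pieces, start = [], 0
    for cut in keep + [len(text)]:
        if cut > start:
            pieces.append(text[start:cut])
            start = cut
    return pieces


if __name__ == "__main__":
    text = "研究生命的起源"
    candidates = [
        ["研究生", "命", "的", "起源"],
        ["研究", "生命", "的", "起源"],
        ["研究", "生命", "的起源"],
    ]
    print(vote_fuse(text, candidates))
```

With an even number of voters a strict majority leaves ties uncut, which tends to produce longer words; the manual judgment step described above is where such cases would be corrected.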
See FIG. 2, which shows the detailed working process of the semi-automatic word-segmentation corpus labeling training device. In the word-segmentation model training flow, the model training user selects the corpus for training through the model corpus selection module, selects one of the CRF, JIEBA and BI-LSTM word-segmentation algorithms for training, and calls the word-segmentation training-model interface Train to generate a core dictionary and an N-gram core dictionary so that the model accuracy is optimized. It is then decided whether to save the word-segmentation model. The labeled corpus data is used to train the trainable algorithms, such as CRF and BI-LSTM, offline; external algorithm models are imported through the unified word-segmentation training-model access interface; the model is updated or exported; the word-segmentation model file comprising the core dictionary and the N-gram dictionary file is saved; and the word-segmentation training model table is updated. After the word-segmentation model has been updated, the word-segmentation service is started, one of the CRF, JIEBA and BI-LSTM algorithms is selected, and a new word-segmentation switch is added to the configuration file. The service checks whether the word-segmentation model has been updated. If it has, the specified word-segmentation model is read and its name is obtained; otherwise the word-segmentation training model table and the algorithm's built-in core dictionary are read and the dictionaries are merged. The word-segmentation training model table and the improved model are then updated, the updated labeling model is fed back to the built-in core dictionary of the algorithm in the semi-automatic word-segmentation corpus labeling module, and the trained model replaces the model used for word-segmentation labeling in the platform, completing the new word-segmentation labeling task.
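The dictionary-building and dictionary-merging steps can be illustrated as follows. The internals of the Train interface are not described in the text, so `train` and `merge_dictionaries` below are hypothetical stand-ins: they assume the core dictionary holds word frequencies, the N-gram dictionary holds counts of adjacent word pairs (N = 2), and merging adds frequencies.

```python
from collections import Counter


def train(segmented_sentences):
    """Hypothetical stand-in for Train: count words and adjacent word pairs."""
    core, bigram = Counter(), Counter()
    for words in segmented_sentences:
        core.update(words)
        # Sentence-boundary markers let the bigram table model first and last words.
        padded = ["<s>"] + list(words) + ["</s>"]
        bigram.update(zip(padded, padded[1:]))
    return core, bigram


def merge_dictionaries(builtin, trained):
    """Combine an algorithm's built-in core dictionary with a trained one (assumed: add counts)."""
    merged = Counter(builtin)
    merged.update(trained)
    return merged


if __name__ == "__main__":
    corpus = [["研究", "生命", "的", "起源"], ["生命", "的", "意义"]]
    core, bigram = train(corpus)
    print(merge_dictionaries({"研究": 5, "起源": 1}, core))
```

Adding counts keeps words that only the built-in dictionary knows while letting newly labeled domain words enter the vocabulary, which matches the iterative update between labeling and training described above.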
The word segmentation labeling model effect evaluation module provides methods for model evaluation index construction labeling, construction rules, index quantification and the like, supports evaluation of model labeling effect through automatic construction of a labeling algorithm comprehensive evaluation model, and specifically comprises the following steps: the word segmentation labeling model effect evaluation module constructs and sets a single index algorithm according to the index standard; quantifying the indexes according to an index calculation rule, and constructing a labeling algorithm comprehensive evaluation model by adopting corresponding indexes of the organization according to different labeling tasks; and (4) completing the calculation of the index comprehensive value and feeding back the effect of the labeling model.
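How the single indexes are combined into one comprehensive value is not specified. One simple possibility, shown here purely as an assumption, is a weighted average whose weights are chosen per labeling task; the function name and the sample weights are hypothetical.

```python
def comprehensive_score(indexes, weights):
    """Weighted average of single-index values (assumed combination rule).

    indexes: {index_name: value in [0, 1]}
    weights: {index_name: non-negative weight}, chosen per labeling task
    """
    missing = set(weights) - set(indexes)
    if missing:
        raise ValueError(f"missing index values: {sorted(missing)}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(indexes[name] * w for name, w in weights.items()) / total


if __name__ == "__main__":
    measured = {"precision": 0.9, "recall": 0.8, "f_measure": 0.85}
    task_weights = {"precision": 2, "recall": 2, "f_measure": 1}
    print(comprehensive_score(measured, task_weights))
```

Keeping the weights in a per-task table makes it possible to emphasize, for example, ambiguity accuracy for one task and recall for another without changing the single-index algorithms.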
The basic evaluation indexes used by the device in this embodiment for word segmentation and word segmentation corpus labeling are segmentation precision (Precision), segmentation recall (Recall), the F measure, intersection-type ambiguity accuracy, combination-type ambiguity accuracy, and multi-category (facultative) word labeling accuracy. They are defined as follows:
Figure BDA0002076348930000071
Figure BDA0002076348930000072
where F is the F value, i.e. the harmonic mean of precision and recall, P is the precision, and R is the recall.
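The formulas themselves are images in the source and do not appear in this text. As an illustrative sketch, assuming the standard definitions (precision is the number of correctly segmented words divided by the number of words the segmenter produced, recall is the same count divided by the number of words in the reference segmentation, and F = 2PR / (P + R)), the three values can be computed by comparing word spans. The function names are hypothetical.

```python
def to_spans(words):
    """Convert a segmentation into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def precision_recall_f(gold_words, predicted_words):
    """Return (P, R, F) for one predicted segmentation against the reference."""
    gold = to_spans(gold_words)
    predicted = to_spans(predicted_words)
    correct = len(gold & predicted)  # a word is correct only if both boundaries match
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f


gold = ["研究", "生命", "起源"]
predicted = ["研究生", "命", "起源"]
p, r, f = precision_recall_f(gold, predicted)
```

In this example only "起源" has both boundaries right, so one of three predicted words is correct and one of three reference words is recovered.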
Precision and recall generally trade off against each other: a method that raises precision tends to lower recall, and vice versa. To reflect an application system's differing requirements for precision and recall, a weight can be introduced, which gives the value E:
Figure BDA0002076348930000073
where b is the added weight: the larger b is, the more weight precision carries in the E value; conversely, the smaller b is, the more weight recall carries.
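The E formula is an image in the source, so its exact form cannot be recovered from this text. One form that is consistent with the description (an assumption, not necessarily the patent's formula) is a weighted harmonic mean that puts weight b²/(1 + b²) on precision. Note that the widely used F-beta measure assigns the weight the other way round, with a larger beta favouring recall. The function name below is hypothetical.

```python
def weighted_e(p, r, b):
    """Weighted harmonic mean of precision p and recall r.

    Assumed form: E = (1 + b^2) * P * R / (P + b^2 * R).
    With b = 1 this equals F = 2PR / (P + R); as b grows, E approaches P;
    as b shrinks toward 0, E approaches R.
    """
    if p <= 0 or r <= 0:
        return 0.0
    b2 = b * b
    return (1 + b2) * p * r / (p + b2 * r)


balanced = weighted_e(0.8, 0.5, 1.0)            # equals the plain F value
precision_heavy = weighted_e(0.8, 0.5, 100.0)   # close to the precision, 0.8
recall_heavy = weighted_e(0.8, 0.5, 0.01)       # close to the recall, 0.5
```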
Segmentation ambiguity is another difficulty for automatic Chinese word segmentation algorithms. To examine an algorithm's ability to resolve ambiguity, separate indexes are defined for the ambiguous portions of the text. For the two ambiguity types, intersection type and combination type, the accuracies are defined respectively as follows:
Figure BDA0002076348930000074
Figure BDA0002076348930000075
Similar to segmentation ambiguity, part-of-speech tagging has its own "ambiguous words", called multi-category words (facultative words). A word is a multi-category word if it has two or more different parts of speech. Labeling multi-category words is clearly both the key point and the main difficulty of part-of-speech tagging, so a dedicated index, the multi-category word labeling accuracy, is defined to examine it, as follows:
Figure BDA0002076348930000076
The device manages the corpus to be labeled by source or topic, which prepares for the labeling tasks. Semi-automatic labeling of the word segmentation corpus is achieved by integrating several word segmentation algorithms, such as CRF, JIEBA and BI-LSTM: an applicable labeling algorithm is offered for selection during labeling and is used to pre-label the corpus data. The pre-labeled corpus is then modified, confirmed and submitted through a manual confirmation step, which completes the corpus labeling work. After a labeling task is completed, the model is retrained on the labeled corpus. The labeling effect of the model is evaluated with a comprehensive evaluation model for the labeling algorithm, and the result is fed back into model learning and training so that the model performs as well as possible on subsequent labeling tasks. Continuous iteration between model updating and corpus labeling improves both the quality of the labeled corpus and the effect of the algorithm model.
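The description integrates several segmenters for pre-labeling, and claim 7 mentions a voting method for fusing algorithm-based and rule-based results, but the voting procedure is not detailed. A minimal sketch, assuming a simple majority vote over word boundaries (the function names are hypothetical, and the candidate segmentations would come from whichever segmenters and business rules are configured):

```python
from collections import Counter


def boundaries(words):
    """Character offsets at which a segmentation places an internal word boundary."""
    cuts, pos = set(), 0
    for word in words[:-1]:
        pos += len(word)
        cuts.add(pos)
    return cuts


def vote_fuse(text, segmentations):
    """Keep each boundary proposed by more than half of the candidate segmentations."""
    votes = Counter()
    for words in segmentations:
        if "".join(words) != text:
            raise ValueError("segmentation does not cover the text")
        votes.update(boundaries(words))
    threshold = len(segmentations) / 2
    cuts = sorted(pos for pos, count in votes.items() if count > threshold)
    fused, start = [], 0
    for pos in cuts:
        fused.append(text[start:pos])
        start = pos
    fused.append(text[start:])
    return fused


text = "研究生命起源"
candidates = [
    ["研究", "生命", "起源"],   # e.g. output of one segmenter
    ["研究生", "命", "起源"],   # e.g. output of another segmenter
    ["研究", "生命", "起源"],   # e.g. output of a rule-based labeler
]
fused = vote_fuse(text, candidates)
```

The fused result is only a pre-label; in the workflow described above it still passes through manual modification and confirmation.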
The foregoing describes the preferred embodiment of the present invention. The above embodiment illustrates rather than limits the invention, and those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. Various modifications and improvements made without departing from the spirit and substance of the invention are also considered to be within the scope of the invention.

Claims (10)

1. A semi-automatic participle corpus labeling training device, comprising: a text corpus labeling preparation module, a semi-automatic corpus word segmentation labeling module, a feedback-type model learning training module, and a word segmentation labeling model effect evaluation module, characterized in that: the text corpus labeling preparation module prepares for the labeling task; by distinguishing data from different sources and selecting corpus sources, it manages the corpus data to be labeled and the word segmentation corpus according to source or topic, performs single word segmentation pre-labeling on the corpus data to be labeled, and, using multiple word segmentation algorithms, namely bidirectional maximum matching based on an integrated dictionary, conditional random field (CRF), JIEBA Chinese word segmentation, and bidirectional LSTM network (BI-LSTM), submits the raw corpus word segmentation labeling work to the semi-automatic corpus word segmentation labeling module; the semi-automatic corpus word segmentation labeling module creates word segmentation labeling tasks according to different labeling requirements and corpus characteristics, selects an applicable labeling algorithm model, manages the labeling business rules and carries out automatic labeling according to them, completes the automatic labeling of each type of labeling task by selecting one word segmentation algorithm model from among the integrated-dictionary bidirectional maximum matching, CRF, JIEBA and BI-LSTM word segmentation algorithms together with one business rule, and fuses the automatic labeling results of the algorithm model with the automatic labeling results of the business rules; on the basis of the fused automatic labeling results, manual intervention and judgment are carried out according to the labeling business standard and the labeling results are stored; the training model corpus and the labeling model generated by the text corpus labeling preparation module are fed back to the feedback-type model learning training module, which loads the model, sets the model parameters, selects the model corpus, and performs model learning training on the basis of an existing model and an external deep reinforcement model, returning to model parameter setting after the model has been refined and updated; the unified training model interface Train is called to generate a core dictionary and an N-gram core dictionary, an external algorithm model is imported through the unified model access interface, the model is updated or exported, a word segmentation model file comprising the core dictionary and the N-gram dictionary file is saved, the word segmentation training model table is updated, a comprehensive evaluation model for the labeling algorithm is established to evaluate the labeling effect of the model, and, through continuous iteration between model updating and corpus labeling, the trained model is used to update the model used for word segmentation labeling in the platform so as to complete new word segmentation labeling tasks.
2. The semi-automatic participle corpus labeling training device of claim 1, wherein: the word segmentation labeling model effect evaluation module constructs and configures a single-index algorithm according to the index standard, quantifies the indexes according to the index calculation rules, builds a comprehensive evaluation model for the labeling algorithm by organizing the indexes appropriate to each labeling task, computes the comprehensive index value, and feeds back the effect of the labeling model.
3. The semi-automatic participle corpus labeling training device of claim 1, wherein: the semi-automatic corpus word segmentation labeling module autonomously selects a suitable algorithm and performs automatic labeling according to the different labeling requirements and corpus characteristics, and intervention in and judgment of the labeling result is carried out through a manual judgment step.
4. The semi-automatic participle corpus labeling training device of claim 1, wherein: the text corpus labeling preparation module creates word segmentation labeling tasks according to corpora from different sources; and the semi-automatic corpus word segmentation labeling module selects, for each type of labeling task, an algorithm model whose effect is suited to that task, configuring the CRF, JIEBA and BI-LSTM word segmentation algorithms in the word segmentation labeling task according to their automatic labeling effect on the corpus and selecting among them to complete automatic labeling.
5. The semi-automatic participle corpus labeling training device of claim 1, wherein: the model learning training module creates business labeling rules for special labeling tasks and manages the labeling business rules, which comprise a business dictionary and regular expressions; the feedback-type model learning training module provides model learning training and feedback updating capabilities for internal and external labeling model algorithms, and the corpus is automatically labeled using the labeling business rules.
6. The semi-automatic participle corpus labeling training device of claim 1, wherein: the word segmentation labeling model effect evaluation module fuses the automatic labeling result based on the algorithm model with the automatic labeling result based on the business rules; and, on the basis of the fused automatic labeling result, the labeling result is manually modified, confirmed and stored according to the labeling business standard.
7. The semi-automatic participle corpus labeling training device of claim 1, wherein: the text corpus labeling preparation module selects and manages corpora from different sources and stores them, according to the different labeling tasks, as the text corpus to be labeled, namely the raw corpus; in the semi-automatic corpus word segmentation labeling module, a corresponding word segmentation labeling task is created, an applicable labeling algorithm model is selected, and the corpus of the word segmentation task is automatically pre-labeled based on the selected algorithm model; at the same time, in view of the particularities of the domain to which the data belongs, automatic pre-labeling is also performed based on the relevant business rules, and a voting method is used to fuse the two types of labeling results.
8. The semi-automatic participle corpus labeling training device of claim 1, wherein: the semi-automatic word segmentation corpus labeling module performs model training and updating through the feedback-type model learning training module, which performs feedback-type model learning training either on an existing model used for labeling or with an external deep reinforcement model; the parameters of the word segmentation labeling model are set; and the labeled (mature) corpus required for word segmentation model training is selected for model learning training.
9. The semi-automatic participle corpus labeling training device of claim 1, wherein, in the word segmentation model training flow: the model corpus selection module selects the corpus used for training the word segmentation model, one of the CRF, JIEBA and BI-LSTM word segmentation algorithms is selected for training, and the word segmentation training model interface Train is called to generate a core dictionary and an N-gram core dictionary, with the aim of optimizing model accuracy; it is decided whether to save the word segmentation model, the trainable algorithms CRF and BI-LSTM are trained offline on the labeled corpus data, an external algorithm model is imported through the unified word segmentation training model access interface, the model is updated or exported, a word segmentation model file comprising the core dictionary and the N-gram dictionary file is saved, and the word segmentation training model table is updated.
10. The semi-automatic participle corpus labeling training device of claim 9, wherein: after the word segmentation model is updated, the word segmentation service is started, a CRF, JIEBA or BI-LSTM word segmentation algorithm is selected, and a new word segmentation switch is added to the configuration file; it is judged whether the word segmentation model has been updated; if so, the specified word segmentation model is read and its name is obtained; otherwise, the word segmentation training model table and the algorithm's built-in core dictionary are read and the dictionaries are merged; the word segmentation training model table and the refined model are updated, the updated labeling model is fed back to the algorithm's built-in core dictionary in the semi-automatic word segmentation corpus labeling module, and the trained model is used to update the model used for word segmentation labeling in the platform so as to complete a new word segmentation labeling task.
CN201910455093.XA 2019-05-29 2019-05-29 Semi-automatic participle corpus labeling training device Active CN110287482B (en)

Publications (2)

Publication Number Publication Date
CN110287482A (en) 2019-09-27
CN110287482B (en) 2022-07-08

Family

ID=68002801

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110826331B (en) * 2019-10-28 2023-04-18 南京师范大学 Intelligent construction method of place name labeling corpus based on interactive and iterative learning
CN111008706B (en) * 2019-12-09 2023-05-05 长春嘉诚信息技术股份有限公司 Processing method for automatically labeling, training and predicting mass data
CN111597807B (en) * 2020-04-30 2022-09-13 腾讯科技(深圳)有限公司 Word segmentation data set generation method, device, equipment and storage medium thereof
CN111582388A (en) * 2020-05-11 2020-08-25 广州中科智巡科技有限公司 Method and system for quickly labeling image data
CN112101014B (en) * 2020-08-20 2022-07-26 淮阴工学院 Chinese chemical industry document word segmentation method based on mixed feature fusion
CN112036178A (en) * 2020-08-25 2020-12-04 国家电网有限公司 Distribution network entity related semantic search method
CN113206854B (en) * 2021-05-08 2022-12-13 首约科技(北京)有限公司 Method and device for rapidly developing national standard terminal protocol

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102243649A (en) * 2011-06-07 2011-11-16 上海交通大学 Semi-automatic information extraction processing device of ontology
CN105718586A (en) * 2016-01-26 2016-06-29 中国人民解放军国防科学技术大学 Word division method and device
CN107622050A (en) * 2017-09-14 2018-01-23 武汉烽火普天信息技术有限公司 Text sequence labeling system and method based on Bi LSTM and CRF
CN108256029A (en) * 2018-01-11 2018-07-06 北京神州泰岳软件股份有限公司 Statistical classification model training apparatus and training method
CN109033085A (en) * 2018-08-02 2018-12-18 北京神州泰岳软件股份有限公司 The segmenting method of Chinese automatic word-cut and Chinese text
CN109446369A (en) * 2018-09-28 2019-03-08 武汉中海庭数据技术有限公司 The exchange method and system of the semi-automatic mark of image
CN109508453A (en) * 2018-09-28 2019-03-22 西南电子技术研究所(中国电子科技集团公司第十研究所) Across media information target component correlation analysis systems and its association analysis method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10606982B2 (en) * 2017-09-06 2020-03-31 International Business Machines Corporation Iterative semi-automatic annotation for workload reduction in medical image labeling

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
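Hou Chao (侯超). Design and Implementation of a Strategy Generation System Based on Natural Language Processing (基于自然语言处理的策略生成系统的设计与实现). China Master's Theses Full-text Database, Information Science and Technology series (《中国优秀硕士学位论文全文数据库 信息科技辑》). 2013. *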

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant