CN110287482A - Semi-automatic word segmentation corpus annotation and training device - Google Patents

Semi-automatic word segmentation corpus annotation and training device

Info

Publication number
CN110287482A
Authority
CN
China
Prior art keywords
word segmentation
model
corpus
annotation
automatic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910455093.XA
Other languages
Chinese (zh)
Other versions
CN110287482B (en
Inventor
代翔
崔莹
黄细凤
孙涛
李强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Electronic Technology Institute No 10 Institute of Cetc
Original Assignee
Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority to CN201910455093.XA
Publication of CN110287482A
Application granted
Publication of CN110287482B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention discloses a semi-automatic word segmentation corpus annotation and training device, intended to overcome the drawbacks of existing approaches to annotating word segmentation corpora and using them in training. The technical scheme is as follows: a text corpus annotation preparation module manages the corpora to be annotated and the segmented corpora and, by means of multiple segmentation algorithms such as bidirectional maximum matching based on an integrated dictionary, CRF and JIEBA, submits the raw-corpus segmentation annotation work to a semi-automatic corpus segmentation annotation module. That module creates a segmentation annotation task, selects a suitable annotation algorithm model and performs automatic annotation. On the basis of the fused automatic annotation results, the training corpus and annotation model generated by the text corpus annotation preparation module are fed back to a feedback-based model learning and training module, which performs corpus selection and model learning and training, calls a unified training model interface to generate a core lexicon, and updates the segmentation training model table. An annotation algorithm comprehensive evaluation model is established to evaluate the annotation effect of the model, and new segmentation annotation tasks are then completed.

Description

Semi-automatic word segmentation corpus annotation and training device
Technical field
The present invention relates to the field of text mining technology, and in particular to a semi-automatic annotation and training device for word segmentation corpora.
Background art
A word is the smallest meaningful language unit that can be used independently, but Chinese has no explicit separator between words; Chinese lexical analysis is therefore the basis and the key of Chinese information processing. The accuracy of word segmentation and the accuracy of part-of-speech tagging are closely related, and integrating the two processes helps disambiguation and improves overall efficiency. A Chinese sentence consists of a continuous run of characters with no spaces between words. Part-of-speech tagging is the process of assigning a suitable part of speech to each word in a sentence. Chinese word segmentation is the first step of Chinese information processing and plays an extremely important role in many application fields (text segmentation, event extraction, text summarization, information retrieval and so on). Segmentation and part-of-speech tagging are both basic processing of a corpus and are together referred to as corpus segmentation annotation. However, annotated segmentation corpora are scarce. Segmentation improves the performance of the larger task it serves only indirectly, and in a real system different segmentation errors have very different effects. In addition, segmentation corpora are very expensive to obtain: it is difficult for human annotators to annotate a raw corpus consistently against a given standard, so the scale of segmentation corpora remains quite limited even in today's era of big data and large computing power. In the information processing pipeline, part-of-speech tagging immediately follows segmentation and relies on similar algorithmic principles, so many systems handle the two in an integrated way. At present, however, domain-specific segmentation corpora are relatively scarce and segmentation corpus annotation is mainly done by hand. Fully manual annotation is extremely time-consuming and suffers from poor annotation quality, a tedious annotation process, low annotation efficiency and high labor cost. Existing segmentation corpus annotation tools also have drawbacks: they offer only a single annotation method, and the annotation model cannot be updated automatically. There is therefore an urgent need for a semi-automatic segmentation annotation and training platform that can assist human annotators. A semi-automatic segmentation annotation method, and a device based on it, that could automatically and rapidly produce a pre-annotation result for the segmentation corpus to be processed would be highly desirable.
In recent years, with the rapid development of big data acquisition techniques, extracting the maximum value from data has become especially urgent, which places new demands on the intelligent analysis of big data. Against this background, technologies such as machine learning and deep learning have developed rapidly in big data applications and achieved great success, and the model algorithms underlying them depend on training with large amounts of annotated corpus data. Large-scale corpus annotation strongly influences the training of algorithm models; as basic work in big data analysis, it supports day-to-day research and development, algorithm tuning, and demonstration and verification, and is a key foundation of big data mining and analysis. Word segmentation depends critically on the dictionary. The dictionary provided by JIEBA is not fully complete, but it is sufficient for general applications. The JIEBA plug-in can segment a passage of Chinese text and offers three segmentation modes suited to different needs. Chinese word segmentation (Chinese Word Segmentation) means cutting a sequence of Chinese characters into individual words, that is, recombining a continuous character sequence into a word sequence according to a given specification.
Existing segmentation algorithms fall into three categories: methods based on string matching, methods based on understanding, and methods based on statistics. Segmentation based on string matching, also called mechanical segmentation, matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to a given strategy; if a string is found in the dictionary, the match succeeds and a word is identified. Common strategies are:
1) forward maximum matching (scanning from left to right);
2) reverse maximum matching (scanning from right to left);
3) minimum segmentation (minimizing the number of words cut from each sentence);
4) bidirectional maximum matching (two scans, one from left to right and one from right to left).
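The matching strategies listed above can be illustrated with a minimal sketch. The function names, the `max_len` parameter and the tie-breaking rule (fewer words first, then fewer single-character words, then the reverse scan) are assumptions chosen for illustration; the text does not specify how the two scans are reconciled.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, lexicon, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def bidirectional_max_match(text, lexicon, max_len=4):
    """Run both scans and keep the result preferred by a simple heuristic."""
    fwd = forward_max_match(text, lexicon, max_len)
    bwd = backward_max_match(text, lexicon, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd

    def singles(ws):
        return sum(1 for w in ws if len(w) == 1)

    # Assumed tie-break: fewer single-character words, else the reverse scan.
    return fwd if singles(fwd) < singles(bwd) else bwd


lexicon = {"研究", "研究生", "生命", "的", "起源"}
print(bidirectional_max_match("研究生命的起源", lexicon))
```

In this example the forward scan greedily takes "研究生" and is left with an isolated "命", while the reverse scan recovers "研究 / 生命"; the heuristic therefore keeps the reverse result.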
Segmentation based on understanding: this method identifies words by having the computer simulate human understanding of a sentence. Its basic idea is to perform syntactic and semantic analysis during segmentation and to use syntactic and semantic information to resolve ambiguity. It generally comprises three parts: a segmentation subsystem, a syntactic-semantic subsystem and a master control part. Under the coordination of the master control part, the segmentation subsystem obtains syntactic and semantic information about words and sentences in order to resolve segmentation ambiguity; in other words, it simulates the process by which a person understands a sentence. This method requires a large amount of linguistic knowledge and information. Because Chinese linguistic knowledge is general and complex, it is difficult to organize the various kinds of linguistic information into a form that a machine can read directly, so understanding-based segmentation systems are still at the experimental stage.
Segmentation based on statistics: given a large amount of already segmented text, a statistical machine learning model learns the rules of word segmentation (this is called training) and then segments unseen text. Examples include maximum-probability segmentation and maximum-entropy segmentation. With the construction of large-scale corpora and the research and development of statistical machine learning methods, statistical Chinese word segmentation has gradually become the mainstream approach.
The main statistical models are the N-gram model, the hidden Markov model (Hidden Markov Model, HMM), the maximum entropy model (ME) and the conditional random field model (Conditional Random Fields, CRF). Lexical analysis is an important basic technology of NLP and includes word segmentation, part-of-speech tagging and entity recognition; its main algorithmic structure is based on the Bi-LSTM-CRF architecture. The CRF is used to obtain the globally optimal output sequence, which amounts to reusing the information produced by the LSTM. In terms of network structure, Bi-LSTM-CRF still applies the overall CRF framework: the output of the LSTM for the i-th tag at each time step t is treated as the "node function" of the CRF feature functions (a feature function that depends only on the current position), while the "edge function" (a feature function that depends on adjacent positions) is still the one carried by the CRF. The (linear) feature function of the original linear-chain CRF is thus replaced by the (non-linear) output f1 of the LSTM, which introduces non-linearity into the original CRF and allows a better fit to the data. Bi-LSTM is a bidirectional LSTM; compared with a unidirectional LSTM, it captures the context within a sentence better. A Bi-LSTM is in fact two long short-term memory (LSTM) networks: the reverse LSTM first reverses the input sequence end to end, runs an ordinary LSTM over it, and then reverses the output again so that it lines up with the input of the forward LSTM.
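The division of labor between the "node function" and the "edge function" can be written out as a scoring formula. The following is the formulation commonly used for Bi-LSTM-CRF taggers, not one given in the text; the symbols are assumptions: $P_{t,k}$ is the Bi-LSTM output (emission score) for tag $k$ at position $t$, $A_{j,k}$ is the CRF transition score from tag $j$ to tag $k$, and $n$ is the sentence length.

```latex
s(x, y) = \sum_{t=1}^{n} P_{t,\,y_t} + \sum_{t=2}^{n} A_{y_{t-1},\,y_t},
\qquad
p(y \mid x) = \frac{\exp\bigl(s(x, y)\bigr)}{\sum_{y'} \exp\bigl(s(x, y')\bigr)}
```

The first sum collects the per-position scores supplied by the Bi-LSTM and the second collects the adjacent-tag scores supplied by the CRF; the globally optimal sequence is the $y$ that maximizes $s(x, y)$.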
Summary of the invention
In view of the shortcomings of the prior art, the purpose of the present invention is to overcome the above drawbacks in annotating segmentation corpora and using them in training, and to propose a semi-automatic word segmentation corpus annotation and training device.
The above purpose of the present invention can be achieved by the following measures. A semi-automatic word segmentation corpus annotation and training device comprises a text corpus annotation preparation module, a semi-automatic corpus segmentation annotation module, a feedback-based model learning and training module and a segmentation annotation model effect evaluation module, and is characterized in that: the text corpus annotation preparation module prepares for annotation tasks; by distinguishing data from different sources and selecting corpus sources, it applies single-segmenter pre-annotation to the corpus data to be annotated by source or topic, thereby managing the corpora to be annotated and the segmented corpora, and then, by means of multiple segmentation algorithms such as bidirectional maximum matching based on an integrated dictionary, CRF, JIEBA and BI-LSTM, submits the raw-corpus segmentation annotation work to the semi-automatic corpus segmentation annotation module; the semi-automatic corpus segmentation annotation module, according to the annotation requirements and corpus characteristics, creates segmentation annotation tasks, selects a suitable annotation algorithm model and performs automatic annotation under annotation business rule management; using one segmentation algorithm model selected from bidirectional maximum matching based on an integrated dictionary, CRF, JIEBA, BI-LSTM and the like, together with the business rules, it completes the automatic annotation of each type of annotation task, and the automatic annotation results based on the algorithm model are fused with those based on the business rules; on the basis of the fused automatic annotation results, manual intervention and review are carried out according to the annotation business standard and the annotation results are saved; the training corpus and annotation model generated by the text corpus annotation preparation module are fed back to the feedback-based model learning and training module, which loads the existing model or an external deep enhanced model and performs model parameter setting, model corpus selection and model learning and training, returning to model parameter setting after the model has been refined and updated; after the unified training model interface Train is called to generate the core lexicon and the N-gram core lexicon, an external algorithm model is imported through the unified model access interface, the model is updated or exported, the segmentation model file containing the core lexicon and the N-gram lexicon file is saved, and the segmentation training model table is updated; an annotation algorithm comprehensive evaluation model is established to evaluate the annotation effect of the model; through continuous iteration between model updating and corpus annotation, the trained model is used to update the model used for segmentation annotation in the platform, and new segmentation annotation tasks are completed; the segmentation annotation model effect evaluation module constructs single-indicator algorithms according to the indicator standards, quantifies the indicators according to the indicator calculation rules, organizes the corresponding indicators according to the annotation task to construct the annotation algorithm comprehensive evaluation model, computes the comprehensive indicator value, and feeds back the annotation model effect.
Compared with the prior art, the present invention has the following beneficial effects:
The present invention establishes an annotation algorithm comprehensive evaluation model to evaluate the annotation effect of the model and feeds the result back into segmentation model learning and training, so that the model reaches its best effect and can be used for subsequent new annotation tasks; continuous iteration between model updating and corpus annotation improves both the quality of corpus segmentation annotation and the effect of the algorithm models. According to the annotation requirements and corpus characteristics, the system provides automatic annotation based either on a freely selected suitable algorithm or on the fusion of multiple algorithms. Multi-algorithm fusion merges the results of the individual algorithms by voting; when the correlation between methods is ignored, the ensemble performs better than any single method. Pre-annotation carried out in this way reduces the complexity of manual annotation and lowers labor cost;
The present invention distinguishes data from different sources in order to manage segmentation corpora, and introduces a manual review step. By integrating algorithms such as dictionary-based bidirectional maximum matching, CRF-based segmentation, CRF+Bi-LSTM-based segmentation and JIEBA segmentation, it offers a choice of suitable annotation algorithms for different segmentation corpora during annotation, and pre-annotates the corpus data either with a single segmentation method or with a fusion of several methods, the results of the several methods being fused by voting; it supports real-time automatic feedback adjustment of the back-end segmentation algorithm models, which greatly improves the efficiency and accuracy of corpus annotation;
For different segmentation corpora, the present invention integrates multiple segmentation algorithms such as dictionary-based bidirectional maximum matching, CRF-based segmentation and CRF+Bi-LSTM-based segmentation, offers a choice of suitable annotation algorithms during annotation, and pre-annotates the corpus data with a single segmentation method or with a fusion of several methods, the results being fused by voting. After an annotation task is completed, the segmentation model is retrained with the annotated corpus. Finally, the segmentation-annotated corpus is confirmed and submitted through a manual confirmation step, which completes the corpus segmentation annotation work. The annotation algorithm comprehensive evaluation model evaluates the annotation effect of the model and feeds the result back into segmentation model learning and training, so that the model reaches its best effect and can be used for subsequent new annotation tasks; continuous iteration between model updating and corpus annotation improves both the quality of corpus segmentation annotation and the effect of the algorithm models.
The present invention can perform sequence labeling by building a Bi-LSTM network and thereby perform word segmentation, with an accuracy of about 95%; the segmentation-annotated corpus is corrected, confirmed and submitted through a manual confirmation step, which completes the corpus segmentation annotation work; after an annotation task is completed, the segmentation model is retrained with the annotated corpus; the system provides a friendly human-computer interactive annotation interface that simplifies the user's annotation workflow;
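Treating segmentation as sequence labeling requires a per-character tag scheme. The text does not name one; the sketch below assumes the widely used B/M/E/S scheme (begin, middle, end, single) and shows only the conversion between a segmented sentence and its tag sequence, which is the form a Bi-LSTM tagger would be trained on and would predict. The function names are hypothetical.

```python
def words_to_bmes(words):
    """Convert a list of words into one B/M/E/S tag per character."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def bmes_to_words(chars, tags):
    """Rebuild words from characters and their predicted B/M/E/S tags."""
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        buf += ch
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(buf)
            buf = ""
    if buf:  # tolerate a tag sequence that ends mid-word
        words.append(buf)
    return words


tags = words_to_bmes(["研究生", "的", "起源"])
print(tags)
print(bmes_to_words("研究生的起源", tags))
```

Manually corrected annotations can be converted back to tags with `words_to_bmes`, which is one way the confirmed corpus could be reused for retraining.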
The present invention provides a unified segmentation model access standard and supports the import, training and use of external models. It can be applied in a variety of electronic devices.
Brief description of the drawings
Fig. 1 is a schematic diagram of the working principle of the semi-automatic word segmentation corpus annotation and training device of the present invention.
Fig. 2 is a flow chart of the segmentation model training process of Fig. 1.
To make the purpose, technical scheme and advantages of the present invention clearer, the invention is described in further detail below with reference to the embodiment and the drawings.
Detailed description of the embodiments
Refer to Fig. 1. In the preferred embodiment described below, a semi-automatic word segmentation corpus annotation and training device comprises a text corpus annotation preparation module, a semi-automatic corpus segmentation annotation module, a feedback-based model learning and training module and a segmentation annotation model effect evaluation module, and is characterized in that: the text corpus annotation preparation module prepares for annotation tasks; by distinguishing data from different sources and selecting corpus sources, it applies single-segmenter pre-annotation to the corpus data to be annotated by source or topic, thereby managing the corpora to be annotated and the segmented corpora, and then, by means of multiple segmentation algorithms such as bidirectional maximum matching based on an integrated dictionary, conditional random field (CRF), JIEBA and bidirectional LSTM network (BI-LSTM), submits the raw-corpus segmentation annotation work to the semi-automatic corpus segmentation annotation module; the semi-automatic corpus segmentation annotation module, according to the annotation requirements and corpus characteristics, creates segmentation annotation tasks, selects a suitable annotation algorithm model and performs automatic annotation under annotation business rule management; using one segmentation algorithm model selected from bidirectional maximum matching based on an integrated dictionary, CRF, JIEBA, BI-LSTM and the like, together with the business rules, it completes the automatic annotation of each type of annotation task, and the automatic annotation results based on the algorithm model are fused with those based on the business rules; on the basis of the fused automatic annotation results, manual intervention and review are carried out according to the annotation business standard and the annotation results are saved; the training corpus and annotation model generated by the text corpus annotation preparation module are fed back to the feedback-based model learning and training module, which loads the existing model or an external deep enhanced model and performs model parameter setting, model corpus selection and model learning and training, returning to model parameter setting after the model has been refined and updated; after the unified training model interface Train is called to generate the core lexicon and the N-gram core lexicon, an external algorithm model is imported through the unified model access interface, the model is updated or exported, the segmentation model file containing the core lexicon and the N-gram lexicon file is saved, and the segmentation training model table is updated; an annotation algorithm comprehensive evaluation model is established to evaluate the annotation effect of the model; through continuous iteration between model updating and corpus annotation, the trained model is used to update the model used for segmentation annotation in the platform, and new segmentation annotation tasks are completed. The segmentation annotation model effect evaluation module constructs single-indicator algorithms according to the indicator standards, quantifies the indicators according to the indicator calculation rules, organizes the corresponding indicators according to the annotation task to construct the annotation algorithm comprehensive evaluation model, computes the comprehensive indicator value, and feeds back the annotation model effect.
The text corpus annotation preparation module manages the corpora to be annotated by source or topic and prepares for annotation tasks. The semi-automatic corpus segmentation annotation module, according to the annotation requirements and corpus characteristics, selects a suitable algorithm and performs automatic annotation, and the annotation results are then reviewed in a manual review step. The specific steps are as follows:
The text corpus annotation preparation module creates segmentation annotation tasks according to corpora from different sources. The semi-automatic corpus segmentation annotation module selects, for each type of annotation task, the algorithm model whose effect suits it best; within a segmentation annotation task, one of the configured algorithms (conditional random field CRF, JIEBA, bidirectional LSTM network BI-LSTM) is selected according to its automatic annotation effect on the corpus and used to complete the automatic annotation. To build a conditional random field (CRF), a set of feature functions is first defined, each taking as input the whole sentence s, the current position i, and the labels at positions i and i-1; each feature function is then assigned a weight; for each label sequence l, the weighted sum of all feature functions is computed and, if necessary, converted into a probability value by normalization. The transition matrix A of the CRF is approximated by the CRF layer of the neural network, and the matrix P, i.e. the emission matrix, is approximated by the Bi-LSTM.
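Given an emission matrix P and a transition matrix A as described above, the highest-scoring label sequence can be found with Viterbi decoding. The sketch below is a generic, pure-Python illustration under the assumption that the score of a label sequence is the sum of its emission and transition entries; the function names and the toy matrices are hypothetical and no trained model is involved.

```python
def sequence_score(emissions, transitions, labels):
    """Sum of emission scores plus transition scores for one label sequence."""
    score = emissions[0][labels[0]]
    for t in range(1, len(labels)):
        score += transitions[labels[t - 1]][labels[t]] + emissions[t][labels[t]]
    return score


def viterbi(emissions, transitions):
    """Return (best label sequence, its score).

    emissions[t][k]  : score of label k at position t (matrix P)
    transitions[j][k]: score of moving from label j to label k (matrix A)
    """
    n, k = len(emissions), len(emissions[0])
    best = [list(emissions[0])]  # best[t][j]: best score ending in label j at t
    back = []                    # back[t-1][j]: previous label on that best path
    for t in range(1, n):
        row, ptr = [], []
        for j in range(k):
            prev = max(range(k), key=lambda i: best[-1][i] + transitions[i][j])
            row.append(best[-1][prev] + transitions[prev][j] + emissions[t][j])
            ptr.append(prev)
        best.append(row)
        back.append(ptr)
    last = max(range(k), key=lambda j: best[-1][j])
    path = [last]
    for ptr in reversed(back):   # follow back-pointers from the end
        path.append(ptr[path[-1]])
    path.reverse()
    return path, best[-1][last]


P = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # 3 positions, 2 labels (toy values)
A = [[0.0, 0.5], [0.5, 0.0]]
path, score = viterbi(P, A)
print(path, score, sequence_score(P, A, path))
```

Converting a score into a probability would additionally require the normalizing sum over all label sequences, which this sketch omits.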
The model learning and training module creates annotation business rules for special annotation tasks and manages these rules; the annotation business rules here mainly comprise business dictionaries and regular expressions.
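As an illustration of how a business dictionary and regular expressions might drive rule-based pre-annotation, the sketch below collects the character spans that the rules would mark as single tokens. The function name, the overlap policy (longest span wins) and the sample patterns are assumptions; the text does not describe how rule matches are resolved.

```python
import re


def rule_spans(text, business_dict, patterns):
    """Return sorted, non-overlapping (start, end) spans matched by the rules."""
    candidates = []
    for pat in patterns:  # regular-expression rules
        for m in re.finditer(pat, text):
            candidates.append((m.start(), m.end()))
    for term in business_dict:  # business-dictionary rules
        start = text.find(term)
        while start != -1:
            candidates.append((start, start + len(term)))
            start = text.find(term, start + 1)
    # Assumed policy: prefer longer spans, then earlier ones; drop overlaps.
    candidates.sort(key=lambda s: (-(s[1] - s[0]), s[0]))
    chosen = []
    for s, e in candidates:
        if all(e <= cs or s >= ce for cs, ce in chosen):
            chosen.append((s, e))
    return sorted(chosen)


text = "订单A-1024于2019年发货"
spans = rule_spans(text, {"发货"}, [r"[A-Z]-\d+", r"\d{4}年"])
print([text[s:e] for s, e in spans])
```

Characters outside the returned spans would be left to the selected segmentation algorithm.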
The feedback-based model learning and training module provides model learning and training and feedback updating for internal and external annotation model algorithms, and uses the annotation business rules to annotate the corpus automatically.
The segmentation annotation model effect evaluation module fuses the automatic annotation results based on the algorithm model with those based on the business rules; on the basis of the fused results, the annotation results are manually corrected, confirmed and saved according to the annotation business standard. Annotators use the text corpus annotation preparation module to select and manage corpora from different sources, which are saved by annotation task as text corpora to be annotated, i.e. raw corpora. In the semi-automatic corpus segmentation annotation module, the corresponding segmentation annotation task is created and a suitable annotation algorithm model is selected; the task corpus is automatically pre-annotated with the selected algorithm model and, because of the particularity of the domain the data belongs to, the relevant business rules are added and rule-based automatic pre-annotation is performed as well; the two kinds of annotation results are fused by voting. Based on the annotation business standard, the fused results are corrected and adjusted in the manual review step and finally saved as an annotated segmentation corpus, which supplies the corpus needed for feedback-based model learning and training.
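The text states that results are fused by voting but does not say what is voted on. One plausible reading, sketched below as an assumption, is a majority vote over word boundaries: each candidate segmentation proposes cut positions, and a cut is kept when more than half of the candidates agree on it. The function names are hypothetical.

```python
from collections import Counter


def boundaries(words):
    """Cut positions (character offsets) between consecutive words."""
    cuts, pos = set(), 0
    for w in words[:-1]:
        pos += len(w)
        cuts.add(pos)
    return cuts


def vote_fusion(segmentations):
    """Fuse several segmentations of the same text by majority vote on cuts."""
    text = "".join(segmentations[0])
    if any("".join(seg) != text for seg in segmentations):
        raise ValueError("all segmentations must cover the same text")
    votes = Counter(c for seg in segmentations for c in boundaries(seg))
    # Keep a cut only if a strict majority of the segmenters proposed it.
    keep = sorted(c for c, n in votes.items() if n * 2 > len(segmentations))
    edges = [0] + keep + [len(text)]
    return [text[a:b] for a, b in zip(edges, edges[1:])]


candidates = [
    ["研究", "生命", "的", "起源"],      # e.g. output of one algorithm
    ["研究生", "命", "的", "起源"],      # e.g. output of another algorithm
    ["研究", "生命", "的", "起", "源"],  # e.g. output of a third source
]
print(vote_fusion(candidates))
```

With an even number of voters a tie discards the cut under this rule; any cut that remains doubtful is exactly what the manual review step is meant to catch.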
The models used by the semi-automatic corpus segmentation annotation module are trained and updated by the feedback-based model learning and training module. Specifically: feedback-based learning and training can be applied to an existing model already used for annotation, or to an external deep enhanced model; the parameters of the segmentation annotation model are set; and the annotated corpus required for segmentation model training is selected and model learning and training is carried out.
Refer to Fig. 2, which shows the detailed workflow of the semi-automatic word segmentation corpus annotation and training device. In the segmentation model training process: the model training user selects, through the model corpus selection module, the corpus to be used for segmentation model training, selects CRF, JIEBA or BI-LSTM segmentation algorithm training, and calls the segmentation training model interface Train to generate the core lexicon and the N-gram core lexicon so that the model accuracy is as high as possible. It is then decided whether to save the segmentation model. Trainable algorithms such as CRF and BI-LSTM are trained offline with the annotated corpus data; external algorithm models are imported through the unified segmentation training model access interface; the model is updated or exported; the segmentation model file containing the core lexicon and the N-gram lexicon file is saved; and the segmentation training model table is updated. After the segmentation model has been updated, the word segmentation service is started, CRF, JIEBA or BI-LSTM segmentation algorithm training is selected, and a new-segmentation switch added to the configuration file determines whether the segmentation model is to be updated: if so, the specified segmentation model is read and its model name is obtained; otherwise, the segmentation training model table is read, the core lexicon built into the algorithm is read, the lexicons are merged, and the segmentation training model table is updated to refine the model. The updated annotation model is fed back to the built-in core lexicon of the algorithm in the semi-automatic corpus segmentation annotation module, the trained model replaces the model used for segmentation annotation in the platform, and new segmentation annotation tasks are completed.
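The text does not describe what the Train interface computes. As a rough illustration only, the sketch below assumes that the core lexicon is a word-frequency table and that the N-gram core lexicon is a table of adjacent word-pair counts (N = 2), both derived from an already segmented corpus. The function `train` is a hypothetical stand-in, not the device's actual interface.

```python
from collections import Counter


def train(corpus):
    """Build a word-frequency lexicon and a bigram lexicon.

    corpus: iterable of sentences, each a list of already segmented words.
    Returns (core_lexicon, bigram_lexicon) as Counter objects.
    """
    core, bigram = Counter(), Counter()
    for sentence in corpus:
        core.update(sentence)
        # Sentence boundary markers let the bigram table model start and end.
        padded = ["<s>"] + list(sentence) + ["</s>"]
        bigram.update(zip(padded, padded[1:]))
    return core, bigram


corpus = [["研究", "生命", "的", "起源"], ["生命", "的", "意义"]]
core, bigram = train(corpus)
print(core["生命"], bigram[("生命", "的")])
```

Because both tables are plain counters, merging a newly trained lexicon into an algorithm's built-in lexicon could be done by adding the counters together, which is one simple way to realize the lexicon merge described above.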
The segmentation annotation model effect evaluation module provides methods for building model evaluation indicators, such as indicator standards, construction rules and indicator quantification, and supports evaluating the annotation effect of a model by automatically constructing an annotation algorithm comprehensive evaluation model. The specific steps are as follows: the module constructs single-indicator algorithms according to the indicator standards; it quantifies the indicators according to the indicator calculation rules and, according to the annotation task, organizes the corresponding indicators to construct the annotation algorithm comprehensive evaluation model; it then computes the comprehensive indicator value and feeds back the annotation model effect.
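How the individual indicators are combined into a comprehensive value is not specified. A weighted average with task-specific weights is one simple possibility, sketched below purely as an assumption; the indicator names, the weights and the function name are all hypothetical.

```python
def comprehensive_score(indicators, weights):
    """Weighted average of the indicators selected for an annotation task.

    indicators: mapping of indicator name -> quantified value in [0, 1]
    weights   : mapping of indicator name -> non-negative weight; only the
                indicators named here take part in the evaluation
    """
    missing = set(weights) - set(indicators)
    if missing:
        raise KeyError(f"missing indicator values: {sorted(missing)}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(indicators[name] * w for name, w in weights.items()) / total


indicators = {"precision": 0.9, "recall": 0.8, "f_measure": 0.85}
task_weights = {"precision": 1, "recall": 1, "f_measure": 2}
print(round(comprehensive_score(indicators, task_weights), 4))
```

Different annotation tasks could supply different `task_weights`, which corresponds to organizing the indicators per task.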
In this embodiment, the basic evaluation indicators for the segmentation annotation of corpora annotated by the device include segmentation precision (Precision), segmentation recall (Recall), the F-measure, overlapping-ambiguity accuracy, combinational-ambiguity accuracy and multi-category word tagging accuracy. They are defined as follows:
where F denotes the F-measure, i.e. the harmonic mean of precision and recall, F = 2PR / (P + R); P denotes precision and R denotes recall.
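For word segmentation, precision and recall are usually computed over word spans: a predicted word counts as correct only if both of its boundaries agree with the reference annotation. The sketch below follows that common convention, which is an assumption here since the definitions themselves are not reproduced in the text; the function names are hypothetical.

```python
def to_spans(words):
    """Map a segmentation to its set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans


def prf(gold_words, pred_words):
    """Return (precision, recall, F) of a predicted segmentation."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean of P and R
    return p, r, f


gold = ["研究", "生命", "的", "起源"]
pred = ["研究生", "命", "的", "起源"]
print(prf(gold, pred))
```

In this example two of the four predicted words ("的" and "起源") match the reference, so precision, recall and F are all 0.5.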
Precision and recall generally trade off against each other: raising precision by some method tends to lower recall, and vice versa. To reflect an application system's different requirements for precision and recall, the two can be combined with a weight, which gives the E value:
where b is the added weight: the larger b is, the greater the weight of precision in the E value; conversely, the greater the weight of recall.
Segmentation ambiguity is another difficulty for automatic Chinese word segmentation algorithms. To examine an algorithm's ability to resolve ambiguity, the ambiguous parts are evaluated with separate indicators. Specifically, for the two different ambiguity types, overlapping ambiguity and combinational ambiguity, accuracy is defined separately as follows:
Like segmentation ambiguity, part-of-speech tagging has its own "ambiguous words", called multi-category words: a word that has two or more different parts of speech is a multi-category word. Tagging multi-category words is clearly the key difficulty of part-of-speech tagging, so a dedicated indicator, the multi-category word tagging accuracy, is defined to examine it, as follows:
Corpora to be annotated are managed by source or topic, which prepares them for annotation tasks. By integrating multiple word segmentation algorithms such as CRF, JIEBA, and BI-LSTM, the device performs semi-automatic annotation of the segmentation corpus: a suitable annotation algorithm can be selected during annotation and used to pre-annotate the corpus data. In a final manual confirmation step, the annotated corpus is modified, confirmed, and submitted, completing the annotation work. After an annotation task is finished, the model is retrained on the annotated corpus. A comprehensive evaluation model for the annotation algorithms assesses the model's annotation quality and feeds the result back into model learning and training, so that the model reaches its best performance and can be used for subsequently added annotation tasks. Continuous iteration between model updating and corpus annotation improves both the quality of the annotated corpus and the performance of the algorithm model.
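The loop of pre-annotation, manual confirmation, and retraining described above can be sketched in Python. This is a toy illustration under stated assumptions, not the device's implementation: a greedy dictionary segmenter stands in for the CRF, JIEBA, and BI-LSTM algorithms, "retraining" simply adds confirmed words to the dictionary, and all function names (`forward_max_match`, `pre_annotate`, `confirm`, `retrain`) are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy longest-match segmentation (stand-in for a real model).

    Characters not covered by the dictionary become one-character words.
    """
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def pre_annotate(corpus, dictionary):
    """Automatic pre-annotation of every sentence in the raw corpus."""
    return {sent: forward_max_match(sent, dictionary) for sent in corpus}


def confirm(pre_annotations, corrections):
    """Manual confirmation: a reviewer's correction overrides the pre-annotation."""
    return {s: corrections.get(s, words) for s, words in pre_annotations.items()}


def retrain(dictionary, confirmed):
    """Toy 'retraining': add confirmed multi-character words to the dictionary."""
    return dictionary | {w for words in confirmed.values() for w in words if len(w) > 1}


dictionary = {"语料", "标注"}

# Round 1: the model does not know "分词", so the reviewer corrects it.
round1 = pre_annotate(["分词语料标注"], dictionary)
confirmed = confirm(round1, {"分词语料标注": ["分词", "语料", "标注"]})

# Feedback: the confirmed corpus updates the model used in the next round.
dictionary = retrain(dictionary, confirmed)
round2 = pre_annotate(["分词标注"], dictionary)
print(round1, round2)
```

In the second round the pre-annotation already contains the word the reviewer supplied in the first round, which is the effect the iteration between model updating and corpus annotation is meant to achieve.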
The above is a preferred embodiment of the present invention. It should be noted that the embodiment illustrates the invention rather than limiting it, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. Those of ordinary skill in the art may make various changes and modifications without departing from the spirit and substance of the invention, and such changes and modifications are also regarded as falling within the scope of protection of the invention.

Claims (10)

1. A semi-automatic word segmentation corpus annotation training device, comprising: a text corpus annotation preparation module, a semi-automatic corpus segmentation annotation module, a feedback model learning and training module, and a segmentation annotation model effect evaluation module, characterized in that: the text corpus annotation preparation module prepares for annotation tasks by distinguishing data from different sources and selecting corpus sources, applies a single segmentation pre-annotation pass to the corpus data to be annotated by source or topic, manages the corpus to be annotated and the segmented corpus, and then submits the raw-corpus segmentation annotation work to the semi-automatic corpus segmentation annotation module through multiple segmentation algorithms, namely bidirectional maximum matching segmentation based on an integrated dictionary, conditional random field (CRF), JIEBA Chinese segmentation, and bidirectional LSTM network (BI-LSTM); the semi-automatic corpus segmentation annotation module creates segmentation annotation tasks according to different annotation requirements and corpus characteristics, selects a suitable annotation algorithm model, performs automatic annotation under annotation business rule management, completes the automatic annotation of each class of annotation task using the business rules and one segmentation algorithm model selected from integrated-dictionary bidirectional maximum matching, CRF, JIEBA, and BI-LSTM, and fuses the automatic annotation results based on the algorithm model with the automatic annotation results based on the business rules; on the basis of the fused automatic annotation results, manual intervention and adjudication are carried out according to the annotation business standard, the annotation results are saved, and the model training corpus generated by the text corpus annotation preparation module and the annotation model are fed back to the feedback model learning and training module, which, based on the existing model and an external deep enhancement model, loads the model and performs parameter setting, model corpus selection, and model learning and training, returning to model parameter setting after the model has been improved and updated; after the unified training model interface Train has been called to generate the core dictionary and the N-gram core dictionary, external algorithm models are imported through a unified model access interface, the model is updated or exported, the segmentation model file containing the core dictionary and N-gram dictionary files is saved, and the segmentation training model table is updated; a comprehensive evaluation model for the annotation algorithms is established to assess the model's annotation effect; and, through continuous iteration between model updating and corpus annotation, the trained model is used to update the model used for segmentation annotation in the platform and to complete new segmentation annotation tasks.
2. The semi-automatic word segmentation corpus annotation training device according to claim 1, characterized in that: the segmentation annotation model effect evaluation module builds single-index algorithms according to the annotation evaluation criteria, quantifies each index according to its calculation rule, organizes the corresponding indices according to the different annotation tasks to construct a comprehensive evaluation model for the annotation algorithms, computes the combined index value, and feeds back the effect of the annotation model.
3. The semi-automatic word segmentation corpus annotation training device according to claim 1, characterized in that: the semi-automatic corpus segmentation annotation module independently selects a suitable algorithm according to different annotation requirements and corpus characteristics and performs automatic annotation, and provides intervention in and adjudication of the annotation results through a manual adjudication step.
4. The semi-automatic word segmentation corpus annotation training device according to claim 1, characterized in that: the text corpus annotation preparation module creates segmentation annotation tasks according to corpora from different sources; the semi-automatic corpus segmentation annotation module selects, for each class of annotation task, the algorithm model whose effect suits it best, and, within a segmentation annotation task, configures and selects among the CRF, JIEBA, and BI-LSTM segmentation algorithms according to their automatic annotation effect on the corpus to complete automatic annotation.
5. The semi-automatic word segmentation corpus annotation training device according to claim 1, characterized in that: the model learning and training module creates business annotation rules for special annotation tasks and manages the annotation business rules, which include business dictionaries and regular expressions; the feedback model learning and training module provides model learning, training, and feedback updating capabilities for internal and external annotation model algorithms, and automatically annotates the corpus using the annotation business rules.
6. The semi-automatic word segmentation corpus annotation training device according to claim 1, characterized in that: the segmentation annotation model effect evaluation module fuses the automatic annotation results based on the algorithm model with the automatic annotation results based on the business rules; on the basis of the fused automatic annotation results, the annotation results are manually modified, confirmed, and saved according to the annotation business standard.
7. The semi-automatic word segmentation corpus annotation training device according to claim 1, characterized in that: the text corpus annotation preparation module selects and manages corpora from different sources and saves them by annotation task as text corpora to be annotated, i.e. raw corpora; in the semi-automatic corpus segmentation annotation module, a corresponding segmentation annotation task is created and a suitable annotation algorithm model is selected, and automatic pre-annotation of the task corpus is performed based on the selected algorithm model; meanwhile, in view of the particularities of the domain to which the data belong, relevant business rules are formulated and automatic pre-annotation based on the business rules is performed; the two classes of annotation results are fused by a voting method.
8. The semi-automatic word segmentation corpus annotation training device according to claim 1, characterized in that: the semi-automatic word segmentation corpus annotation module performs model training and updating through the feedback model learning and training module, which carries out feedback model learning and training on the existing model used in annotation, or uses an external deep enhancement model for feedback model learning and training; sets the segmentation annotation model parameters; and selects the annotated corpus needed for segmentation model training and carries out model learning and training.
9. The semi-automatic word segmentation corpus annotation training device according to claim 1, characterized in that, in the segmentation model training process flow: a model corpus selection module selects the corpus used for segmentation model training, selects the CRF, JIEBA, or BI-LSTM segmentation algorithm for training, and calls the segmentation training model interface Train to generate the core dictionary and the N-gram core dictionary so that model accuracy is maximized; it is determined whether to save the segmentation model; the trainable CRF and BI-LSTM algorithms are trained offline on the annotated corpus data; external algorithm models are imported through the unified segmentation training model access interface, the model is updated or exported, the segmentation model file containing the core dictionary and N-gram dictionary files is saved, and the segmentation training model table is updated.
10. The semi-automatic word segmentation corpus annotation training device according to claim 9, characterized in that: after the segmentation model is updated, the segmentation service is started, the CRF, JIEBA, or BI-LSTM segmentation algorithm is selected for training, and a new segmentation switch is added to the configuration file; it is determined whether the segmentation model has been updated: if so, the specified segmentation model is read and the segmentation model name is obtained; otherwise, the segmentation training model table and the algorithm's built-in core dictionary are read; the dictionaries are merged, the segmentation training model table is updated and the model is improved, the updated annotation model is fed back to the algorithm's built-in core dictionary of the semi-automatic word segmentation corpus annotation module, and the trained model is used to update the model used for segmentation annotation in the platform, completing new segmentation annotation tasks.
CN201910455093.XA 2019-05-29 2019-05-29 Semi-automatic participle corpus labeling training device Active CN110287482B (en)


Publications (2)

Publication Number Publication Date
CN110287482A (en) 2019-09-27
CN110287482B (en) 2022-07-08

Family

ID=68002801


Country Status (1)

Country Link
CN (1) CN110287482B (en)







Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant