CN107329951A - Method, device, storage medium and computer equipment for building a named entity annotation resource library

Info

Publication number: CN107329951A
Application number: CN201710447680.5A
Authority: CN
Inventors: 秦兴德; 秦祎晗; 刘奕慧; 郭玮
Original assignee: Shenzhen Dingfeng Cattle Technology Co Ltd
Current assignee: Shenzhen Dingfeng Cattle Technology Co Ltd
Priority date: 2017-06-14
Filing date: 2017-06-14
Publication date: 2017-11-07

Abstract

The present invention relates to a method, device, storage medium and computer equipment for building a named entity annotation resource library. A small seed library is combined with unlabeled texts drawn from an unlabeled text set to form the resource library for the current iteration. The average utility value of each named entity in the unlabeled texts is calculated and used to generate the seed library for the next iteration; that seed library is then combined with further unlabeled texts to form the resource library for the next iteration, from which the following seed library is computed in turn. The computation continues in this way until all unlabeled texts have been processed, so that new named entities are discovered and a named entity annotation resource library is generated. The method is simple to compute, yields results with high confidence, and is suited to processing large-scale text. Text data is unstructured, and evaluating results on unstructured data is generally difficult; this method makes it possible to evaluate the named entities in text quantitatively.

Description

Method, device, storage medium and computer equipment for building a named entity annotation resource library

Technical field

The present invention relates to the technical field of information processing, and in particular to a method, device, storage medium and computer equipment for building a named entity annotation resource library.

Background art

A named entity is a person name, organization name, place name or any other entity identified by a name; in the broader sense, named entities also include numbers, dates, currencies, addresses and so on. Named entity recognition (NER) is one of the basic technologies of natural language processing and plays an important role in improving the performance of many natural language processing applications. At present NER mainly relies on statistical models such as the hidden Markov model (HMM) and the conditional random field (CRF). Such models need a large annotated resource library as a training set, and manually annotated libraries such as the People's Daily corpus are commonly used. The resources in these manually annotated libraries are very limited and are insufficient for large-scale application scenarios such as machine translation. Moreover, as society develops, new named entities keep appearing, such as organization names, film titles, product names and book titles, so manually annotated resource libraries fall far short of the needs of named entity recognition. Building and maintaining a named entity annotation resource library is therefore central to many natural language processing applications (such as search systems and machine translation systems).

Summary of the invention

In view of this, it is necessary to provide a method, device, storage medium and computer equipment for building a named entity annotation resource library that address the above technical problem.

A method for building a named entity annotation resource library, the method comprising:

obtaining a labeled text set as the seed library for the current iteration, the labeled text set comprising labeled texts;

obtaining an unlabeled text set comprising unlabeled texts, and selecting a preset number of unlabeled texts from the unlabeled text set to form, together with the seed library, the resource library for the current iteration;

calculating the average utility value of each named entity in the unlabeled texts;

sorting the average utility values in descending order and taking a preset number of top-ranked named entities as candidate words;

selecting the texts that contain a candidate word and have the largest utility value and adding them to the seed library to form the seed library for the next iteration, then selecting a preset number of unlabeled texts from the unlabeled text set to form, together with that seed library, the resource library for the next iteration, and repeating until all unlabeled texts in the unlabeled text set have been iterated over, thereby obtaining an annotation resource library;

scoring the candidate words in the annotation resource library;

obtaining the texts that contain candidate words whose scores exceed a set threshold, and taking the set formed by these texts as the named entity annotation resource library.

In one embodiment, calculating the average utility value of each named entity in the unlabeled texts comprises:

performing word segmentation on the unlabeled texts in the resource library to obtain segmented unlabeled texts;

training a conditional random field (CRF) model on the labeled texts in the resource library to obtain a prediction model, predicting the annotation sequences of the unlabeled texts in the resource library with the prediction model, and obtaining from those annotation sequences the optimal and suboptimal annotation sequences together with their conditional probabilities;

for each unlabeled text, calculating the utility value of each named entity in that text from the conditional probabilities by means of a utility evaluation function;

obtaining the utility value of each named entity in every unlabeled text that contains it, and calculating the average utility value of each named entity from these utility values.

In one embodiment, before obtaining the labeled text set as the seed library for the current iteration, the method further comprises:

collecting text information;

selecting a preset number of texts from the collected text information and annotating the named entities in them to generate the labeled text set, the remaining unlabeled texts in the collected text information forming the unlabeled text set.

In one embodiment, the utility evaluation function is

$$\phi(x) = 1 - \Bigl( P(y_1^* \mid x; \theta) - (1 - \lambda)\, P(y_2^* \mid x; \theta) \Bigr)$$

where $y_1^*$ is the optimal annotation sequence of $x$, $y_2^*$ is the suboptimal annotation sequence of $x$, $\theta$ denotes the model parameters, $0 \le \lambda \le 1$ is an adjustment factor, $P(y_1^* \mid x; \theta)$ is the conditional probability of the optimal annotation sequence of $x$, $P(y_2^* \mid x; \theta)$ is the conditional probability of the suboptimal annotation sequence of $x$, and $x$ is a text annotation sequence sample.

In one embodiment, the average utility is calculated as

$$\bar{\phi}(t) = \frac{1}{|X_t|} \sum_{x_t \in X_t} \phi(x_t)$$

where $X_t$ is the set of samples containing the entity candidate word $t$, $|X_t|$ is the number of samples containing $t$, $\bar{\phi}(t)$ is the average utility value of $t$ over the sample set $X_t$, and $x_t$ is a text annotation sequence sample containing $t$.

A device for building a named entity annotation resource library, the device comprising:

a seed library acquisition module, configured to obtain a labeled text set as the seed library for the current iteration, the labeled text set comprising labeled texts;

a resource library acquisition module, configured to obtain an unlabeled text set comprising unlabeled texts, and to select a preset number of unlabeled texts from the unlabeled text set to form, together with the seed library, the resource library for the current iteration;

an average utility value calculation module, configured to calculate the average utility value of each named entity in the unlabeled texts;

a named entity candidate word acquisition module, configured to sort the average utility values in descending order and take a preset number of top-ranked named entities as candidate words;

an annotation resource library generation module, configured to select the texts that contain a candidate word and have the largest utility value and add them to the seed library to form the seed library for the next iteration, then to select a preset number of unlabeled texts from the unlabeled text set to form, together with that seed library, the resource library for the next iteration, and to repeat until all unlabeled texts in the unlabeled text set have been iterated over, thereby obtaining an annotation resource library;

a candidate word scoring module, configured to score the candidate words in the annotation resource library;

a named entity annotation resource library generation module, configured to obtain the texts that contain candidate words whose scores exceed a set threshold, and to take the set formed by these texts as the named entity annotation resource library.

In one embodiment, the average utility value calculation module comprises:

a word segmentation module, configured to perform word segmentation on the unlabeled texts in the resource library to obtain segmented unlabeled texts;

a conditional probability calculation module, configured to train a conditional random field (CRF) model on the labeled texts in the resource library to obtain a prediction model, to predict the annotation sequences of the unlabeled texts in the resource library with the prediction model, and to obtain from those annotation sequences the optimal and suboptimal annotation sequences together with their conditional probabilities;

a utility value calculation module, configured to calculate, for each unlabeled text, the utility value of each named entity in that text from the conditional probabilities by means of the utility evaluation function;

an average utility value acquisition module, configured to obtain the utility value of each named entity in every unlabeled text that contains it, and to calculate the average utility value of each named entity from these utility values.

In one embodiment, the device further comprises:

a text information collection module, configured to collect text information;

a text information classification module, configured to select a preset number of texts from the collected text information and annotate the named entities in them to generate the labeled text set, the remaining unlabeled texts in the collected text information forming the unlabeled text set.

A computer-readable storage medium on which a computer program is stored, the program implementing the following steps when executed by a processor:

scoring the candidate words in the annotation resource library;

A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:

scoring the candidate words in the annotation resource library;

In the above method, device, storage medium and computer equipment for building a named entity annotation resource library, the labeled text set serves as the seed library for the current iteration, and a preset number of unlabeled texts from the unlabeled text set is combined with the seed library to form the resource library for the current iteration. The average utility value of each named entity in the unlabeled texts is calculated, the average utility values are sorted in descending order, and a preset number of top-ranked named entities is taken as candidate words. The texts that contain a candidate word and have the largest utility value are then added to the seed library to form the seed library for the next iteration, and a preset number of unlabeled texts is again selected from the unlabeled text set to form, together with that seed library, the resource library for the next iteration. This repeats until all unlabeled texts in the unlabeled text set have been iterated over, which yields the annotation resource library. Finally, the candidate words in the annotation resource library are scored, the texts containing candidate words whose scores exceed a set threshold are obtained, and the set formed by these texts is taken as the named entity annotation resource library. The invention thus starts from a small seed library, grows it round by round with unlabeled texts until every unlabeled text has been processed, discovers new named entities along the way, and generates the named entity annotation resource library. The method is simple to implement, fast, and deployable at scale; it allows the named entity annotation resource library to be expanded without a fixed upper limit and can serve a variety of application scenarios.

Brief description of the drawings

Fig. 1 is a flowchart of the method for building a named entity annotation resource library in one embodiment;

Fig. 2 is a flowchart of the method for building a named entity annotation resource library in one embodiment;

Fig. 3 is a flowchart of the method for building a named entity annotation resource library in one embodiment;

Fig. 4 is a structural diagram of the device for building a named entity annotation resource library in one embodiment;

Fig. 5 is a structural diagram of the average utility value calculation module in Fig. 4;

Fig. 6 is a structural diagram of the device for building a named entity annotation resource library in one embodiment.

Detailed description of the embodiments

To make the objects, features and advantages of the present invention easier to understand, specific embodiments of the invention are described in detail below with reference to the accompanying drawings. Many specific details are set forth in the following description to provide a thorough understanding of the invention. The invention can, however, be implemented in many ways other than those described here, and those skilled in the art can make similar improvements without departing from its spirit; the invention is therefore not limited to the specific embodiments disclosed below.

Unless otherwise defined, all technical and scientific terms used herein have the meanings commonly understood by those skilled in the technical field of the invention. The terms used in this description serve only to describe specific embodiments and are not intended to limit the invention. The technical features of the embodiments may be combined arbitrarily. For brevity, not every possible combination of these technical features is described; however, any combination that involves no contradiction should be considered within the scope of this specification.

In one embodiment, as shown in Fig. 1, a method for building a named entity annotation resource library is provided, comprising:

Step 110: obtain a labeled text set as the seed library for the current iteration, the labeled text set comprising labeled texts.

First, a crawler program collects text from the Internet, such as news and comments, as the source material library. Part of the text in the source material library is then selected and its named entities are annotated manually. Annotating only a small amount of text by hand saves labor cost; the annotated texts form the labeled text set. For example, if the source material library holds 1000 texts, 100 of them are selected for manual annotation. The labeled text set serves as the seed library for the current iteration. Manual annotation means marking which type of named entity each named-entity word in the text belongs to. For example, manually annotating the sentence "Xiaoniu Online was founded in June 2013." gives: (Xiaoniu Online, organization name) was founded in (June 2013, time). That is, "Xiaoniu Online" is labeled "organization name" and "June 2013" is labeled "time". The source material library may of course hold a different number of texts.

Step 120: obtain an unlabeled text set comprising unlabeled texts, and select a preset number of unlabeled texts from the unlabeled text set to form, together with the seed library, the resource library for the current iteration.

Removing the labeled text set from the source material library leaves the unlabeled text set. A preset number of unlabeled texts is selected from the unlabeled text set and, together with the seed library, forms the resource library for the current iteration. For example, of the 1000 texts in the source material library, 100 have been manually annotated and form the seed library, and the remaining 900 form the unlabeled text set. In the current round, 1/9 of these 900 unlabeled texts, i.e. 100 texts, is selected and forms the resource library for the current iteration together with the seed library. Other proportions may of course be chosen.
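The 1000/100/900 example can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_round_pool`, the dictionary layout of the pool and the use of random sampling are assumptions, since the text only says that a preset number of unlabeled texts is "selected".

```python
import random


def build_round_pool(seed_library, unlabeled_texts, batch_size, rng=None):
    """Form one round's resource library: the seed library plus a batch of unlabeled texts.

    Returns (pool, remaining), where `remaining` holds the unlabeled texts
    that were not drawn in this round. Random sampling is an assumption.
    """
    rng = rng or random.Random(0)
    k = min(batch_size, len(unlabeled_texts))
    # Sample positions rather than values so duplicate texts are handled correctly.
    picked = set(rng.sample(range(len(unlabeled_texts)), k))
    batch = [t for i, t in enumerate(unlabeled_texts) if i in picked]
    remaining = [t for i, t in enumerate(unlabeled_texts) if i not in picked]
    pool = {"labeled": list(seed_library), "unlabeled": batch}
    return pool, remaining


# Placeholder corpus mirroring the numbers in the example above.
source_library = [f"text {i}" for i in range(1000)]
seed_library, unlabeled_set = source_library[:100], source_library[100:]
pool, remaining = build_round_pool(seed_library, unlabeled_set, batch_size=100)
```

After the call, the pool holds the 100 seed texts and 100 unlabeled texts, and 800 unlabeled texts remain for later rounds.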

Step 130: calculate the average utility value of each named entity in the unlabeled texts.

First, word segmentation is performed on the unlabeled texts in the resource library to obtain segmented unlabeled texts. Methods such as maximum matching or the hidden Markov model (HMM) may be used for the segmentation. For example, segmenting the text "Xiaoniu Online ranks first in South China." yields the word sequence "Xiaoniu Online / ranks / first / in / South China".

Next, a conditional random field (CRF) model is trained on the labeled texts in the resource library to obtain a prediction model. The prediction model predicts the annotation sequences of the unlabeled texts in the resource library, and the optimal and suboptimal annotation sequences, together with their conditional probabilities, are obtained from the predicted annotation sequences. The conditional random field is one of the algorithms commonly used in natural language processing in recent years, for tasks such as syntactic analysis, named entity recognition and part-of-speech tagging. The trained CRF model is applied to each unlabeled text in the resource library for the current iteration to obtain its annotation sequences. The optimal and suboptimal annotation sequences of each unlabeled text are then obtained, and their conditional probabilities are calculated.

Then, for each unlabeled text, the utility value of each named entity in that text is calculated from the conditional probabilities by means of the utility evaluation function. Finally, the utility value of each named entity in every unlabeled text that contains it is obtained, and the average utility value of each named entity is calculated from these utility values.

Step 140: sort the average utility values in descending order and take a preset number of top-ranked named entities as candidate words.

The calculated average utility values are sorted in descending order, and the first preset number of named entities is taken as the named entity candidate words. For example, the top 10 named entities may be taken as this round's candidate words, such as "Xiaoniu Online, Tsinghua University, Baidu, Alibaba, DJI, unmanned aerial vehicle, intelligent robot, glasses, cosmetics, RMB".
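The ranking step amounts to a descending sort followed by a cut-off. A minimal sketch, in which the function name `top_candidates` and all utility numbers are hypothetical placeholders rather than values from the source:

```python
def top_candidates(avg_utility, k):
    """Return the k named entities with the largest average utility value."""
    ranked = sorted(avg_utility.items(), key=lambda item: item[1], reverse=True)
    return [entity for entity, _ in ranked[:k]]


# Hypothetical average utility values for four entities.
avg_utility = {
    "Xiaoniu Online": 0.20,
    "South China": 0.155,
    "glasses": 0.30,
    "RMB": 0.05,
}
candidates = top_candidates(avg_utility, k=2)
```

With these placeholder numbers the two candidates are "glasses" and "Xiaoniu Online", in that order.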

Step 150: select the texts that contain a candidate word and have the largest utility value and add them to the seed library to form the seed library for the next iteration; then select a preset number of unlabeled texts from the unlabeled text set to form, together with that seed library, the resource library for the next iteration; repeat until all unlabeled texts in the unlabeled text set have been iterated over, which yields the annotation resource library.

For each named entity candidate word, the texts containing it are selected from the resource library of the current iteration, and among these the text in which the candidate word has the largest utility value is chosen. The maximum-utility text of each candidate word is added to the seed library, which becomes the seed library for the next iteration. A preset number of unlabeled texts is then selected from the unlabeled text set and combined with this seed library to form the resource library for the next iteration, and so on until all unlabeled texts in the unlabeled text set have been iterated over, which yields the annotation resource library. Because the seed library is expanded with unlabeled text from the Internet, the named entity annotation resource library can be enlarged without a fixed upper limit and can serve a variety of application scenarios.

For example, if 800 unlabeled texts remain, the next iteration selects another 100 from these 800 and combines them with the seed library obtained in the previous round to form the resource library for that iteration. The conditional probabilities, utility values and average utility values are then calculated, and the texts that contain candidate words and have the largest utility values are added to the seed library to form the seed library for the following iteration. Another 100 texts are then selected from the remaining 700 unlabeled texts and combined with the seed library obtained in the previous round. The loop continues in this way until all remaining unlabeled texts have been iterated over; the final result is the annotation resource library.
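The round-by-round loop can be sketched as below. This is an illustrative skeleton under stated assumptions: `build_annotation_library` and `toy_utility` are hypothetical names, batches are taken in order rather than chosen by any particular policy, and `utility_fn` stands in for the whole "train CRF, predict, evaluate utility" stage, which is not implemented here.

```python
def build_annotation_library(seed, unlabeled, batch_size, top_k, utility_fn):
    """Grow the seed library round by round until no unlabeled text remains.

    utility_fn(seed, text) -> {entity: utility} is a hypothetical stand-in for
    CRF training, prediction and utility evaluation on one unlabeled text.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    seed = list(seed)
    remaining = list(unlabeled)
    candidates_seen = set()
    rounds = 0
    while remaining:
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        rounds += 1
        per_text = [utility_fn(seed, text) for text in batch]
        # Average utility of each entity over the texts that contain it.
        grouped = {}
        for utilities in per_text:
            for entity, value in utilities.items():
                grouped.setdefault(entity, []).append(value)
        average = {e: sum(v) / len(v) for e, v in grouped.items()}
        candidates = sorted(average, key=average.get, reverse=True)[:top_k]
        # For each candidate, add the text where its utility is largest.
        for entity in candidates:
            best = max(
                (i for i, u in enumerate(per_text) if entity in u),
                key=lambda i: per_text[i][entity],
            )
            if batch[best] not in seed:
                seed.append(batch[best])
        candidates_seen.update(candidates)
    return seed, candidates_seen, rounds


def toy_utility(seed, text):
    # Toy stand-in: every capitalised token counts as an entity with utility 0.5.
    return {tok: 0.5 for tok in text.split() if tok[0].isupper()}


library, found, rounds = build_annotation_library(
    ["Tsinghua opens"],
    ["Baidu hires", "Alibaba grows", "nothing here", "Baidu expands"],
    batch_size=2,
    top_k=1,
    utility_fn=toy_utility,
)
```

In this toy run the four unlabeled texts are consumed in two rounds, and each round adds the one text in which the top candidate has the largest utility.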

Step 160: score the candidate words in the annotation resource library.

The named entity candidate words in the annotation resource library are scored with a scoring formula, based on actual named entity recognition, to obtain the scoring results. The scoring formula is:

$$S(t) = \frac{N_t^{\mathrm{NE}}}{N_t}$$

where $N_t^{\mathrm{NE}}$ is the frequency with which the entity candidate word $t$ is recognized as a named entity in the samples, and $N_t$ is the total frequency with which $t$ occurs in the corpus. The corpus comprises a named entity part and a common word part: the named entity part is the part of the corpus regarded as named entities, and the common word part is the part not regarded as named entities. Regarding the corpus: in statistical natural language processing it is practically impossible to observe language instances at full scale, so a collection of texts is used instead. One such collection is called a corpus, and several such collections together are called corpora.

Step 170: obtain the texts that contain candidate words whose scores exceed a set threshold, and take the set formed by these texts as the named entity annotation resource library.

A threshold is set for the scores, the scores are sorted in descending order, and the named entity candidate words whose scores exceed the threshold are obtained; the texts containing these candidate words are then retrieved from the annotation resource library. The set formed by these texts is the named entity annotation resource library.
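A minimal sketch of scoring and threshold filtering follows. It assumes that the score is the ratio of the two frequencies defined for the scoring formula (occurrences recognized as a named entity divided by total occurrences); the function names, the frequency counts and the substring test used to find texts containing a candidate word are all illustrative assumptions.

```python
def score_candidates(ne_freq, total_freq):
    """Score each candidate as (times recognised as an entity) / (total occurrences).

    The ratio form is an assumption about the scoring formula.
    """
    return {
        term: ne_freq.get(term, 0) / total
        for term, total in total_freq.items()
        if total > 0
    }


def select_texts(texts, scores, threshold):
    """Keep candidates scoring above the threshold and the texts that contain them."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    kept = [term for term, score in ranked if score > threshold]
    selected = [text for text in texts if any(term in text for term in kept)]
    return kept, selected


# Hypothetical frequency counts for two candidate words.
ne_freq = {"Xiaoniu Online": 9, "glasses": 2}
total_freq = {"Xiaoniu Online": 10, "glasses": 20}
scores = score_candidates(ne_freq, total_freq)

texts = [
    "Xiaoniu Online ranks first in South China.",
    "She bought new glasses.",
]
kept, selected = select_texts(texts, scores, threshold=0.5)
```

A word such as "glasses", which mostly occurs as a common word, receives a low score and is filtered out, while a word that is nearly always recognized as an entity is kept along with its texts.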

In this embodiment, the labeled text set serves as the seed library for the current iteration, and a preset number of unlabeled texts from the unlabeled text set is combined with the seed library to form the resource library for the current iteration. The average utility value of each named entity in the unlabeled texts is calculated, the average utility values are sorted in descending order, and a preset number of top-ranked named entities is taken as candidate words. The texts that contain a candidate word and have the largest utility value are added to the seed library to form the seed library for the next iteration, and a preset number of unlabeled texts is again selected from the unlabeled text set to form, together with that seed library, the resource library for the next iteration, until all unlabeled texts in the unlabeled text set have been iterated over and the annotation resource library is obtained. Finally, the candidate words in the annotation resource library are scored, the texts containing candidate words whose scores exceed a set threshold are obtained, and the set formed by these texts is taken as the named entity annotation resource library. The invention thus starts from a small seed library, grows it round by round with unlabeled texts until every unlabeled text has been processed, discovers new named entities along the way, and generates the named entity annotation resource library. The method is simple to implement, fast, and deployable at scale; it allows the named entity annotation resource library to be expanded without a fixed upper limit and can serve a variety of application scenarios.

In one embodiment, as shown in Fig. 2, calculating the average utility value of each named entity candidate word over the texts in the resource library that contain it comprises:

Step 131: perform word segmentation on the unlabeled texts in the resource library to obtain segmented unlabeled texts.

Methods such as maximum matching or the hidden Markov model (HMM) may be used to segment the unlabeled texts. Maximum matching is a mechanical segmentation method: following a given strategy, the Chinese character string to be analyzed is matched against the entries of a sufficiently large machine dictionary, and if a string is found in the dictionary the match succeeds and a word is identified. The hidden Markov model has proven very valuable in fields such as speech recognition, natural language processing and bioinformatics, and has long been regarded as one of the most successful methods for building fast and accurate speech recognition and natural language processing systems. For example, segmenting the text "Xiaoniu Online ranks first in South China." yields the word sequence "Xiaoniu Online / ranks / first / in / South China".
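Forward maximum matching, one common variant of the maximum matching method, can be sketched as follows. The function name, the tiny dictionary and the Chinese sentence are assumptions for illustration; in particular the Chinese sentence is an assumed back-translation of the example sentence, not text taken from the source.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Segment `text` by repeatedly taking the longest dictionary word at the cursor.

    Characters that start no dictionary word are emitted as single-character tokens.
    """
    max_len = max_len or max((len(word) for word in dictionary), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest window first and shrink until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Assumed back-translation of "Xiaoniu Online ranks first in South China".
dictionary = {"小牛在线", "在", "华南地区", "排名", "第一"}
tokens = forward_max_match("小牛在线在华南地区排名第一", dictionary)
```

The result is the five-word sequence 小牛在线 / 在 / 华南地区 / 排名 / 第一. A production segmenter would need a far larger dictionary and a strategy for ambiguous matches.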

Step 133: train a conditional random field (CRF) model on the labeled texts in the resource library to obtain a prediction model, predict the annotation sequences of the unlabeled texts in the resource library with the prediction model, and obtain from those annotation sequences the optimal and suboptimal annotation sequences together with their conditional probabilities.

A CRF model is trained on the labeled texts in the resource library to obtain a prediction model, which then predicts the annotation sequences of the unlabeled texts in the resource library. Predicting the annotation sequence of a single unlabeled text can produce several different annotation sequences. From these, the optimal and suboptimal annotation sequences of each text are obtained and their conditional probabilities are calculated: $P(y_1^* \mid x; \theta)$ and $P(y_2^* \mid x; \theta)$, where $y_1^*$ and $y_2^*$ are the optimal and suboptimal annotation sequences respectively, $\theta$ denotes the model parameters, and $x$ is a text annotation sequence sample.

The conditional random field is one of the algorithms commonly used in natural language processing in recent years, for tasks such as syntactic analysis, named entity recognition and part-of-speech tagging. The trained CRF model is applied to each text in the resource library of the current iteration to obtain the annotation sequences of each text.

For example, when the prediction model predicts the annotation sequence of the segmented unlabeled text "Xiaoniu Online ranks first in South China", the possible annotation results and conditional probabilities are:

[(Xiaoniu Online, organization name, 0.9), (South China, place name, 0.89)],

[(Xiaoniu Online, place name, 0.09), (South China, time, 0.02)],

[(Xiaoniu Online, time, 0.01), (South China, organization name, 0.09)], and so on. The optimal annotation for "Xiaoniu Online" is (Xiaoniu Online, organization name, 0.9) and the suboptimal annotation is (Xiaoniu Online, place name, 0.09); that is, $P(y_1^* \mid x; \theta)$ is 0.9 and $P(y_2^* \mid x; \theta)$ is 0.09.
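Picking the optimal and suboptimal annotations from the predicted candidates can be sketched as below. The sketch mirrors the per-entity layout of the example above; the dictionary structure and the function name `best_two` are assumptions, and no CRF library is invoked here, since the probabilities are simply taken from the example.

```python
def best_two(label_probs):
    """Return the (label, probability) pairs with the highest and second-highest probability."""
    if len(label_probs) < 2:
        raise ValueError("need at least two candidate annotations")
    ranked = sorted(label_probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[0], ranked[1]


# Candidate annotations and conditional probabilities from the example above.
predictions = {
    "Xiaoniu Online": {"organization name": 0.9, "place name": 0.09, "time": 0.01},
    "South China": {"place name": 0.89, "time": 0.02, "organization name": 0.09},
}
best, second = best_two(predictions["Xiaoniu Online"])
south_best, south_second = best_two(predictions["South China"])
```

For "Xiaoniu Online" this gives 0.9 and 0.09, the two probabilities that feed the utility evaluation function.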

Step 135: for each unlabeled text, calculate the utility value of each named entity in that text from the conditional probabilities by means of the utility evaluation function.

The utility evaluation function is:

$$\phi(x) = 1 - \Bigl( P(y_1^* \mid x; \theta) - (1 - \lambda)\, P(y_2^* \mid x; \theta) \Bigr)$$

For each unlabeled text, the utility value of each named entity in the text is calculated with the utility evaluation function from the conditional probabilities obtained above. For example, the unlabeled text "Xiaoniu Online ranks first in South China" contains two named entity candidate words, "Xiaoniu Online" and "South China". The optimal annotation for "Xiaoniu Online" is (Xiaoniu Online, organization name, 0.9) and the suboptimal annotation is (Xiaoniu Online, place name, 0.09), so $P(y_1^* \mid x; \theta)$ is 0.9 and $P(y_2^* \mid x; \theta)$ is 0.09. Taking $\lambda = 0.5$, the utility value of "Xiaoniu Online" in this unlabeled text is 1 - (0.9 - (1 - 0.5) × 0.09) = 0.145. The utility value of "South China" in this text is calculated in the same way. The utility value of each named entity is then calculated for the other unlabeled texts in turn.
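The worked arithmetic above can be checked with a few lines of code. The form of the function is inferred from the arithmetic in the example, 1 - (0.9 - (1 - 0.5) × 0.09), so it should be read as a sketch under that assumption; the function name `utility` is hypothetical.

```python
def utility(p_best, p_second, lam=0.5):
    """Utility value 1 - (p_best - (1 - lam) * p_second), with 0 <= lam <= 1.

    A confident prediction (large gap between best and second best) gives a
    low utility; an uncertain one gives a high utility.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return 1.0 - (p_best - (1.0 - lam) * p_second)


# "Xiaoniu Online": best 0.9, second best 0.09, adjustment factor 0.5.
u_xiaoniu = utility(0.9, 0.09, lam=0.5)
# "South China": best 0.89 (place name), second best 0.09 (organization name).
u_south = utility(0.89, 0.09, lam=0.5)
```

The first call reproduces the 0.145 of the example up to floating-point rounding; the same function applied to "South China" gives about 0.155.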

Step 137: obtain the utility value of each named entity candidate word in every unlabeled text that contains it, and calculate the average utility value of each named entity from these utility values.

The average utility is calculated as:

$$\bar{\phi}(t) = \frac{1}{|X_t|} \sum_{x_t \in X_t} \phi(x_t)$$

where $X_t$ is the set of samples containing the entity candidate word $t$, $|X_t|$ is the number of samples containing $t$, $\bar{\phi}(t)$ is the average utility value of $t$ over the sample set $X_t$, and $x_t$ is a text annotation sequence sample containing $t$.

The utility values calculated above for each named entity are averaged with this formula to obtain the average utility value of each named entity candidate word. This embodiment thus takes the conditional probabilities of the optimal and suboptimal annotation sequences that the trained CRF model outputs for each text, calculates from them the utility value of each named entity in each unlabeled text with the utility evaluation function, and then averages, for each named entity candidate word, its utility values over all unlabeled texts that contain it.
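The averaging step is a plain per-entity mean over the texts that contain the entity. A minimal sketch, in which the function name and the second text's utility value are hypothetical:

```python
from collections import defaultdict


def average_utilities(per_text_utilities):
    """Average each entity's utility over the unlabeled texts that contain it.

    per_text_utilities: list of {entity: utility} dicts, one per unlabeled text.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for utilities in per_text_utilities:
        for entity, value in utilities.items():
            sums[entity] += value
            counts[entity] += 1
    return {entity: sums[entity] / counts[entity] for entity in sums}


per_text = [
    {"Xiaoniu Online": 0.145, "South China": 0.155},  # values from the worked example
    {"Xiaoniu Online": 0.255},                        # hypothetical second text
]
averages = average_utilities(per_text)
```

Here "Xiaoniu Online" occurs in two texts, so its average is (0.145 + 0.255) / 2 = 0.2, while "South China" occurs once and keeps 0.155.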

In one embodiment, as shown in Fig. 3, before the labeled text set is obtained as the seed bank of the current iteration, the method further includes:

Step 180: collect text information.

Before the labeled text set is obtained as the seed bank of the current iteration, a crawler program is used to collect text information from the internet, such as news and comments, as the source material library.

Step 190: select a preset number of pieces of text information from the collected text information and label the named entities in them to generate the labeled text set; the remaining unlabeled texts in the collected text information form the unlabeled text set.

A portion of the texts in the source material library is selected, and their named entities are labeled manually. The manually labeled texts form the labeled text set. After these texts are removed from the source material library, all the remaining unlabeled texts form the unlabeled text set.

In this embodiment, a certain amount of text is first obtained with a crawler program, and the named entities in part of that text are then labeled manually. The resulting labeled text set is used as part of the seed bank in subsequent training. These manually labeled texts can improve the accuracy of the subsequent training results.
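The split into a labeled and an unlabeled text set can be sketched as below. This is only an illustration: the function `split_corpus`, the `annotate` callback standing in for manual labeling, and the sample texts are all assumptions introduced here.

```python
def split_corpus(texts, preset_number, annotate):
    """Label the first `preset_number` texts and leave the rest unlabeled.

    annotate: hypothetical callback that returns the manual labels of a text.
    """
    chosen, remaining = texts[:preset_number], texts[preset_number:]
    labeled = [(text, annotate(text)) for text in chosen]
    return labeled, remaining


# Placeholder data: a dictionary lookup stands in for a human annotator.
crawled = ["text A", "text B", "text C"]
manual_labels = {"text A": [("entity", "organization name")]}
labeled_set, unlabeled_set = split_corpus(crawled, 1, manual_labels.get)
```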

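In one embodiment, the utility evaluation function is:

$$U_M(x) = 1 - \left(P(y_1^* \mid x, \theta) - (1 - \lambda)\, P(y_2^* \mid x, \theta)\right)$$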

In this embodiment, a utility evaluation function is introduced to calculate the utility value of each named entity in a text label sequence, taking the conditional probabilities output by the CRF model as input. The calculation is simple, the results have high confidence, and the method is suited to processing large-scale text. Text data is unstructured data, and evaluating unstructured data is generally difficult; this method makes it possible to evaluate named entities in text quantitatively.

In one embodiment, the average utility calculation formula is:

$$\bar{U}(t) = \frac{1}{|X_t|} \sum_{n=1}^{|X_t|} U_M(x_t), \quad x_t \in X_t$$

In this embodiment, the calculated utility values of each named entity in the text label sequences are used: for each named entity candidate word, its utility values over the texts in the resource bank that contain it are summed and averaged, giving the average utility value. Likewise, the calculation is simple and easy to put into practice.

In one embodiment, as shown in Fig. 4, an apparatus 400 for building a named entity annotation resource bank is also provided. The apparatus includes: a seed bank acquisition module 410, a resource bank acquisition module 420, an average utility value calculation module 430, a named entity candidate word acquisition module 440, an annotated resource bank generation module 450, a candidate word scoring module 460, and a named entity annotation resource bank generation module 470.

The seed bank acquisition module 410 is configured to obtain a labeled text set as the seed bank of the current iteration, the labeled text set including labeled texts.

The resource bank acquisition module 420 is configured to obtain an unlabeled text set, the unlabeled text set including unlabeled texts, and to select a preset number of unlabeled texts from the unlabeled text set which, together with the seed bank, form the resource bank of the current iteration.

The average utility value calculation module 430 is configured to calculate the average utility value of each named entity in the unlabeled texts.

The named entity candidate word acquisition module 440 is configured to sort the average utility values in descending order and take a preset number of top-ranked named entities as candidate words.

The annotated resource bank generation module 450 is configured to select the texts that contain a candidate word and have the largest utility value and add them to the seed bank to form the seed bank of the next iteration, then select a preset number of unlabeled texts from the unlabeled text set which, together with that seed bank, form the resource bank of the next iteration, and so on until all unlabeled texts in the unlabeled text set have been iterated over, yielding the annotated resource bank.

The candidate word scoring module 460 is configured to score the candidate words in the annotated resource bank.

The named entity annotation resource bank generation module 470 is configured to obtain the texts containing the candidate words whose scores exceed a set threshold, and to use the set of these texts as the named entity annotation resource bank.
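How modules 410 to 450 cooperate across iterations can be sketched roughly as follows. This is one possible reading, not the patented implementation: the function `build_annotated_bank`, the `entity_utilities` callback (standing in for CRF training, prediction and utility evaluation), and the demo data are all hypothetical. The sketch also assumes that one text is added per candidate word, namely the text in which that word has the largest utility value, and that the final seed bank serves as the annotated resource bank.

```python
def build_annotated_bank(seed_bank, unlabeled, batch_size, top_n, entity_utilities):
    """Grow the seed bank batch by batch until no unlabeled text is left.

    entity_utilities(seed_bank, batch) is a hypothetical callback returning
    {text: {entity: utility value}} for the texts in `batch`.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    seed_bank = list(seed_bank)
    pending = list(unlabeled)
    while pending:
        batch, pending = pending[:batch_size], pending[batch_size:]
        per_text = entity_utilities(seed_bank, batch)
        # Group utility values by entity, remembering the source text.
        by_entity = {}
        for text in batch:
            for entity, value in per_text.get(text, {}).items():
                by_entity.setdefault(entity, []).append((value, text))
        averages = {
            entity: sum(value for value, _ in hits) / len(hits)
            for entity, hits in by_entity.items()
        }
        # Candidate words: top_n entities by average utility, largest first.
        candidates = sorted(averages, key=averages.get, reverse=True)[:top_n]
        for entity in candidates:
            # Assumed rule: keep the text where this candidate's utility peaks.
            _, best_text = max(by_entity[entity], key=lambda hit: hit[0])
            if best_text not in seed_bank:
                seed_bank.append(best_text)
    return seed_bank


def demo_utilities(seed_bank, batch):
    # Made-up utility values standing in for the CRF-based computation.
    table = {"a": {"E1": 0.9}, "b": {"E1": 0.2, "E2": 0.5}, "c": {"E2": 0.1}}
    return {text: table[text] for text in batch}


bank = build_annotated_bank(["s"], ["a", "b", "c"], 2, 1, demo_utilities)
```

In the demo, text "b" is never added because it does not hold the largest utility value for any selected candidate word; how unselected texts are handled is not specified here and is left open in this sketch.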

In one embodiment, as shown in Fig. 5, the average utility value calculation module 430 includes: a word segmentation module 431, a conditional probability calculation module 433, a utility value calculation module 435, and an average utility value acquisition module 437.

The word segmentation module 431 is configured to segment the unlabeled texts in the resource bank into words, obtaining segmented unlabeled texts.

The conditional probability calculation module 433 is configured to train a conditional random field (CRF) model on the labeled texts in the resource bank to obtain a prediction model, use the prediction model to predict label sequences for the unlabeled texts in the resource bank, and obtain from those label sequences the optimal and suboptimal label sequences together with their conditional probabilities.

The utility value calculation module 435 is configured to calculate, for each unlabeled text, the utility value of each named entity in that text from the conditional probabilities by means of the utility evaluation function.

The average utility value acquisition module 437 is configured to obtain the utility value of each named entity in every unlabeled text that contains it, and to calculate the average utility value of each named entity from these utility values.
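To make "optimal and suboptimal label sequences and their conditional probabilities" concrete, the sketch below scores every label sequence of a tiny linear-chain model by brute force and normalizes over all of them. It is a toy stand-in for a trained CRF, not the patent's model: the function `top_two_labelings`, the label names, and the emission and transition scores are all assumptions, and exhaustive enumeration is only feasible for very short inputs.

```python
import itertools
import math


def top_two_labelings(tokens, labels, emission, transition):
    """Return the two most probable label sequences with their probabilities.

    emission[(token, label)] and transition[(prev_label, label)] are
    hypothetical log-potentials; missing keys count as 0.0.
    """
    scored = []
    for seq in itertools.product(labels, repeat=len(tokens)):
        score = sum(emission.get((tok, lab), 0.0) for tok, lab in zip(tokens, seq))
        score += sum(transition.get(pair, 0.0) for pair in zip(seq, seq[1:]))
        scored.append((score, seq))
    # Normalize over all sequences, as a linear-chain CRF does.
    log_z = math.log(sum(math.exp(score) for score, _ in scored))
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)[:2]
    return [(seq, math.exp(score - log_z)) for score, seq in ranked]


tokens = ["calf is online", "ranks first in", "South China"]
labels = ["ORG", "LOC", "O"]
emission = {
    ("calf is online", "ORG"): 3.0,
    ("calf is online", "LOC"): 0.7,
    ("ranks first in", "O"): 2.0,
    ("South China", "LOC"): 3.0,
}
best, second = top_two_labelings(tokens, labels, emission, transition={})
```

Each returned pair holds a label sequence and its conditional probability, so `best[1]` and `second[1]` play the roles of the two probabilities fed to the utility evaluation function. A production system would obtain them from an n-best decoder rather than by enumeration.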

In one embodiment, as shown in Fig. 6, the apparatus 400 for building a named entity annotation resource bank further includes: a text information acquisition module 480 and a text information classification module 490.

The text information acquisition module 480 is configured to collect text information.

The text information classification module 490 is configured to select a preset number of pieces of text information from the collected text information and label the named entities in them to generate the labeled text set, the remaining unlabeled texts in the collected text information forming the unlabeled text set.

In one embodiment, a computer-readable storage medium is also provided, on which a computer program is stored. When executed by a processor, the program implements the following steps: obtaining a labeled text set as the seed bank of the current iteration, the labeled text set including labeled texts; obtaining an unlabeled text set, the unlabeled text set including unlabeled texts, and selecting a preset number of unlabeled texts from the unlabeled text set which, together with the seed bank, form the resource bank of the current iteration; calculating the average utility value of each named entity in the unlabeled texts; sorting the average utility values in descending order and taking a preset number of top-ranked named entities as candidate words; selecting the texts that contain a candidate word and have the largest utility value and adding them to the seed bank to form the seed bank of the next iteration, then selecting a preset number of unlabeled texts from the unlabeled text set which, together with that seed bank, form the resource bank of the next iteration, until all unlabeled texts in the unlabeled text set have been iterated over, yielding the annotated resource bank; scoring the candidate words in the annotated resource bank; and obtaining the texts containing the candidate words whose scores exceed a set threshold, the set of these texts serving as the named entity annotation resource bank.
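The final two steps, scoring the candidate words and keeping the texts whose candidate word scores above the threshold, might look like the sketch below. How the scores are computed is not shown in this passage, so the `scores` mapping is a hypothetical input; the function name `filter_by_score` and the sample data are likewise assumptions.

```python
def filter_by_score(annotated_bank, scores, threshold):
    """Keep the texts whose candidate word scored above `threshold`.

    annotated_bank: list of (text, candidate word) pairs.
    scores: hypothetical mapping from candidate word to its score.
    """
    return [
        text
        for text, word in annotated_bank
        if scores.get(word, 0.0) > threshold
    ]


# Illustrative data only; the scores are made up.
annotated = [("text A", "calf is online"), ("text B", "South China")]
scores = {"calf is online": 0.8, "South China": 0.4}
final_bank = filter_by_score(annotated, scores, threshold=0.5)
```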

In one embodiment, the program, when executed by the processor, further implements the following steps: segmenting the unlabeled texts in the resource bank into words, obtaining segmented unlabeled texts; training a conditional random field (CRF) model on the labeled texts in the resource bank to obtain a prediction model, using the prediction model to predict label sequences for the unlabeled texts in the resource bank, and obtaining from those label sequences the optimal and suboptimal label sequences together with their conditional probabilities; for each unlabeled text, calculating the utility value of each named entity in that text from the conditional probabilities by means of the utility evaluation function; and obtaining the utility value of each named entity in every unlabeled text that contains it, and calculating the average utility value of each named entity from these utility values.

In one embodiment, the program, when executed by the processor, further implements the following steps: collecting text information; and selecting a preset number of pieces of text information from the collected text information and labeling the named entities in them to generate the labeled text set, the remaining unlabeled texts in the collected text information forming the unlabeled text set.

In one embodiment, when the program is executed by the processor, the utility evaluation function is:

$$U_M(x) = 1 - \left(P(y_1^* \mid x, \theta) - (1 - \lambda)\, P(y_2^* \mid x, \theta)\right)$$

In one embodiment, when the program is executed by the processor, the average utility calculation formula is:

$$\bar{U}(t) = \frac{1}{|X_t|} \sum_{n=1}^{|X_t|} U_M(x_t), \quad x_t \in X_t$$

In one embodiment, a computer device is also provided. The computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor implements the following steps:

obtaining a labeled text set as the seed bank of the current iteration, the labeled text set including labeled texts; obtaining an unlabeled text set, the unlabeled text set including unlabeled texts, and selecting a preset number of unlabeled texts from the unlabeled text set which, together with the seed bank, form the resource bank of the current iteration; calculating the average utility value of each named entity in the unlabeled texts; sorting the average utility values in descending order and taking a preset number of top-ranked named entities as candidate words; selecting the texts that contain a candidate word and have the largest utility value and adding them to the seed bank to form the seed bank of the next iteration, then selecting a preset number of unlabeled texts from the unlabeled text set which, together with that seed bank, form the resource bank of the next iteration, until all unlabeled texts in the unlabeled text set have been iterated over, yielding the annotated resource bank; scoring the candidate words in the annotated resource bank; and obtaining the texts containing the candidate words whose scores exceed a set threshold, the set of these texts serving as the named entity annotation resource bank.

In one embodiment, the processor, when executing the computer program, further implements the following steps: segmenting the unlabeled texts in the resource bank into words, obtaining segmented unlabeled texts; training a conditional random field (CRF) model on the labeled texts in the resource bank to obtain a prediction model, using the prediction model to predict label sequences for the unlabeled texts in the resource bank, and obtaining from those label sequences the optimal and suboptimal label sequences together with their conditional probabilities; for each unlabeled text, calculating the utility value of each named entity in that text from the conditional probabilities by means of the utility evaluation function; and obtaining the utility value of each named entity in every unlabeled text that contains it, and calculating the average utility value of each named entity from these utility values.

In one embodiment, the processor, when executing the computer program, further implements the following steps: collecting text information; and selecting a preset number of pieces of text information from the collected text information and labeling the named entities in them to generate the labeled text set, the remaining unlabeled texts in the collected text information forming the unlabeled text set.

In one embodiment, when the processor executes the computer program, the utility evaluation function is:

$$U_M(x) = 1 - \left(P(y_1^* \mid x, \theta) - (1 - \lambda)\, P(y_2^* \mid x, \theta)\right)$$

In one embodiment, when the processor executes the computer program, the average utility calculation formula is:

$$\bar{U}(t) = \frac{1}{|X_t|} \sum_{n=1}^{|X_t|} U_M(x_t), \quad x_t \in X_t$$

The embodiments described above express only several implementations of the present invention. Their description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these all fall within the protection scope of the present invention. The protection scope of this patent shall therefore be determined by the appended claims.

Claims

1. A method for building a named entity annotation resource bank, the method comprising:

obtaining a labeled text set as the seed bank of the current iteration, the labeled text set including labeled texts;

obtaining an unlabeled text set, the unlabeled text set including unlabeled texts, and selecting a preset number of unlabeled texts from the unlabeled text set which, together with the seed bank, form the resource bank of the current iteration;

calculating the average utility value of each named entity in the unlabeled texts;

sorting the average utility values in descending order and taking a preset number of top-ranked named entities as candidate words;

selecting the texts that contain a candidate word and have the largest utility value and adding them to the seed bank to form the seed bank of the next iteration, then selecting a preset number of unlabeled texts from the unlabeled text set which, together with that seed bank, form the resource bank of the next iteration, until all unlabeled texts in the unlabeled text set have been iterated over, yielding an annotated resource bank;

scoring the candidate words in the annotated resource bank;

obtaining the texts containing the candidate words whose scores exceed a set threshold, and using the set of these texts as the named entity annotation resource bank.

2. The method according to claim 1, characterized in that calculating the average utility value of each named entity in the unlabeled texts comprises:

training a conditional random field (CRF) model on the labeled texts in the resource bank to obtain a prediction model, using the prediction model to predict label sequences for the unlabeled texts in the resource bank, and obtaining from the label sequences of the unlabeled texts the optimal and suboptimal label sequences together with their conditional probabilities;

for each unlabeled text, calculating the utility value of each named entity in that text from the conditional probabilities by means of a utility evaluation function;

obtaining the utility value of each named entity in every unlabeled text that contains the named entity, and calculating the average utility value of each named entity from the utility values.

3. The method according to claim 1, characterized in that, before obtaining the labeled text set as the seed bank of the current iteration, the method further comprises:

collecting text information;

selecting a preset number of pieces of text information from the collected text information and labeling the named entities in them to generate the labeled text set, the remaining unlabeled texts in the collected text information forming the unlabeled text set.

4. The method according to claim 2, characterized in that the utility evaluation function is:

$$U_M(x) = 1 - \left(P(y_1^* \mid x, \theta) - (1 - \lambda)\, P(y_2^* \mid x, \theta)\right)$$

where $y_1^*$ is the optimal label sequence of x, $y_2^*$ is the suboptimal label sequence of x, θ is the model parameter, 0 ≤ λ ≤ 1 is the adjustment factor, $P(y_1^* \mid x, \theta)$ is the conditional probability of the optimal label sequence of x, $P(y_2^* \mid x, \theta)$ is the conditional probability of the suboptimal label sequence of x, and x is a text label sequence sample.

5. The method according to claim 2, characterized in that the average utility calculation formula is:

$$\bar{U}(t) = \frac{1}{|X_t|} \sum_{n=1}^{|X_t|} U_M(x_t), \quad x_t \in X_t,$$

where $X_t$ is the set of samples containing entity candidate word t, $|X_t|$ is the number of samples containing entity candidate word t, $\bar{U}(t)$ is the average utility value of entity candidate word t over the sample set $X_t$, and $x_t$ is a text label sequence sample containing entity candidate word t.

6. An apparatus for building a named entity annotation resource bank, characterized in that the apparatus comprises:

a seed bank acquisition module, configured to obtain a labeled text set as the seed bank of the current iteration, the labeled text set including labeled texts;

a resource bank acquisition module, configured to obtain an unlabeled text set, the unlabeled text set including unlabeled texts, and to select a preset number of unlabeled texts from the unlabeled text set which, together with the seed bank, form the resource bank of the current iteration;

an average utility value calculation module, configured to calculate the average utility value of each named entity in the unlabeled texts;

a named entity candidate word acquisition module, configured to sort the average utility values in descending order and take a preset number of top-ranked named entities as candidate words;

an annotated resource bank generation module, configured to select the texts that contain a candidate word and have the largest utility value and add them to the seed bank to form the seed bank of the next iteration, then select a preset number of unlabeled texts from the unlabeled text set which, together with that seed bank, form the resource bank of the next iteration, until all unlabeled texts in the unlabeled text set have been iterated over, yielding an annotated resource bank;

a candidate word scoring module, configured to score the candidate words in the annotated resource bank;

a named entity annotation resource bank generation module, configured to obtain the texts containing the candidate words whose scores exceed a set threshold, and to use the set of these texts as the named entity annotation resource bank.

7. The apparatus according to claim 6, characterized in that the average utility value calculation module comprises:

a word segmentation module, configured to segment the unlabeled texts in the resource bank into words, obtaining segmented unlabeled texts;

a conditional probability calculation module, configured to train a conditional random field (CRF) model on the labeled texts in the resource bank to obtain a prediction model, use the prediction model to predict label sequences for the unlabeled texts in the resource bank, and obtain from the label sequences of the unlabeled texts the optimal and suboptimal label sequences together with their conditional probabilities;

a utility value calculation module, configured to calculate, for each unlabeled text, the utility value of each named entity in that text from the conditional probabilities by means of a utility evaluation function;

an average utility value acquisition module, configured to obtain the utility value of each named entity in every unlabeled text that contains the named entity, and to calculate the average utility value of each named entity from the utility values.

8. The apparatus according to claim 6, characterized in that the apparatus further comprises:

a text information acquisition module, configured to collect text information;

a text information classification module, configured to select a preset number of pieces of text information from the collected text information and label the named entities in them to generate the labeled text set, the remaining unlabeled texts in the collected text information forming the unlabeled text set.

9. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method for building a named entity annotation resource bank according to any one of claims 1 to 5.

10. A computer device, the computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the method for building a named entity annotation resource bank according to any one of claims 1 to 5.