CN105550227A - Named entity identification method and device - Google Patents

Named entity identification method and device

Info

Publication number
CN105550227A
CN105550227A (application No. CN201510889318.4A)
Authority
CN
China
Prior art keywords
word
named entity
probability
probability distribution
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510889318.4A
Other languages
Chinese (zh)
Other versions
CN105550227B (en)
Inventor
张晨
谢隆飞
尹泓钦
王全礼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp filed Critical China Construction Bank Corp
Priority to CN201510889318.4A priority Critical patent/CN105550227B/en
Publication of CN105550227A publication Critical patent/CN105550227A/en
Application granted granted Critical
Publication of CN105550227B publication Critical patent/CN105550227B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 — Information retrieval of unstructured textual data
    • G06F 16/36 — Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367 — Ontology
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06Q — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES
    • G06Q 50/00 — Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q 50/01 — Social networking


Abstract

The invention provides a named entity identification method and device. After a first entity probability distribution for each training document and a second entity probability distribution for each test document are obtained with an initially constructed first sequence labeling model, features can be extracted from social network information: the first context similarity and first object similarity of the training documents, and the second context similarity and second object similarity of the test documents. A second sequence labeling model is then trained using the first context similarity and first object similarity of the training documents, so that it is better suited to social networks; consequently, the named entity recognition result obtained by performing sequence labeling on the test documents with this second sequence labeling model is more accurate.

Description

Named entity recognition method and device
Technical field
The invention belongs to the technical field of named entity recognition, and in particular relates to a named entity recognition method and device.
Background technology
A named entity is an entity with a specific meaning, such as the person name Li San. Named entity recognition identifies such entities in text, mainly person names, place names, organization names, and proper nouns. The identified entities serve as input to downstream extraction tasks, such as relation extraction, event extraction, and fine-grained sentiment analysis, so the quality of the named entity recognition result directly affects the effect of those tasks.
Many named entity recognition methods already exist. For example, the recognition method of application No. 201310674046.7 first identifies special words in the text to be processed, recognizes model entities in that text, replaces the recognized model entities with preset numeric strings, and on that basis identifies commodity entities, commodity category entities, brand entities, item attribute name entities, item attribute value entities, and the like. Such methods mainly target general text. In social networks such as microblogs or QQ, however, most texts posted by users are short, and users follow one another; current named entity recognition methods do not exploit these characteristics. A named entity recognition method suited to social networks such as microblogs or QQ is therefore urgently needed.
Summary of the invention
In view of this, the object of the present invention is to provide a named entity recognition method and device that identify named entities on the basis of social network information, so as to be applicable to social networks. The technical scheme is as follows:
The invention provides a named entity recognition method, the method comprising:
performing sequence labeling on training documents and test documents with an initially constructed first sequence labeling model, to obtain a first entity probability distribution for each first word in each training document and a second entity probability distribution for each second word in each test document;
obtaining the first context similarity of each first word in its corresponding training document, and the first object similarity between the target objects to which the corresponding training documents belong;
obtaining a third entity probability distribution for each first word, based on its first entity probability distribution, first context similarity, and first object similarity;
obtaining the second context similarity of each second word in its corresponding test document, and the second object similarity between the target objects to which the corresponding test documents belong;
obtaining a fourth entity probability distribution for each second word, based on its second entity probability distribution, second context similarity, and second object similarity;
retraining the first sequence labeling model with the third entity probability distributions of the first words, to obtain a second sequence labeling model;
taking the fourth entity probability distribution of each second word in each test document as an observation variable of that test document, and performing sequence labeling on the test document based on the second sequence labeling model and the observation variable, to obtain the named entity of each second word in the test document.
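The steps above can be illustrated with a small runnable sketch. Everything here — the function names, the toy two-label distribution, and the similarity-blending rule — is a hypothetical stand-in for the models in the invention, shown only to make the data flow concrete.

```python
# Toy sketch of the claimed pipeline; all names and numbers are illustrative.

def initial_sequence_model(doc):
    # Stand-in for the first sequence labeling model: returns a
    # per-word entity probability distribution.
    return {w: {"PER": 0.6, "O": 0.4} for w in doc}

def refine(dist, sim):
    # Stand-in for the similarity-based refinement steps: sharpen each
    # word's distribution in proportion to a similarity feature, then
    # renormalize so each distribution still sums to 1.
    out = {}
    for w, probs in dist.items():
        scaled = {c: p ** (1 + sim) for c, p in probs.items()}
        z = sum(scaled.values())
        out[w] = {c: v / z for c, v in scaled.items()}
    return out

train_doc, test_doc = ["zhang", "san"], ["li", "si"]
first = initial_sequence_model(train_doc)   # first entity distribution
second = initial_sequence_model(test_doc)   # second entity distribution
third = refine(first, sim=0.3)              # third entity distribution
fourth = refine(second, sim=0.3)            # fourth entity distribution
# The second sequence labeling model would now be retrained on `third`,
# and `fourth` used as the observation variable when labeling test_doc.
```
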
Preferably, the obtaining of the second context similarity of each second word in its corresponding test document and the second object similarity between the target objects to which the corresponding test documents belong comprises:
obtaining the number of second words shared by word bag u and word bag v, and the total number of second words in word bags u and v, where word bag u is the word set of the test document corresponding to one second word and word bag v is the word set of the test document corresponding to another second word;
taking the ratio of the shared number to the total number as the second context similarity;
obtaining, based on the second context similarities of the test documents, the second object similarity between the target objects to which the test documents belong.
Preferably, the obtaining of the fourth entity probability distribution of each second word based on its second entity probability distribution, second context similarity, and second object similarity comprises:
obtaining, based on the second entity probability distribution and second context similarity of the second word, the probability that the named entity type of the second word belongs to named entity class label c, where c belongs to the named entity recognition class label set C and indicates one kind of named entity type;
obtaining, based on that probability, the probability sum over the named entity types of the second word;
obtaining, based on that probability sum, the named entity probability distribution of the second word over all test documents;
obtaining, based on that named entity probability distribution and the second object similarity of the second word, the probability sum for named entity class label c;
obtaining, based on that probability sum, the probability distribution with which the named entity type of the second word belongs to named entity class label c;
and, once such probability distributions have been obtained for the different named entity class labels in C, choosing the one with the largest value as the fourth entity probability distribution.
Preferably, the obtaining of the probability that the named entity type of the second word belongs to named entity class label c, based on the second entity probability distribution and the second context similarity of the second word, comprises obtaining that probability based on the formula

p(c\mid w,u,u',s,T)=\frac{\sum_{t\in T,\,t\propto w}p(c\mid t)\,\gamma(t,u')\,\omega(w,t)}{\sum_{t'\in T}\gamma(t',u')\,\omega(w,t')+\theta}\Big/ Z(w,u,u',s,T)

where w is the second word; s is a test document; u is the target object to which test document s belongs; u' is a non-target object; T is the entity class distribution set of the second word; p(c|t) is the second entity probability distribution; γ is a 0-1 function that judges whether the second word w appears in the non-target object u' that target object u follows; ω is the second context similarity; θ is a smoothing factor; and Z is the sum, over all named entity class labels c in the named entity recognition class label set C, of the probabilities given the second word w, the target object u, the non-target object u', the test document s, and the entity class distribution set T;
the obtaining of the probability sum over the named entity types of the second word, based on the probability that its named entity type belongs to named entity class label c, comprises obtaining that sum based on the formula

Z(w,u,u',S,T)=\sum_{c\in C}p(c\mid w,u,u',S,T)=\sum_{c\in C}\sum_{s\in S}\beta(s,u')\,p(c\mid w,u,u',s,T)

where S is the set of test documents and β is a 0-1 function that judges whether a test document belongs to the non-target object u';
the obtaining of the named entity probability distribution of the second word over all test documents, based on the probability sum over its named entity types, comprises obtaining that distribution based on the formula

p(c\mid w,u,u',S,T)=\sum_{s\in S}\beta(s,u')\,p(c\mid w,u,u',s,T)\Big/ Z(w,u,u',S,T).
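The two formulas above — the normalizer Z and the per-word distribution over all test documents — amount to a β-weighted sum over documents followed by normalization. A toy numeric sketch, with made-up per-document label probabilities:

```python
# aggregate() implements the beta-weighted sum and the division by Z;
# the document ids, labels, and probabilities are illustrative only.

def aggregate(per_doc_probs, beta):
    # per_doc_probs: {doc_id: {label c: p(c|w,u,u',s,T)}}
    # beta: {doc_id: 0 or 1}, whether the document belongs to u'
    raw = {}
    for s, probs in per_doc_probs.items():
        for c, p in probs.items():
            raw[c] = raw.get(c, 0.0) + beta[s] * p
    z = sum(raw.values())  # Z(w, u, u', S, T)
    return {c: v / z for c, v in raw.items()}

dist = aggregate(
    {"s1": {"PER": 0.7, "O": 0.3}, "s2": {"PER": 0.2, "O": 0.8}},
    {"s1": 1, "s2": 0},  # only s1 belongs to the non-target object u'
)
```
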
Preferably, the obtaining of the probability sum for named entity class label c, based on the named entity probability distribution and the second object similarity of the second word, comprises obtaining that sum based on a formula in terms of U, the set of non-target objects u'; α, a 0-1 function that judges whether target object u and non-target object u' are in a follow relation; the second object similarity; and a smoothing factor θ;
the obtaining of the probability distribution with which the named entity type of the second word belongs to named entity class label c, based on the probability sum for named entity class label c, comprises obtaining that distribution based on the formula

p(c\mid w)=p(c\mid w,u,U,S,T)=\sum_{u'\in U}\sum_{s\in S}\sum_{t\in T,\,t=w}p(c\mid w,u,u',S,T)\,p(c\mid w,u,u',s,T);
and the choosing of the probability distribution with the largest value as the fourth entity probability distribution comprises obtaining the fourth entity probability distribution based on the formula

c=\arg\max_{c\in C}p(c\mid w)=\arg\max_{c\in C}p(c\mid w,u,U,S,T).
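The argmax selection above is a one-liner over a label-probability dictionary; the labels and numbers below are made up for illustration.

```python
def pick_label(dist):
    # c = argmax_{c in C} p(c|w): keep the class label with the largest value
    return max(dist, key=dist.get)

best = pick_label({"PER": 0.6, "LOC": 0.3, "O": 0.1})
```
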
The invention also provides a named entity recognition device, the device comprising:
a first acquiring unit, configured to perform sequence labeling on training documents and test documents with an initially constructed first sequence labeling model, to obtain a first entity probability distribution for each first word in each training document and a second entity probability distribution for each second word in each test document;
a second acquiring unit, configured to obtain the first context similarity of each first word in its corresponding training document and the first object similarity between the target objects to which the corresponding training documents belong;
a third acquiring unit, configured to obtain a third entity probability distribution for each first word based on its first entity probability distribution, first context similarity, and first object similarity;
a fourth acquiring unit, configured to obtain the second context similarity of each second word in its corresponding test document and the second object similarity between the target objects to which the corresponding test documents belong;
a fifth acquiring unit, configured to obtain a fourth entity probability distribution for each second word based on its second entity probability distribution, second context similarity, and second object similarity;
a training unit, configured to retrain the first sequence labeling model with the third entity probability distributions of the first words, to obtain a second sequence labeling model;
a test unit, configured to take the fourth entity probability distribution of each second word in each test document as an observation variable of that test document, and to perform sequence labeling on the test document based on the second sequence labeling model and the observation variable, to obtain the named entity of each second word in the test document.
Preferably, the fourth acquiring unit comprises:
a first acquiring subunit, configured to obtain the number of second words shared by word bag u and word bag v and the total number of second words in word bags u and v, where word bag u is the word set of the test document corresponding to one second word and word bag v is the word set of the test document corresponding to another second word;
a second acquiring subunit, configured to take the ratio of the shared number to the total number as the second context similarity;
a third acquiring subunit, configured to obtain, based on the second context similarities of the test documents, the second object similarity between the target objects to which the test documents belong.
Preferably, the fifth acquiring unit comprises:
a first probability acquiring subunit, configured to obtain, based on the second entity probability distribution and second context similarity of the second word, the probability that the named entity type of the second word belongs to named entity class label c, where c belongs to the named entity recognition class label set C and indicates one kind of named entity type;
a first probability-sum acquiring subunit, configured to obtain, based on that probability, the probability sum over the named entity types of the second word;
a second probability acquiring subunit, configured to obtain, based on that probability sum, the named entity probability distribution of the second word over all test documents;
a second probability-sum acquiring subunit, configured to obtain, based on that named entity probability distribution and the second object similarity of the second word, the probability sum for named entity class label c;
a third probability acquiring subunit, configured to obtain, based on that probability sum, the probability distribution with which the named entity type of the second word belongs to named entity class label c;
a fourth probability acquiring subunit, configured to choose, once such probability distributions have been obtained for the different named entity class labels in C, the one with the largest value as the fourth entity probability distribution.
Preferably, the first probability acquiring subunit is configured to obtain the probability that the named entity type of the second word belongs to named entity class label c based on the formula

p(c\mid w,u,u',s,T)=\frac{\sum_{t\in T,\,t\propto w}p(c\mid t)\,\gamma(t,u')\,\omega(w,t)}{\sum_{t'\in T}\gamma(t',u')\,\omega(w,t')+\theta}\Big/ Z(w,u,u',s,T)

where w is the second word; s is a test document; u is the target object to which test document s belongs; u' is a non-target object; T is the entity class distribution set of the second word; p(c|t) is the second entity probability distribution; γ is a 0-1 function that judges whether the second word w appears in the non-target object u' that target object u follows; ω is the second context similarity; θ is a smoothing factor; and Z is the sum, over all named entity class labels c in the named entity recognition class label set C, of the probabilities given the second word w, the target object u, the non-target object u', the test document s, and the entity class distribution set T;
the first probability-sum acquiring subunit is configured to obtain the probability sum over the named entity types of the second word based on the formula

Z(w,u,u',S,T)=\sum_{c\in C}p(c\mid w,u,u',S,T)=\sum_{c\in C}\sum_{s\in S}\beta(s,u')\,p(c\mid w,u,u',s,T)

where S is the set of test documents and β is a 0-1 function that judges whether a test document belongs to the non-target object u';
the second probability acquiring subunit is configured to obtain the named entity probability distribution of the second word over all test documents based on the formula

p(c\mid w,u,u',S,T)=\sum_{s\in S}\beta(s,u')\,p(c\mid w,u,u',s,T)\Big/ Z(w,u,u',S,T).
Preferably, the second probability-sum acquiring subunit is configured to obtain the probability sum for named entity class label c based on a formula in terms of U, the set of non-target objects u'; α, a 0-1 function that judges whether target object u and non-target object u' are in a follow relation; the second object similarity; and a smoothing factor θ;
the third probability acquiring subunit is configured to obtain the probability distribution with which the named entity type of the second word belongs to named entity class label c based on the formula

p(c\mid w)=p(c\mid w,u,U,S,T)=\sum_{u'\in U}\sum_{s\in S}\sum_{t\in T,\,t=w}p(c\mid w,u,u',S,T)\,p(c\mid w,u,u',s,T);

and the fourth probability acquiring subunit is configured to obtain the fourth entity probability distribution based on the formula c=\arg\max_{c\in C}p(c\mid w)=\arg\max_{c\in C}p(c\mid w,u,U,S,T).
Compared with the prior art, the above technical scheme of the invention has the following advantage:
after the first entity probability distribution of the training documents and the second entity probability distribution of the test documents are obtained with the initially constructed first sequence labeling model, features can be extracted from social network information: the first context similarity and first object similarity of the training documents, and the second context similarity and second object similarity of the test documents. The second sequence labeling model trained with the first context similarity and first object similarity of the training documents is therefore better suited to social networks, and the named entity recognition result obtained by performing sequence labeling on the test documents with that model is more accurate.
Accompanying drawing explanation
To explain the embodiments of the invention or the prior-art technical schemes more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; for those of ordinary skill in the art, other drawings can be obtained from them without creative effort.
Fig. 1 is a flow chart of the named entity recognition method provided by an embodiment of the invention;
Fig. 2 is a sub-flow chart of the named entity recognition method provided by an embodiment of the invention;
Fig. 3 is a schematic structural diagram of the named entity recognition device provided by an embodiment of the invention;
Fig. 4 is a schematic structural diagram of the fifth acquiring unit in the named entity recognition device provided by an embodiment of the invention.
Embodiment
To make the objects, technical schemes, and advantages of the embodiments of the invention clearer, the technical schemes in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art without creative effort, based on the embodiments of the invention, fall within the protection scope of the invention.
Referring to Fig. 1, which shows a flow chart of the named entity recognition method provided by an embodiment of the invention, for identifying the named entity of each word in each test document in a social network, the method may comprise the following steps:
101: perform sequence labeling on training documents and test documents with an initially constructed first sequence labeling model, to obtain a first entity probability distribution for each first word in each training document and a second entity probability distribution for each second word in each test document.
In embodiments of the invention, the first sequence labeling model is a model commonly used for named entity recognition, such as a conditional random field (CRF), from which the entity probability distribution of a word in a document can be obtained. For example, let X be the observation-sequence random variable, Y the state-sequence random variable, x a document, and y the named entity label sequence corresponding to document x; the conditional distribution P(Y|X) is a conditional random field, with the parameterized form

p(Y=y\mid X=x)=\exp\Big(\sum_j\lambda_j\,t_j(y_{i-1},y_i,x,i)+\sum_k\mu_k\,s_k(y_i,x,i)\Big)

where t_j(y_{i-1}, y_i, x, i) is a transition feature function of the whole observation sequence and of positions i-1 and i of the label sequence, s_k(y_i, x, i) is a state feature function of position i of the label sequence and the whole observation sequence, and λ_j and μ_k are parameters estimated from a labeled corpus during model training. Substituting each word of the training and test documents into the above formula yields the first entity probability distribution of each first word in the training documents and the second entity probability distribution of each second word in the test documents.
It should be noted here that the conditional random field is an existing sequence labeling model and its entity probability formula is an existing computation; those skilled in the art know how to obtain from the above formula the first entity probability distribution of each first word in the training documents and the second entity probability distribution of each second word in the test documents, so the embodiments of the invention do not describe this in further detail.
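As a concrete toy instance of the CRF form above: the sketch below defines one transition feature t_j and one state feature s_k, fixes λ and μ by hand instead of estimating them from a corpus, and normalizes by brute-force enumeration of all label sequences. The features, tag set, and parameter values are illustrative assumptions, not the patent's.

```python
import itertools
import math

LABELS = ["B-PER", "O"]  # toy tag set

def transition_feat(y_prev, y_cur, x, i):
    # t_j: fires when a name tag follows a name tag
    return 1.0 if y_prev == "B-PER" and y_cur == "B-PER" else 0.0

def state_feat(y_cur, x, i):
    # s_k: fires when a capitalized word is tagged as a name
    return 1.0 if x[i][0].isupper() and y_cur == "B-PER" else 0.0

LAM, MU = 0.5, 2.0  # stand-ins for the parameters lambda_j, mu_k

def score(y, x):
    # unnormalized exp(sum of weighted features) over all positions
    s = sum(MU * state_feat(y[i], x, i) for i in range(len(x)))
    s += sum(LAM * transition_feat(y[i - 1], y[i], x, i)
             for i in range(1, len(x)))
    return math.exp(s)

def crf_prob(y, x):
    # p(Y=y|X=x): normalize over every possible label sequence
    z = sum(score(yy, x) for yy in itertools.product(LABELS, repeat=len(x)))
    return score(y, x) / z
```

A usage check: for a two-word document, summing `crf_prob` over all four label sequences gives 1, and a capitalized word is more likely to be tagged `B-PER` than `O`.
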
102: obtain the first context similarity of each first word in its corresponding training document and the first object similarity between the target objects to which the corresponding training documents belong.
103: obtain a third entity probability distribution for each first word based on its first entity probability distribution, first context similarity, and first object similarity.
104: obtain the second context similarity of each second word in its corresponding test document and the second object similarity between the target objects to which the corresponding test documents belong.
105: obtain a fourth entity probability distribution for each second word based on its second entity probability distribution, second context similarity, and second object similarity.
In embodiments of the invention, the first context similarity indicates the similarity between training documents, and the first object similarity indicates the similarity between the target objects to which the training documents belong; likewise, the second context similarity indicates the similarity between test documents, and the second object similarity indicates the similarity between the target objects to which the test documents belong. In general, target objects in a social network may follow one another, and the documents both sides post may be related, so the context similarity and the object similarity can be extracted as features from social network information.
The first and second context similarities are obtained by the same process, as are the first and second object similarities; when the two acquisition processes are the same, the computation of the third entity probability distribution for the training documents is also the same as that of the fourth entity probability distribution for the test documents. The embodiments of the invention therefore take the test documents as the example, and first introduce how the second context similarity and the second object similarity are obtained.
To measure the second context similarity and the second object similarity in a social network, the invention uses two similarity measures: Jaccard similarity and cosine similarity. The Jaccard similarity serves as the second context similarity of the embodiments of the invention. It treats each of the two test documents being compared as a bag of all the words in that document, and takes as the similarity the ratio of the number of second words common to the two word bags to the number of second words occurring in the two bags altogether. If the word bags of the two test documents are u and v, the Jaccard similarity of u and v is defined as

\mathrm{Jaccard}(u,v)=\frac{|u\cap v|}{|u\cup v|}

Its value lies in [0, 1], and the similarity between the two test documents is proportional to it. When the two test documents are completely unrelated, i.e., share no words, Jaccard(u, v) = 0; when the two test documents are identical, Jaccard(u, v) = 1.
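The Jaccard definition above translates directly into code; a minimal sketch treating each document as a set of words:

```python
def jaccard(u, v):
    # |u ∩ v| / |u ∪ v| over the word bags of two documents
    u, v = set(u), set(v)
    if not (u | v):
        return 0.0  # two empty documents: define the similarity as 0
    return len(u & v) / len(u | v)
```
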
The second object similarity is represented by the cosine similarity: the two test documents being compared are vectorized, and the similarity between the two vectors is computed with the cosine formula

\mathrm{Cosine}(u,v)=\frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2}\,\sqrt{\sum_i v_i^2}}

Its value lies in [-1, 1], and the similarity between the vectors is proportional to it. When the two vectors point in exactly opposite directions, Cosine(u, v) = -1; when they are perpendicular, i.e., the angle between them is 90°, Cosine(u, v) = 0; when they point in the same direction, Cosine(u, v) = 1. Text vectors in a vector space model, however, have no negative components, so within the vector space model the cosine similarity ranges over [0, 1].
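Likewise, a minimal sketch of the cosine formula over term-frequency vectors (non-negative in the vector space model, so the result stays in [0, 1]):

```python
import math

def cosine(u, v):
    # u, v: equal-length term-frequency vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0  # an all-zero vector has no direction; define as 0
    return dot / (nu * nv)
```
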
The Jaccard similarity is chosen because test documents in current social networks are short, and compared with other similarity measures the Jaccard similarity is better suited to short documents; the embodiments of the invention therefore select the Jaccard similarity to compute the first context similarity and the second context similarity.
Based on the acquisition processes of the above second context similarity and second object similarity, the computation of the fourth entity probability distribution of a test document, as shown in Figure 2, can comprise the following steps:
201: based on the second instance probability distribution of the second word and the second context similarity of the second word, obtain the probability that the named entity type of the second word belongs to named entity class label c, where named entity class label c belongs to the named entity recognition class tag set C and indicates one kind of named entity type. Specifically, this can be based on the following formula:
p(c|w, u, u′, s, T) = [Σ_{t∈T and t∝w} p(c|t)·γ(t, u′)·ω(w, t) / (Σ_{t′∈T} γ(t′, u′)·ω(w, t′) + θ)] / Z(w, u, u′, s, T)
to obtain the probability that the named entity type of the second word belongs to named entity class label c.
Here w is the second word, s is a test document, u is the target object to which test document s belongs, u′ is a non-target object, T is the entity class distribution set of the second word, p(c|t) is the second instance probability distribution, γ is a 0-1 function used to judge whether the second word w appears in a non-target object u′ followed by the target object u, ω is the second context similarity, θ is a smoothing factor, and Z represents the sum of the probabilities of every named entity class label c in the named entity recognition class tag set C given the second word w, the target object u, the non-target object u′, the test document s and the entity class distribution set T of each second word w.
As can be seen from the above formula, the probability that the named entity type of the second word belongs to named entity class label c is the probability of label c for the second word w given the second word w, the target object u to which it belongs, the non-target object u′, the test document s and the entity class distribution set T of each second word w.
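As a non-authoritative sketch of the similarity-weighted, smoothed combination in the formula above (variable names, the data layout of T, and the toy γ and ω functions are all assumptions made for illustration; the division by the normalization factor Z is omitted here because it is applied separately at the document-set level in the later steps):

```python
def label_prob(c, w, T, gamma, omega, theta=0.01):
    """Smoothed probability that the named entity type of word w belongs to
    label c within one test document.
    T     : entity class distribution set; each entry t has a word and a
            label distribution p(c|t)
    gamma : 0-1 function -- does entry t come from a followed non-target object?
    omega : context-similarity weight omega(w, t)
    theta : small smoothing factor (e.g. 0.01, as suggested in the text)
    """
    num = sum(t["p"].get(c, 0.0) * gamma(t) * omega(w, t)
              for t in T if t["word"] == w)
    den = sum(gamma(t) * omega(w, t) for t in T) + theta
    return num / den

# Toy data: two prior distributions for "Identity", one for "bank".
T = [
    {"word": "Identity", "p": {"Movie": 0.9, "Person": 0.1}},
    {"word": "Identity", "p": {"Movie": 0.6, "Person": 0.4}},
    {"word": "bank",     "p": {"Org": 1.0}},
]
gamma = lambda t: 1          # assume every entry comes from a followed object
omega = lambda w, t: 1.0     # assume uniform context similarity
print(round(label_prob("Movie", "Identity", T, gamma, omega), 4))  # 0.4983
```

The smoothing factor θ in the denominator keeps the result finite when no entry of T matches, at the cost of a slight downward bias that vanishes as θ → 0.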
202: based on the probability that the named entity type of the second word belongs to named entity class label c, obtain the sum of the probabilities over the named entity types of the second word. Specifically, this can be based on the formula:
Z(w, u, u′, S, T) = Σ_{c∈C} p(c|w, u, u′, S, T) = Σ_{c∈C} Σ_{s∈S} β(s, u′)·p(c|w, u, u′, s, T)
to obtain the sum of the probabilities over the named entity types of the second word; that is, Z represents the sum of the probabilities of every named entity class label c in the named entity recognition class tag set C given the second word w, the target object u to which it belongs, the non-target object u′, the test document set S and the entity class distribution set T of each second word w.
Here S is the test document set, and β is a 0-1 function used to judge whether a test document belongs to the non-target object u′; it can be obtained by matching the test documents against the non-target objects one by one.
203: based on the sum of the probabilities over the named entity types of the second word, obtain the named entity probability distribution of the second word over all test documents. Specifically, this can be based on the formula:
p(c|w, u, u′, S, T) = Σ_{s∈S} β(s, u′)·p(c|w, u, u′, s, T) / Z(w, u, u′, S, T)
to obtain the named entity probability distribution of the second word over all test documents. That is, given the second word w, the target object u to which it belongs, the non-target object u′, the test document set S and the entity class distribution set T of each second word w, the named entity probability distribution over named entity class label c can be expressed as follows: over every test document s in the test document set S, the sum of the probabilities of named entity class label c given the second word w, the target object u, the non-target object u′ and the entity class distribution set T of each second word w, divided by the normalization factor Z.
204: based on the named entity probability distribution and the second object similarity of the second word, obtain the sum of the probabilities of named entity class label c. Specifically, the sum of the probabilities of named entity class label c can be obtained based on the formula, where U is the set of non-target objects u′, α is a 0-1 function that judges whether a following relation exists between the target object u and the non-target object u′, the second object similarity serves as a weighting factor, and θ is a smoothing factor;
205: based on the sum of the probabilities of named entity class label c, obtain the probability distribution that the named entity type of the second word belongs to named entity class label c. Specifically, this can be based on the formula:
p(c|w) = p(c|w, u, U, S, T) = Σ_{u′∈U} Σ_{s∈S} Σ_{t∈T and t=w} p(c|w, u, u′, S, T)·p(c|w, u, u′, s, T)
to obtain the probability distribution that the named entity type of the second word belongs to named entity class label c.
206: after the probability distributions of the second word's named entity type over the different named entity class labels in the named entity recognition class tag set C have been obtained, choose the distribution with the maximum value as the fourth entity probability distribution. Specifically, this can be based on the formula:
c = argmax_{c∈C} p(c|w) = argmax_{c∈C} p(c|w, u, U, S, T)
to obtain the fourth entity probability distribution.
That is, for each second word, the probability distribution for any named entity class label in the named entity recognition class tag set C can be obtained through the above formulas; once the probability distributions of all the named entity class labels have been obtained, the distribution with the maximum probability is chosen from among them as the fourth entity probability distribution of the second word.
For example, if there are four named entity class labels in the named entity recognition class tag set C, then four probability distributions p(c|w, u, U, S, T) of the second word w can be obtained based on the above formulas, each probability distribution p(c|w, u, U, S, T) corresponding to one named entity class label, and the probability distribution p(c|w, u, U, S, T) with the maximum probability is chosen from among them as the fourth entity probability distribution.
Correspondingly, when the third entity probability distribution is calculated, the calculation uses the above formulas but is based on the Training documents, the target objects to which the Training documents belong, the non-target objects, the named entity recognition class tag set C and the entity class distribution set T. The value of the above smoothing factor θ should be small, such as θ = 0.01, so that it does not affect the calculation results of the above formulas.
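The final label selection in step 206 is a plain argmax over the per-label probabilities; as a short illustration (the probability values below are invented):

```python
def pick_label(label_probs):
    """Choose c = argmax over C of p(c|w): the label with the largest probability."""
    return max(label_probs, key=label_probs.get)

# Hypothetical p(c|w) for one second word over four class labels.
p_c_given_w = {"Person": 0.12, "Movie": 0.58, "Place": 0.05, "Org": 0.25}
print(pick_label(p_c_given_w))  # Movie
```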
106: based on the third entity probability distribution of each first word, carry out re-training on the first sequence labelling model to obtain the second sequence labelling model. That is, the first sequence labelling model is optimized based on the third entity probability distribution, so that the resulting second sequence labelling model better matches the characteristics of social networks and is therefore applicable to social networks.
The training process takes the third entity probability distribution as the observed variables of the Training documents, inputs them into the first sequence labelling model, and optimizes the parameters of the first sequence labelling model to obtain the second sequence labelling model. For example, when the first sequence labelling model is a conditional random field, the third entity probability distribution can be used, following the conditional random field training procedure, to re-optimize the initially constructed conditional random field, and the optimized conditional random field is taken as the second sequence labelling model.
107: take the fourth entity probability distribution of each second word in each test document as the observed variables of the corresponding test document, and carry out sequence labelling on the test document based on the second sequence labelling model and the observed variables of the test document, obtaining the named entity of each second word in the test document. In the embodiment of the present invention, the second sequence labelling model is an existing type of sequence labelling model, such as a conditional random field, so the test documents can be labelled with the existing conditional random field sequence labelling method; the sequence labelling process is not described further in the embodiment of the present invention.
From the above technical scheme it can be seen that, after the named entity recognition method provided by the embodiment of the present invention uses the initially constructed first sequence labelling model to obtain the first instance probability distribution of the Training documents and the second instance probability distribution of the test documents, features can be extracted from the social network information, such as the first context similarity and the first object similarity of the Training documents and the second context similarity and the second object similarity of the test documents. The second sequence labelling model trained on the basis of the first context similarity and the first object similarity of the Training documents is thus better suited to social networks, and when sequence labelling is then carried out on the test documents based on this second sequence labelling model, the named entity recognition results obtained are more accurate.
An experiment is given below to show that the named entity recognition method provided by the embodiment of the present invention is better suited to social networks. Specifically, a web crawler was used to crawl 648 target objects, obtaining a total of 300,400 Sina Weibo microblog texts from July and August 2013, of which 1,000 were randomly selected for manual annotation. XML tags were used for the annotation; annotating an entity with XML tags specifies both the entity boundary and the entity type, for example: "I think <Movie>Identity Thief</Movie>, a non-serious film, is also very thought-provoking." According to the entity types occurring in the crawled microblog texts, the types were defined as person name, organization name, place name, product, film, title and song. In total, 1,076 entities were annotated. The annotation was carried out by two people in parallel: each person manually annotated the entities occurring in the 1,000 microblog texts according to their own understanding of entity types and boundaries, the microblogs with differing annotations were removed, and 857 microblog texts with named entity class labels remained.
To prevent over-fitting, the experimental data was evaluated with ten-fold cross-validation; the results are as follows:
                         Precision    Recall    F1 value
Prior art                37.10%       11.03%    16.43%
The present invention    55.12%       23.94%    33.19%
where F1 = 2 × precision × recall / (precision + recall).
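The F1 value combines precision and recall as their harmonic mean. A minimal sketch follows (the entity counts are invented for illustration; note that when metrics are computed per fold and then averaged, as in ten-fold cross-validation, the reported F1 need not equal the F1 recomputed from the averaged precision and recall):

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and
    false-negative entity counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# e.g. 50 correctly recognized entities, 40 spurious ones, 160 missed
p, r, f1 = prf(50, 40, 160)
print(f"precision={p:.2%} recall={r:.2%} F1={f1:.2%}")
```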
Corresponding to the above method embodiment, the embodiment of the present invention also provides a named entity recognition device which, as shown in Figure 3, can comprise: a first acquiring unit 11, a second acquiring unit 12, a third acquiring unit 13, a fourth acquiring unit 14, a fifth acquiring unit 15, a training unit 16 and a test unit 17.
The first acquiring unit 11 is used to carry out sequence labelling on the Training documents and test documents based on the initially constructed first sequence labelling model, obtaining the first instance probability distribution of each first word in each Training document and the second instance probability distribution of each second word in each test document; for details, refer to the related description of step 101 in the method embodiment.
The second acquiring unit 12 is used to obtain the first context similarity of each first word in its corresponding Training document and the first object similarity between the target objects to which the Training documents corresponding to the first words belong.
The third acquiring unit 13 is used to obtain the third entity probability distribution of the corresponding first word based on the first instance probability distribution of each first word, the first context similarity of each first word and the first object similarity of each first word.
The fourth acquiring unit 14 is used to obtain the second context similarity of each second word in its corresponding test document and the second object similarity between the target objects to which the test documents corresponding to the second words belong.
The fifth acquiring unit 15 is used to obtain the fourth entity probability distribution of the corresponding second word based on the second instance probability distribution of each second word, the second context similarity of each second word and the second object similarity of each second word.
In the embodiment of the present invention, the first context similarity indicates the similarity between Training documents and the first object similarity indicates the similarity between the target objects to which the Training documents belong; likewise, the second context similarity indicates the similarity between test documents and the second object similarity indicates the similarity between the target objects to which the test documents belong. Generally, the target objects in a social network may follow one another and the documents published by both sides may also be related, so the first context similarity and the first object similarity can be extracted as features of the social network information.
In the embodiment of the present invention, the acquisition processes of the first context similarity and the second context similarity are identical, as are those of the first object similarity and the second object similarity. Since the acquisition processes of these two similarities are identical, the computation of the third entity probability distribution of the Training documents is also identical to that of the fourth entity probability distribution of the test documents; the embodiment of the present invention is therefore described with respect to the test documents, and the acquisition processes of the second context similarity and the second object similarity are introduced first.
Preferably, the fourth acquiring unit can comprise: a first obtaining subunit, a second obtaining subunit and a third obtaining subunit. Wherein,
the first obtaining subunit is used to obtain the number of second words shared by word bag u and word bag v and the total number of second words in word bag u and word bag v, where word bag u is the word set of the test document corresponding to one second word and word bag v is the word set of the test document corresponding to another second word;
the second obtaining subunit is used to take the ratio of the number of shared second words to the total number of second words as the second context similarity. That is, the second context similarity of u and v can be represented by the Jaccard similarity, which can be defined as:
Jaccard(u, v) = |u ∩ v| / |u ∪ v|
The range of the Jaccard similarity is [0, 1], and the similarity between two test documents is directly proportional to the value of the Jaccard similarity. When two test documents are completely unrelated, that is, when there is no word shared between them, Jaccard(u, v) = 0; if the two test documents are identical, Jaccard(u, v) = 1.
The third obtaining subunit is used to obtain, based on the second context similarity of each test document, the second object similarity between the target objects to which the test documents belong. The second object similarity is represented by the cosine similarity: the two test documents whose similarity is to be measured are first vectorized, and the similarity between the two vectors is then calculated with the cosine formula:
Cosine(u, v) = Σ_i v_i·u_i / (√(Σ_i v_i²) · √(Σ_i u_i²))
The range of the cosine similarity is [-1, 1], and the similarity between two vectors is proportional to the value of the cosine similarity. When the two vectors point in exactly opposite directions, Cosine(u, v) = -1; when the two vectors are mutually perpendicular, that is, when the angle between them is 90°, Cosine(u, v) = 0; when the two vectors point in the same direction, Cosine(u, v) = 1. For text vectors, however, negative components do not occur in the vector space model, so within the vector space model the range of the cosine similarity is [0, 1].
Correspondingly, the structure of the fifth acquiring unit 15, as shown in Figure 4, can comprise: a first probability obtaining subunit 151, a first probability-sum obtaining subunit 152, a second probability obtaining subunit 153, a second probability-sum obtaining subunit 154, a third probability obtaining subunit 155 and a fourth probability obtaining subunit 156.
The first probability obtaining subunit 151 is used to obtain, based on the second instance probability distribution of the second word and the second context similarity of the second word, the probability that the named entity type of the second word belongs to named entity class label c, where named entity class label c belongs to the named entity recognition class tag set C and indicates one kind of named entity type. Specifically, this can be based on the following formula:
p(c|w, u, u′, s, T) = [Σ_{t∈T and t∝w} p(c|t)·γ(t, u′)·ω(w, t) / (Σ_{t′∈T} γ(t′, u′)·ω(w, t′) + θ)] / Z(w, u, u′, s, T)
to obtain the probability that the named entity type of the second word belongs to named entity class label c.
Here w is the second word, s is a test document, u is the target object to which test document s belongs, u′ is a non-target object, T is the entity class distribution set of the second word, p(c|t) is the second instance probability distribution, γ is a 0-1 function used to judge whether the second word w appears in a non-target object u′ followed by the target object u, ω is the second context similarity, θ is a smoothing factor, and Z represents the sum of the probabilities of every named entity class label c in the named entity recognition class tag set C given the second word w, the target object u, the non-target object u′, the test document s and the entity class distribution set T of each second word w.
As can be seen from the above formula, the probability that the named entity type of the second word belongs to named entity class label c is the probability of label c for the second word w given the second word w, the target object u to which it belongs, the non-target object u′, the test document s and the entity class distribution set T of each second word w.
The first probability-sum obtaining subunit 152 is used to obtain, based on the probability that the named entity type of the second word belongs to named entity class label c, the sum of the probabilities over the named entity types of the second word. Specifically, this can be based on the formula:
Z(w, u, u′, S, T) = Σ_{c∈C} p(c|w, u, u′, S, T) = Σ_{c∈C} Σ_{s∈S} β(s, u′)·p(c|w, u, u′, s, T)
to obtain the sum of the probabilities over the named entity types of the second word; that is, Z represents the sum of the probabilities of every named entity class label c in the named entity recognition class tag set C given the second word w, the target object u to which it belongs, the non-target object u′, the test document set S and the entity class distribution set T of each second word w.
Here S is the test document set, and β is a 0-1 function used to judge whether a test document belongs to the non-target object u′; it can be obtained by matching the test documents against the non-target objects one by one.
The second probability obtaining subunit 153 is used to obtain, based on the sum of the probabilities over the named entity types of the second word, the named entity probability distribution of the second word over all test documents. Specifically, this can be based on the formula:
p(c|w, u, u′, S, T) = Σ_{s∈S} β(s, u′)·p(c|w, u, u′, s, T) / Z(w, u, u′, S, T)
to obtain the named entity probability distribution of the second word over all test documents. That is, given the second word w, the target object u to which it belongs, the non-target object u′, the test document set S and the entity class distribution set T of each second word w, the named entity probability distribution over named entity class label c can be expressed as follows: over every test document s in the test document set S, the sum of the probabilities of named entity class label c given the second word w, the target object u, the non-target object u′ and the entity class distribution set T of each second word w, divided by the normalization factor Z.
The second probability-sum obtaining subunit 154 is used to obtain, based on the named entity probability distribution and the second object similarity of the second word, the sum of the probabilities of named entity class label c. Specifically, the sum of the probabilities of named entity class label c can be obtained based on the formula, where U is the set of non-target objects u′, α is a 0-1 function that judges whether a following relation exists between the target object u and the non-target object u′, the second object similarity serves as a weighting factor, and θ is a smoothing factor;
The third probability obtaining subunit 155 is used to obtain, based on the sum of the probabilities of named entity class label c, the probability distribution that the named entity type of the second word belongs to named entity class label c. Specifically, this can be based on the formula:
p(c|w) = p(c|w, u, U, S, T) = Σ_{u′∈U} Σ_{s∈S} Σ_{t∈T and t=w} p(c|w, u, u′, S, T)·p(c|w, u, u′, s, T)
to obtain the probability distribution that the named entity type of the second word belongs to named entity class label c.
The fourth probability obtaining subunit 156 is used to choose, once the probability distributions of the second word's named entity type over the different named entity class labels in the named entity recognition class tag set C have been obtained, the distribution with the maximum value as the fourth entity probability distribution. Specifically, this can be based on the formula:
c = argmax_{c∈C} p(c|w) = argmax_{c∈C} p(c|w, u, U, S, T)
to obtain the fourth entity probability distribution.
That is, for each second word, the probability distribution for any named entity class label in the named entity recognition class tag set C can be obtained through the above formulas; once the probability distributions of all the named entity class labels have been obtained, the distribution with the maximum probability is chosen from among them as the fourth entity probability distribution of the second word.
For example, if there are four named entity class labels in the named entity recognition class tag set C, then four probability distributions p(c|w, u, U, S, T) of the second word w can be obtained based on the above formulas, each probability distribution p(c|w, u, U, S, T) corresponding to one named entity class label, and the probability distribution p(c|w, u, U, S, T) with the maximum probability is chosen from among them as the fourth entity probability distribution.
Correspondingly, when the third entity probability distribution is calculated, the calculation uses the above formulas but is based on the Training documents, the target objects to which the Training documents belong, the non-target objects, the named entity recognition class tag set C and the entity class distribution set T. The value of the above smoothing factor θ should be small, such as θ = 0.01, so that it does not affect the calculation results of the above formulas.
The training unit 16 is used to carry out re-training on the first sequence labelling model based on the third entity probability distribution of each first word, obtaining the second sequence labelling model. That is, the first sequence labelling model is optimized based on the third entity probability distribution, so that the resulting second sequence labelling model better matches the characteristics of social networks and is therefore applicable to social networks.
The training process takes the third entity probability distribution as the observed variables of the Training documents, inputs them into the first sequence labelling model, and optimizes the parameters of the first sequence labelling model to obtain the second sequence labelling model. For example, when the first sequence labelling model is a conditional random field, the third entity probability distribution can be used, following the conditional random field training procedure, to re-optimize the initially constructed conditional random field, and the optimized conditional random field is taken as the second sequence labelling model.
The test unit 17 is used to take the fourth entity probability distribution of each second word in each test document as the observed variables of the corresponding test document, and to carry out sequence labelling on the test document based on the second sequence labelling model and the observed variables of the test document, obtaining the named entity of each second word in the test document. In the embodiment of the present invention, the second sequence labelling model is an existing type of sequence labelling model, such as a conditional random field, so the test documents can be labelled with the existing conditional random field sequence labelling method; the sequence labelling process is not described further in the embodiment of the present invention.
From the above technical scheme it can be seen that, after the named entity recognition device provided by the embodiment of the present invention uses the initially constructed first sequence labelling model to obtain the first instance probability distribution of the Training documents and the second instance probability distribution of the test documents, features can be extracted from the social network information, such as the first context similarity and the first object similarity of the Training documents and the second context similarity and the second object similarity of the test documents. The second sequence labelling model trained on the basis of the first context similarity and the first object similarity of the Training documents is thus better suited to social networks, and when sequence labelling is then carried out on the test documents based on this second sequence labelling model, the named entity recognition results obtained are more accurate.
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the identical or similar parts the embodiments may be referred to one another. Since the device embodiment is substantially similar to the method embodiment, its description is relatively brief; for the relevant details, refer to the description of the method embodiment.
Finally, it should also be noted that in this document relational terms such as first and second are used only to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply that any such actual relation or order exists between these entities or operations. Moreover, the terms "comprise", "include" and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further restriction, an element qualified by the statement "comprising a ..." does not exclude the existence of other identical elements in the process, method, article or device comprising that element.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The above are only preferred embodiments of the present invention. It should be pointed out that those skilled in the art can make further improvements and modifications without departing from the principles of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A named entity recognition method, characterized in that the method comprises:
carrying out sequence labelling on Training documents and test documents based on an initially constructed first sequence labelling model, obtaining the first instance probability distribution of each first word in each Training document and the second instance probability distribution of each second word in each test document;
obtaining the first context similarity of each first word in its corresponding Training document and the first object similarity between the target objects to which the Training documents corresponding to the first words belong;
based on the said first instance probability distribution of each first word, the said first context similarity of each first word and the said first object similarity of each first word, obtaining the third entity probability distribution of the corresponding first word;
obtaining the second context similarity of each second word in its corresponding test document and the second object similarity between the target objects to which the test documents corresponding to the second words belong;
based on the said second instance probability distribution of each second word, the said second context similarity of each second word and the said second object similarity of each second word, obtaining the fourth entity probability distribution of the corresponding second word;
carrying out re-training on the said first sequence labelling model based on the third entity probability distribution of each first word, obtaining a second sequence labelling model;
taking the fourth entity probability distribution of each second word in each test document as the observed variables of the corresponding test document, and carrying out sequence labelling on the said test document based on the said second sequence labelling model and the observed variables of the said test document, obtaining the named entity of each second word in the said test document.
2. The method according to claim 1, characterized in that the said obtaining the second context similarity of each second word in its corresponding test document and the second object similarity between the target objects to which the test documents corresponding to the second words belong comprises:
obtaining the number of second words shared by word bag u and word bag v and the total number of second words in the said word bag u and word bag v, wherein word bag u is the word set of the test document corresponding to one second word and word bag v is the word set of the test document corresponding to another second word;
taking the ratio of the number of the said shared second words to the total number of the said second words as the said second context similarity;
obtaining, based on the second context similarity of each of the said test documents, the second object similarity between the target objects to which the test documents belong.
3. The method according to claim 2, characterized in that the said obtaining the fourth entity probability distribution of the corresponding second word based on the said second instance probability distribution of each second word, the said second context similarity of each second word and the said second object similarity of each second word comprises:
obtaining, based on the second instance probability distribution of the second word and the said second context similarity of the second word, the probability that the named entity type of the second word belongs to named entity class label c, wherein named entity class label c belongs to the named entity recognition class tag set C and indicates one kind of named entity type;
obtaining, based on the probability that the named entity type of the said second word belongs to named entity class label c, the sum of the probabilities over the named entity types of the said second word;
obtaining, based on the sum of the probabilities over the named entity types of the said second word, the named entity probability distribution of the said second word over all test documents;
obtaining, based on the said named entity probability distribution and the said second object similarity of the second word, the sum of the probabilities of named entity class label c;
obtaining, based on the sum of the probabilities of the said named entity class label c, the probability distribution that the named entity type of the second word belongs to named entity class label c;
after the probability distributions of the second word's named entity type over the different named entity class labels in the named entity recognition class tag set C have been obtained, choosing the distribution with the maximum value as the said fourth entity probability distribution.
4. The method according to claim 3, characterized in that obtaining the probability that the named entity type of the second word belongs to named entity class label c, based on the second entity probability distribution and the second context similarity of the second word, comprises:
based on the formula

p(c|w,u,u′,s,T) = [ Σ_{t∈T, t∝w} p(c|t)·γ(t,u′)·ω(w,t) ] / [ Σ_{t′∈T} γ(t′,u′)·ω(w,t′) + θ ] / Z(w,u,u′,s,T)

obtaining the probability that the named entity type of the second word belongs to named entity class label c, wherein w is the second word, s is a test document, u is the target object to which test document s belongs, u′ is a non-target object, T is the set of entity class distributions of the second word, p(c|t) is the second entity probability distribution, γ is a 0-1 function that judges whether the second word w appears in the non-target object u′ followed by the target object u, ω is the second context similarity, θ is a smoothing factor, and Z(w,u,u′,s,T) is the sum, over the named entity class labels c in the named entity recognition class label set C, of the probabilities given the second word w, the target object u, the non-target object u′, the test document s and the entity class distribution set T;
said obtaining the probability sum over the named entity types of the second word, based on the probability that the named entity type of the second word belongs to named entity class label c, comprises:
based on the formula

Z(w,u,u′,S,T) = Σ_{c∈C} p(c|w,u,u′,S,T) = Σ_{c∈C} Σ_{s∈S} β(s,u′)·p(c|w,u,u′,s,T)

obtaining the probability sum over the named entity types of the second word, wherein S is the set of test documents and β is a 0-1 function that judges whether a test document belongs to the non-target object u′;
said obtaining the named entity probability distribution of the second word over all test documents, based on the probability sum over the named entity types of the second word, comprises:
based on the formula

p(c|w,u,u′,S,T) = [ Σ_{s∈S} β(s,u′)·p(c|w,u,u′,s,T) ] / Z(w,u,u′,S,T)

obtaining the named entity probability distribution of the second word over all test documents.
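The claim-4 formulas can be sketched together in Python. This is an illustration under simplifying assumptions, not the patent's implementation: γ and ω are passed in as callables, `matches(t, w)` stands in for the `t ∝ w` condition, and all names and toy values are ours.

```python
def p_label(c, w, u_prime, T, p_c_t, gamma, omega, matches, theta=1e-3):
    """Unnormalized claim-4 probability that the named entity type of second
    word w belongs to label c, combining the entity-class distributions t
    weighted by the 0-1 function gamma and the second context similarity omega."""
    num = sum(p_c_t[t].get(c, 0.0) * gamma(t, u_prime) * omega(w, t)
              for t in T if matches(t, w))
    den = sum(gamma(t2, u_prime) * omega(w, t2) for t2 in T) + theta
    return num / den

def label_distribution(w, u_prime, T, C, p_c_t, gamma, omega, matches):
    """Normalize over all labels in C, i.e. divide by Z(w, u, u', s, T)."""
    scores = {c: p_label(c, w, u_prime, T, p_c_t, gamma, omega, matches)
              for c in C}
    Z = sum(scores.values()) or 1.0
    return {c: s / Z for c, s in scores.items()}

dist = label_distribution(
    "w", "u'",
    T=["t1", "t2"],
    C=["PER", "ORG", "LOC"],
    p_c_t={"t1": {"PER": 0.7, "ORG": 0.3}, "t2": {"ORG": 0.9, "LOC": 0.1}},
    gamma=lambda t, u_prime: 1,   # assume every t is visible to u'
    omega=lambda w, t: 0.5,       # constant context similarity for the sketch
    matches=lambda t, w: True,    # assume every t is associated with w
)
```

With these toy inputs the normalized distribution sums to one and ORG, being supported by both entity-class distributions, receives the largest probability.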
5. The method according to claim 4, characterized in that said obtaining the probability sum of named entity class label c, based on the named entity probability distribution and the second object similarity of the second word, comprises:
based on a formula (rendered as an image in the source publication and not reproduced here),
obtaining the probability sum of named entity class label c, wherein U is the set of non-target objects u′, α is a 0-1 function that judges whether a follow relation exists between the target object u and the non-target object u′, a symbol not reproduced here denotes the second object similarity, and θ is a smoothing factor;
said obtaining the probability distribution that the named entity type of the second word belongs to named entity class label c, based on the probability sum of named entity class label c, comprises:
based on the formula

p(c|w) = p(c|w,u,U,S,T) = Σ_{u′∈U} Σ_{s∈S} Σ_{t∈T, t=w} p(c|w,u,u′,S,T)·p(c|w,u,u′,s,T)

obtaining the probability distribution that the named entity type of the second word belongs to named entity class label c;
said selecting the probability distribution with the largest value as the 4th entity probability distribution, after the probability distributions of the named entity type of the second word over the different named entity class labels in the named entity recognition class label set C have been obtained, comprises:
based on the formula

c = argmax_{c∈C} p(c|w) = argmax_{c∈C} p(c|w,u,U,S,T)

obtaining the 4th entity probability distribution.
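The final selection in claim 5 is a plain argmax over the label set C. A one-line sketch (the function name and toy distribution are ours):

```python
def fourth_entity_label(p_c_w):
    """Claim-5 argmax: the label c in C that maximizes p(c|w)."""
    return max(p_c_w, key=p_c_w.get)

best = fourth_entity_label({"PER": 0.35, "ORG": 0.60, "LOC": 0.05})
```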
6. A named entity recognition device, characterized in that the device comprises:
a first acquiring unit, configured to carry out sequence labelling on training documents and test documents based on an initially constructed first sequence labelling model, to obtain the first entity probability distribution of each first word in each training document and the second entity probability distribution of each second word in each test document;
a second acquiring unit, configured to obtain the first context similarity of each first word in its corresponding training document and the first object similarity between the target objects to which the training documents corresponding to the first words belong;
a 3rd acquiring unit, configured to obtain the 3rd entity probability distribution of a corresponding first word, based on the first entity probability distribution, the first context similarity and the first object similarity of each first word;
a 4th acquiring unit, configured to obtain the second context similarity of each second word in its corresponding training document and the second object similarity between the target objects to which the training documents corresponding to the second words belong;
a 5th acquiring unit, configured to obtain the 4th entity probability distribution of a corresponding second word, based on the second entity probability distribution, the second context similarity and the second object similarity of each second word;
a training unit, configured to retrain the first sequence labelling model based on the 3rd entity probability distribution of each first word, to obtain a second sequence labelling model;
a test unit, configured to take the 4th entity probability distribution of each second word in each test document as the observation variable of the corresponding test document, and to carry out sequence labelling on each test document based on the second sequence labelling model and the observation variables of that test document, to obtain the named entity of each second word in the test document.
7. The device according to claim 6, characterized in that the 4th acquiring unit comprises:
a first obtaining subunit, configured to obtain the number of second words shared by word bag u and word bag v and the total number of second words in word bag u and word bag v, wherein word bag u is the word set of the training document corresponding to one second word, and word bag v is the word set of the training document corresponding to another second word;
a second obtaining subunit, configured to take the ratio of the number of shared second words to the total number of second words as the second context similarity;
a 3rd obtaining subunit, configured to obtain, based on the second context similarity of each training document, the second object similarity between the target objects to which the training documents belong.
8. The device according to claim 7, characterized in that the 5th acquiring unit comprises:
a first probability obtaining subunit, configured to obtain, based on the second entity probability distribution and the second context similarity of a second word, the probability that the named entity type of the second word belongs to named entity class label c, wherein named entity class label c belongs to the named entity recognition class label set C and indicates one type of named entity;
a first probability-sum obtaining subunit, configured to obtain, based on the probability that the named entity type of the second word belongs to named entity class label c, the probability sum over the named entity types of the second word;
a second probability obtaining subunit, configured to obtain, based on the probability sum over the named entity types of the second word, the named entity probability distribution of the second word over all test documents;
a second probability-sum obtaining subunit, configured to obtain, based on the named entity probability distribution and the second object similarity of the second word, the probability sum of named entity class label c;
a 3rd probability obtaining subunit, configured to obtain, based on the probability sum of named entity class label c, the probability distribution that the named entity type of the second word belongs to named entity class label c;
a 4th probability obtaining subunit, configured to select, once the probability distributions of the named entity type of the second word over the different named entity class labels in the named entity recognition class label set C have been obtained, the probability distribution with the largest value as the 4th entity probability distribution.
9. The device according to claim 8, characterized in that the first probability obtaining subunit is configured to obtain, based on the formula

p(c|w,u,u′,s,T) = [ Σ_{t∈T, t∝w} p(c|t)·γ(t,u′)·ω(w,t) ] / [ Σ_{t′∈T} γ(t′,u′)·ω(w,t′) + θ ] / Z(w,u,u′,s,T)

the probability that the named entity type of the second word belongs to named entity class label c, wherein w is the second word, s is a test document, u is the target object to which test document s belongs, u′ is a non-target object, T is the set of entity class distributions of the second word, p(c|t) is the second entity probability distribution, γ is a 0-1 function that judges whether the second word w appears in the non-target object u′ followed by the target object u, ω is the second context similarity, θ is a smoothing factor, and Z(w,u,u′,s,T) is the sum, over the named entity class labels c in the named entity recognition class label set C, of the probabilities given the second word w, the target object u, the non-target object u′, the test document s and the entity class distribution set T;
the first probability-sum obtaining subunit is configured to obtain, based on the formula

Z(w,u,u′,S,T) = Σ_{c∈C} p(c|w,u,u′,S,T) = Σ_{c∈C} Σ_{s∈S} β(s,u′)·p(c|w,u,u′,s,T)

the probability sum over the named entity types of the second word, wherein S is the set of test documents and β is a 0-1 function that judges whether a test document belongs to the non-target object u′;
the second probability obtaining subunit is configured to obtain, based on the formula

p(c|w,u,u′,S,T) = [ Σ_{s∈S} β(s,u′)·p(c|w,u,u′,s,T) ] / Z(w,u,u′,S,T)

the named entity probability distribution of the second word over all test documents.
10. The device according to claim 9, characterized in that the second probability-sum obtaining subunit is configured to obtain, based on a formula (rendered as an image in the source publication and not reproduced here), the probability sum of named entity class label c, wherein U is the set of non-target objects u′, α is a 0-1 function that judges whether a follow relation exists between the target object u and the non-target object u′, a symbol not reproduced here denotes the second object similarity, and θ is a smoothing factor;
the 3rd probability obtaining subunit is configured to obtain, based on the formula

p(c|w) = p(c|w,u,U,S,T) = Σ_{u′∈U} Σ_{s∈S} Σ_{t∈T, t=w} p(c|w,u,u′,S,T)·p(c|w,u,u′,s,T)

the probability distribution that the named entity type of the second word belongs to named entity class label c;
the 4th probability obtaining subunit is configured to obtain the 4th entity probability distribution based on the formula

c = argmax_{c∈C} p(c|w) = argmax_{c∈C} p(c|w,u,U,S,T).
CN201510889318.4A 2015-12-07 2015-12-07 Named entity identification method and device Active CN105550227B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510889318.4A CN105550227B (en) 2015-12-07 2015-12-07 Named entity identification method and device


Publications (2)

Publication Number Publication Date
CN105550227A true CN105550227A (en) 2016-05-04
CN105550227B CN105550227B (en) 2020-05-22

Family

ID=55829416

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510889318.4A Active CN105550227B (en) 2015-12-07 2015-12-07 Named entity identification method and device

Country Status (1)

Country Link
CN (1) CN105550227B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106503192A (en) * 2016-10-31 2017-03-15 北京百度网讯科技有限公司 Name entity recognition method and device based on artificial intelligence
CN108304933A (en) * 2018-01-29 2018-07-20 北京师范大学 A kind of complementing method and complementing device of knowledge base
CN108717410A (en) * 2018-05-17 2018-10-30 达而观信息科技(上海)有限公司 Name entity recognition method and system
CN110096695A (en) * 2018-01-30 2019-08-06 腾讯科技(深圳)有限公司 Hyperlink label method and apparatus, file classification method and device
CN110298043A (en) * 2019-07-03 2019-10-01 吉林大学 A kind of vehicle name entity recognition method and system
CN110851597A (en) * 2019-10-28 2020-02-28 青岛聚好联科技有限公司 Method and device for sentence annotation based on similar entity replacement
CN111178073A (en) * 2018-10-23 2020-05-19 北京嘀嘀无限科技发展有限公司 Text processing method and device, electronic equipment and storage medium
CN111339773A (en) * 2018-12-18 2020-06-26 富士通株式会社 Information processing method, natural language processing method, and information processing apparatus
CN112825112A (en) * 2019-11-20 2021-05-21 阿里巴巴集团控股有限公司 Data processing method and device and computer terminal

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102314417A (en) * 2011-09-22 2012-01-11 西安电子科技大学 Method for identifying Web named entity based on statistical model
JP2014119977A (en) * 2012-12-17 2014-06-30 Nippon Telegr & Teleph Corp <Ntt> Daily word extractor, method, and program
US20150186355A1 (en) * 2013-12-26 2015-07-02 International Business Machines Corporation Adaptive parser-centric text normalization
CN104899304A (en) * 2015-06-12 2015-09-09 北京京东尚科信息技术有限公司 Named entity identification method and device
CN104933152A (en) * 2015-06-24 2015-09-23 北京京东尚科信息技术有限公司 Named entity recognition method and device




Similar Documents

Publication Publication Date Title
CN105550227A (en) Named entity identification method and device
Yao et al. Sensing spatial distribution of urban land use by integrating points-of-interest and Google Word2Vec model
CN104463548B (en) A kind of acknowledgement of consignment Quantitatively Selecting method under multifactor impact
CN108229590A (en) A kind of method and apparatus for obtaining multi-tag user portrait
CN104346440A (en) Neural-network-based cross-media Hash indexing method
CN103020482A (en) Relation-based spam comment detection method
CN105205096A (en) Text modal and image modal crossing type data retrieval method
CN106485271A (en) A kind of zero sample classification method based on multi-modal dictionary learning
Tanudjaja et al. Exploring bibliometric mapping in NUS using BibExcel and VOSviewer
CN103605970A (en) Drawing architectural element identification method and system based on machine learning
CN108256914A (en) A kind of point of interest category forecasting method based on tensor resolution model
CN106934035A (en) Concept drift detection method in a kind of multi-tag data flow based on class and feature distribution
CN105740382A (en) Aspect classification method for short comment texts
CN104103011B (en) Suspicious taxpayer recognition method based on taxpayer interest incidence network
CN103324708A (en) Method of transfer learning from long text to short text
CN104881796A (en) False comment judgment system based on comment content and ID recognition
CN104598510A (en) Event trigger word recognition method and device
CN105488522A (en) Search engine user information demand satisfaction evaluation method capable of integrating multiple views and semi-supervised learning
CN102945372B (en) Classifying method based on multi-label constraint support vector machine
Qianqian et al. The China-Pakistan economic corridor: The Pakistani media attitudes perspective
CN103123685B (en) Text mode recognition method
CN103793474B (en) Knowledge management oriented user-defined knowledge classification method
CN103942224B (en) A kind of method and device for the mark rule obtaining web page release
Guo et al. Bifurcation analysis of an age-structured alcoholism model
CN107644101A (en) Information classification approach and device, information classification equipment and computer-readable medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant