CN105550227A - Named entity identification method and device - Google Patents

Named entity identification method and device

Info

Publication number
CN105550227A
CN105550227A (application No. CN201510889318.4A)
Authority
CN
China
Prior art keywords
word
named entity
probability
probability distribution
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510889318.4A
Other languages
Chinese (zh)
Other versions
CN105550227B (en)
Inventor
张晨
谢隆飞
尹泓钦
王全礼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp filed Critical China Construction Bank Corp
Priority to CN201510889318.4A priority Critical patent/CN105550227B/en
Publication of CN105550227A publication Critical patent/CN105550227A/en
Application granted granted Critical
Publication of CN105550227B publication Critical patent/CN105550227B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 — Information retrieval of unstructured textual data
    • G06F 16/36 — Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367 — Ontology
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06Q — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES
    • G06Q 50/00 — Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q 50/01 — Social networking


Abstract

The invention provides a named entity identification method and device. After a first entity probability distribution for each training document and a second entity probability distribution for each test document are obtained with an initially constructed first sequence labeling model, features can be extracted from social network information: the first context similarity and first object similarity of the training documents, and the second context similarity and second object similarity of the test documents. A second sequence labeling model is then trained using the first context similarity and first object similarity of the training documents, so that it is better suited to social networks; consequently, the named entity recognition result obtained by performing sequence labeling on the test documents with this second sequence labeling model is more accurate.

Description

Named entity recognition method and device
Technical field
The invention belongs to the technical field of named entity recognition, and in particular relates to a named entity recognition method and device.
Background technology
A named entity is an entity with a specific meaning, such as the person name Li San. Named entity recognition identifies such entities in text, mainly person names, place names, organization names, and proper nouns. The identified entities serve as input to downstream extraction tasks, such as relation extraction, event extraction, and fine-grained sentiment analysis, so the quality of the named entity recognition result directly affects the effect of those tasks.
Many named entity recognition methods already exist. For example, the recognition method of application No. 201310674046.7 first identifies special words in the text to be processed, recognizes model entities in that text, replaces the recognized model entities with preset numeric strings, and on that basis identifies commodity entities, commodity category entities, brand entities, item attribute name entities, item attribute value entities, and the like. Such methods mainly target general text. In social networks such as microblogs or QQ, however, most texts posted by users are short, and users follow one another; current named entity recognition methods do not exploit these characteristics. A named entity recognition method suited to social networks such as microblogs or QQ is therefore urgently needed.
Summary of the invention
In view of this, the object of the present invention is to provide a named entity recognition method and device that identify named entities on the basis of social network information, so as to be applicable to social networks. The technical scheme is as follows:
The invention provides a named entity recognition method, the method comprising:
performing sequence labeling on training documents and test documents with an initially constructed first sequence labeling model, to obtain a first entity probability distribution for each first word in each training document and a second entity probability distribution for each second word in each test document;
obtaining the first context similarity of each first word in its corresponding training document, and the first object similarity between the target objects to which the corresponding training documents belong;
obtaining a third entity probability distribution for each first word, based on its first entity probability distribution, first context similarity, and first object similarity;
obtaining the second context similarity of each second word in its corresponding test document, and the second object similarity between the target objects to which the corresponding test documents belong;
obtaining a fourth entity probability distribution for each second word, based on its second entity probability distribution, second context similarity, and second object similarity;
retraining the first sequence labeling model with the third entity probability distributions of the first words, to obtain a second sequence labeling model;
taking the fourth entity probability distribution of each second word in each test document as an observation variable of that test document, and performing sequence labeling on the test document based on the second sequence labeling model and the observation variable, to obtain the named entity of each second word in the test document.
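The steps above can be illustrated with a small runnable sketch. Everything here — the function names, the toy two-label distribution, and the similarity-blending rule — is a hypothetical stand-in for the models in the invention, shown only to make the data flow concrete.

```python
# Toy sketch of the claimed pipeline; all names and numbers are illustrative.

def initial_sequence_model(doc):
    # Stand-in for the first sequence labeling model: returns a
    # per-word entity probability distribution.
    return {w: {"PER": 0.6, "O": 0.4} for w in doc}

def refine(dist, sim):
    # Stand-in for the similarity-based refinement steps: sharpen each
    # word's distribution in proportion to a similarity feature, then
    # renormalize so each distribution still sums to 1.
    out = {}
    for w, probs in dist.items():
        scaled = {c: p ** (1 + sim) for c, p in probs.items()}
        z = sum(scaled.values())
        out[w] = {c: v / z for c, v in scaled.items()}
    return out

train_doc, test_doc = ["zhang", "san"], ["li", "si"]
first = initial_sequence_model(train_doc)   # first entity distribution
second = initial_sequence_model(test_doc)   # second entity distribution
third = refine(first, sim=0.3)              # third entity distribution
fourth = refine(second, sim=0.3)            # fourth entity distribution
# The second sequence labeling model would now be retrained on `third`,
# and `fourth` used as the observation variable when labeling test_doc.
```
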
Preferably, the obtaining of the second context similarity of each second word in its corresponding test document and the second object similarity between the target objects to which the corresponding test documents belong comprises:
obtaining the number of second words shared by word bag u and word bag v, and the total number of second words in word bags u and v, where word bag u is the word set of the test document corresponding to one second word and word bag v is the word set of the test document corresponding to another second word;
taking the ratio of the shared number to the total number as the second context similarity;
obtaining, based on the second context similarities of the test documents, the second object similarity between the target objects to which the test documents belong.
Preferably, the obtaining of the fourth entity probability distribution of each second word based on its second entity probability distribution, second context similarity, and second object similarity comprises:
obtaining, based on the second entity probability distribution and second context similarity of the second word, the probability that the named entity type of the second word belongs to named entity class label c, where c belongs to the named entity recognition class label set C and indicates one kind of named entity type;
obtaining, based on that probability, the probability sum over the named entity types of the second word;
obtaining, based on that probability sum, the named entity probability distribution of the second word over all test documents;
obtaining, based on that named entity probability distribution and the second object similarity of the second word, the probability sum for named entity class label c;
obtaining, based on that probability sum, the probability distribution with which the named entity type of the second word belongs to named entity class label c;
and, once such probability distributions have been obtained for the different named entity class labels in C, choosing the one with the largest value as the fourth entity probability distribution.
Preferably, the obtaining of the probability that the named entity type of the second word belongs to named entity class label c, based on the second entity probability distribution and the second context similarity of the second word, comprises obtaining that probability based on the formula

p(c\mid w,u,u',s,T)=\frac{\sum_{t\in T,\,t\propto w}p(c\mid t)\,\gamma(t,u')\,\omega(w,t)}{\sum_{t'\in T}\gamma(t',u')\,\omega(w,t')+\theta}\Big/ Z(w,u,u',s,T)

where w is the second word; s is a test document; u is the target object to which test document s belongs; u' is a non-target object; T is the entity class distribution set of the second word; p(c|t) is the second entity probability distribution; γ is a 0-1 function that judges whether the second word w appears in the non-target object u' that target object u follows; ω is the second context similarity; θ is a smoothing factor; and Z is the sum, over all named entity class labels c in the named entity recognition class label set C, of the probabilities given the second word w, the target object u, the non-target object u', the test document s, and the entity class distribution set T;
the obtaining of the probability sum over the named entity types of the second word, based on the probability that its named entity type belongs to named entity class label c, comprises obtaining that sum based on the formula

Z(w,u,u',S,T)=\sum_{c\in C}p(c\mid w,u,u',S,T)=\sum_{c\in C}\sum_{s\in S}\beta(s,u')\,p(c\mid w,u,u',s,T)

where S is the set of test documents and β is a 0-1 function that judges whether a test document belongs to the non-target object u';
the obtaining of the named entity probability distribution of the second word over all test documents, based on the probability sum over its named entity types, comprises obtaining that distribution based on the formula

p(c\mid w,u,u',S,T)=\sum_{s\in S}\beta(s,u')\,p(c\mid w,u,u',s,T)\Big/ Z(w,u,u',S,T).
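The two formulas above — the normalizer Z and the per-word distribution over all test documents — amount to a β-weighted sum over documents followed by normalization. A toy numeric sketch, with made-up per-document label probabilities:

```python
# aggregate() implements the beta-weighted sum and the division by Z;
# the document ids, labels, and probabilities are illustrative only.

def aggregate(per_doc_probs, beta):
    # per_doc_probs: {doc_id: {label c: p(c|w,u,u',s,T)}}
    # beta: {doc_id: 0 or 1}, whether the document belongs to u'
    raw = {}
    for s, probs in per_doc_probs.items():
        for c, p in probs.items():
            raw[c] = raw.get(c, 0.0) + beta[s] * p
    z = sum(raw.values())  # Z(w, u, u', S, T)
    return {c: v / z for c, v in raw.items()}

dist = aggregate(
    {"s1": {"PER": 0.7, "O": 0.3}, "s2": {"PER": 0.2, "O": 0.8}},
    {"s1": 1, "s2": 0},  # only s1 belongs to the non-target object u'
)
```
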
Preferably, the obtaining of the probability sum for named entity class label c, based on the named entity probability distribution and the second object similarity of the second word, comprises obtaining that sum based on a formula in terms of U, the set of non-target objects u'; α, a 0-1 function that judges whether target object u and non-target object u' are in a follow relation; the second object similarity; and a smoothing factor θ;
the obtaining of the probability distribution with which the named entity type of the second word belongs to named entity class label c, based on the probability sum for named entity class label c, comprises obtaining that distribution based on the formula

p(c\mid w)=p(c\mid w,u,U,S,T)=\sum_{u'\in U}\sum_{s\in S}\sum_{t\in T,\,t=w}p(c\mid w,u,u',S,T)\,p(c\mid w,u,u',s,T);
and the choosing of the probability distribution with the largest value as the fourth entity probability distribution comprises obtaining the fourth entity probability distribution based on the formula

c=\arg\max_{c\in C}p(c\mid w)=\arg\max_{c\in C}p(c\mid w,u,U,S,T).
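The argmax selection above is a one-liner over a label-probability dictionary; the labels and numbers below are made up for illustration.

```python
def pick_label(dist):
    # c = argmax_{c in C} p(c|w): keep the class label with the largest value
    return max(dist, key=dist.get)

best = pick_label({"PER": 0.6, "LOC": 0.3, "O": 0.1})
```
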
The invention also provides a named entity recognition device, the device comprising:
a first acquiring unit, configured to perform sequence labeling on training documents and test documents with an initially constructed first sequence labeling model, to obtain a first entity probability distribution for each first word in each training document and a second entity probability distribution for each second word in each test document;
a second acquiring unit, configured to obtain the first context similarity of each first word in its corresponding training document and the first object similarity between the target objects to which the corresponding training documents belong;
a third acquiring unit, configured to obtain a third entity probability distribution for each first word based on its first entity probability distribution, first context similarity, and first object similarity;
a fourth acquiring unit, configured to obtain the second context similarity of each second word in its corresponding test document and the second object similarity between the target objects to which the corresponding test documents belong;
a fifth acquiring unit, configured to obtain a fourth entity probability distribution for each second word based on its second entity probability distribution, second context similarity, and second object similarity;
a training unit, configured to retrain the first sequence labeling model with the third entity probability distributions of the first words, to obtain a second sequence labeling model;
a test unit, configured to take the fourth entity probability distribution of each second word in each test document as an observation variable of that test document, and to perform sequence labeling on the test document based on the second sequence labeling model and the observation variable, to obtain the named entity of each second word in the test document.
Preferably, the fourth acquiring unit comprises:
a first acquiring subunit, configured to obtain the number of second words shared by word bag u and word bag v and the total number of second words in word bags u and v, where word bag u is the word set of the test document corresponding to one second word and word bag v is the word set of the test document corresponding to another second word;
a second acquiring subunit, configured to take the ratio of the shared number to the total number as the second context similarity;
a third acquiring subunit, configured to obtain, based on the second context similarities of the test documents, the second object similarity between the target objects to which the test documents belong.
Preferably, the fifth acquiring unit comprises:
a first probability acquiring subunit, configured to obtain, based on the second entity probability distribution and second context similarity of the second word, the probability that the named entity type of the second word belongs to named entity class label c, where c belongs to the named entity recognition class label set C and indicates one kind of named entity type;
a first probability-sum acquiring subunit, configured to obtain, based on that probability, the probability sum over the named entity types of the second word;
a second probability acquiring subunit, configured to obtain, based on that probability sum, the named entity probability distribution of the second word over all test documents;
a second probability-sum acquiring subunit, configured to obtain, based on that named entity probability distribution and the second object similarity of the second word, the probability sum for named entity class label c;
a third probability acquiring subunit, configured to obtain, based on that probability sum, the probability distribution with which the named entity type of the second word belongs to named entity class label c;
a fourth probability acquiring subunit, configured to choose, once such probability distributions have been obtained for the different named entity class labels in C, the one with the largest value as the fourth entity probability distribution.
Preferably, the first probability acquiring subunit is configured to obtain the probability that the named entity type of the second word belongs to named entity class label c based on the formula

p(c\mid w,u,u',s,T)=\frac{\sum_{t\in T,\,t\propto w}p(c\mid t)\,\gamma(t,u')\,\omega(w,t)}{\sum_{t'\in T}\gamma(t',u')\,\omega(w,t')+\theta}\Big/ Z(w,u,u',s,T)

where w is the second word; s is a test document; u is the target object to which test document s belongs; u' is a non-target object; T is the entity class distribution set of the second word; p(c|t) is the second entity probability distribution; γ is a 0-1 function that judges whether the second word w appears in the non-target object u' that target object u follows; ω is the second context similarity; θ is a smoothing factor; and Z is the sum, over all named entity class labels c in the named entity recognition class label set C, of the probabilities given the second word w, the target object u, the non-target object u', the test document s, and the entity class distribution set T;
the first probability-sum acquiring subunit is configured to obtain the probability sum over the named entity types of the second word based on the formula

Z(w,u,u',S,T)=\sum_{c\in C}p(c\mid w,u,u',S,T)=\sum_{c\in C}\sum_{s\in S}\beta(s,u')\,p(c\mid w,u,u',s,T)

where S is the set of test documents and β is a 0-1 function that judges whether a test document belongs to the non-target object u';
the second probability acquiring subunit is configured to obtain the named entity probability distribution of the second word over all test documents based on the formula

p(c\mid w,u,u',S,T)=\sum_{s\in S}\beta(s,u')\,p(c\mid w,u,u',s,T)\Big/ Z(w,u,u',S,T).
Preferably, the second probability-sum acquiring subunit is configured to obtain the probability sum for named entity class label c based on a formula in terms of U, the set of non-target objects u'; α, a 0-1 function that judges whether target object u and non-target object u' are in a follow relation; the second object similarity; and a smoothing factor θ;
the third probability acquiring subunit is configured to obtain the probability distribution with which the named entity type of the second word belongs to named entity class label c based on the formula

p(c\mid w)=p(c\mid w,u,U,S,T)=\sum_{u'\in U}\sum_{s\in S}\sum_{t\in T,\,t=w}p(c\mid w,u,u',S,T)\,p(c\mid w,u,u',s,T);

and the fourth probability acquiring subunit is configured to obtain the fourth entity probability distribution based on the formula c=\arg\max_{c\in C}p(c\mid w)=\arg\max_{c\in C}p(c\mid w,u,U,S,T).
Compared with the prior art, the above technical scheme of the invention has the following advantage:
after the first entity probability distribution of the training documents and the second entity probability distribution of the test documents are obtained with the initially constructed first sequence labeling model, features can be extracted from social network information: the first context similarity and first object similarity of the training documents, and the second context similarity and second object similarity of the test documents. The second sequence labeling model trained with the first context similarity and first object similarity of the training documents is therefore better suited to social networks, and the named entity recognition result obtained by performing sequence labeling on the test documents with that model is more accurate.
Accompanying drawing explanation
To explain the embodiments of the invention or the prior-art technical schemes more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; for those of ordinary skill in the art, other drawings can be obtained from them without creative effort.
Fig. 1 is a flow chart of the named entity recognition method provided by an embodiment of the invention;
Fig. 2 is a sub-flow chart of the named entity recognition method provided by an embodiment of the invention;
Fig. 3 is a schematic structural diagram of the named entity recognition device provided by an embodiment of the invention;
Fig. 4 is a schematic structural diagram of the fifth acquiring unit in the named entity recognition device provided by an embodiment of the invention.
Embodiment
To make the objects, technical schemes, and advantages of the embodiments of the invention clearer, the technical schemes in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art without creative effort, based on the embodiments of the invention, fall within the protection scope of the invention.
Referring to Fig. 1, which shows a flow chart of the named entity recognition method provided by an embodiment of the invention, for identifying the named entity of each word in each test document in a social network, the method may comprise the following steps:
101: perform sequence labeling on training documents and test documents with an initially constructed first sequence labeling model, to obtain a first entity probability distribution for each first word in each training document and a second entity probability distribution for each second word in each test document.
In embodiments of the invention, the first sequence labeling model is a model commonly used for named entity recognition, such as a conditional random field (CRF), from which the entity probability distribution of a word in a document can be obtained. For example, let X be the observation-sequence random variable, Y the state-sequence random variable, x a document, and y the named entity label sequence corresponding to document x; the conditional distribution P(Y|X) is a conditional random field, with the parameterized form

p(Y=y\mid X=x)=\exp\Big(\sum_j\lambda_j\,t_j(y_{i-1},y_i,x,i)+\sum_k\mu_k\,s_k(y_i,x,i)\Big)

where t_j(y_{i-1}, y_i, x, i) is a transition feature function of the whole observation sequence and of positions i-1 and i of the label sequence, s_k(y_i, x, i) is a state feature function of position i of the label sequence and the whole observation sequence, and λ_j and μ_k are parameters estimated from a labeled corpus during model training. Substituting each word of the training and test documents into the above formula yields the first entity probability distribution of each first word in the training documents and the second entity probability distribution of each second word in the test documents.
It should be noted here that the conditional random field is an existing sequence labeling model and its entity probability formula is an existing computation; those skilled in the art know how to obtain from the above formula the first entity probability distribution of each first word in the training documents and the second entity probability distribution of each second word in the test documents, so the embodiments of the invention do not describe this in further detail.
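As a concrete toy instance of the CRF form above: the sketch below defines one transition feature t_j and one state feature s_k, fixes λ and μ by hand instead of estimating them from a corpus, and normalizes by brute-force enumeration of all label sequences. The features, tag set, and parameter values are illustrative assumptions, not the patent's.

```python
import itertools
import math

LABELS = ["B-PER", "O"]  # toy tag set

def transition_feat(y_prev, y_cur, x, i):
    # t_j: fires when a name tag follows a name tag
    return 1.0 if y_prev == "B-PER" and y_cur == "B-PER" else 0.0

def state_feat(y_cur, x, i):
    # s_k: fires when a capitalized word is tagged as a name
    return 1.0 if x[i][0].isupper() and y_cur == "B-PER" else 0.0

LAM, MU = 0.5, 2.0  # stand-ins for the parameters lambda_j, mu_k

def score(y, x):
    # unnormalized exp(sum of weighted features) over all positions
    s = sum(MU * state_feat(y[i], x, i) for i in range(len(x)))
    s += sum(LAM * transition_feat(y[i - 1], y[i], x, i)
             for i in range(1, len(x)))
    return math.exp(s)

def crf_prob(y, x):
    # p(Y=y|X=x): normalize over every possible label sequence
    z = sum(score(yy, x) for yy in itertools.product(LABELS, repeat=len(x)))
    return score(y, x) / z
```

A usage check: for a two-word document, summing `crf_prob` over all four label sequences gives 1, and a capitalized word is more likely to be tagged `B-PER` than `O`.
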
102: obtain the first context similarity of each first word in its corresponding training document and the first object similarity between the target objects to which the corresponding training documents belong.
103: obtain a third entity probability distribution for each first word based on its first entity probability distribution, first context similarity, and first object similarity.
104: obtain the second context similarity of each second word in its corresponding test document and the second object similarity between the target objects to which the corresponding test documents belong.
105: obtain a fourth entity probability distribution for each second word based on its second entity probability distribution, second context similarity, and second object similarity.
In embodiments of the invention, the first context similarity indicates the similarity between training documents, and the first object similarity indicates the similarity between the target objects to which the training documents belong; likewise, the second context similarity indicates the similarity between test documents, and the second object similarity indicates the similarity between the target objects to which the test documents belong. In general, target objects in a social network may follow one another, and the documents both sides post may be related, so the context similarity and the object similarity can be extracted as features from social network information.
The first and second context similarities are obtained by the same process, as are the first and second object similarities; when the two acquisition processes are the same, the computation of the third entity probability distribution for the training documents is also the same as that of the fourth entity probability distribution for the test documents. The embodiments of the invention therefore take the test documents as the example, and first introduce how the second context similarity and the second object similarity are obtained.
To measure the second context similarity and the second object similarity in a social network, the invention uses two similarity measures: Jaccard similarity and cosine similarity. The Jaccard similarity serves as the second context similarity of the embodiments of the invention. It treats each of the two test documents being compared as a bag of all the words in that document, and takes as the similarity the ratio of the number of second words common to the two word bags to the number of second words occurring in the two bags altogether. If the word bags of the two test documents are u and v, the Jaccard similarity of u and v is defined as

\mathrm{Jaccard}(u,v)=\frac{|u\cap v|}{|u\cup v|}

Its value lies in [0, 1], and the similarity between the two test documents is proportional to it. When the two test documents are completely unrelated, i.e., share no words, Jaccard(u, v) = 0; when the two test documents are identical, Jaccard(u, v) = 1.
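The Jaccard definition above translates directly into code; a minimal sketch treating each document as a set of words:

```python
def jaccard(u, v):
    # |u ∩ v| / |u ∪ v| over the word bags of two documents
    u, v = set(u), set(v)
    if not (u | v):
        return 0.0  # two empty documents: define the similarity as 0
    return len(u & v) / len(u | v)
```
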
The second object similarity is represented by the cosine similarity: the two test documents being compared are vectorized, and the similarity between the two vectors is computed with the cosine formula

\mathrm{Cosine}(u,v)=\frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2}\,\sqrt{\sum_i v_i^2}}

Its value lies in [-1, 1], and the similarity between the vectors is proportional to it. When the two vectors point in exactly opposite directions, Cosine(u, v) = -1; when they are perpendicular, i.e., the angle between them is 90°, Cosine(u, v) = 0; when they point in the same direction, Cosine(u, v) = 1. Text vectors in a vector space model, however, have no negative components, so within the vector space model the cosine similarity ranges over [0, 1].
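Likewise, a minimal sketch of the cosine formula over term-frequency vectors (non-negative in the vector space model, so the result stays in [0, 1]):

```python
import math

def cosine(u, v):
    # u, v: equal-length term-frequency vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0  # an all-zero vector has no direction; define as 0
    return dot / (nu * nv)
```
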
The Jaccard similarity is chosen because test documents in current social networks are short, and compared with other similarity measures the Jaccard similarity is better suited to short documents; the embodiments of the invention therefore select the Jaccard similarity to compute the first context similarity and the second context similarity.
Based on the acquisition processes of the above second context similarity and second object similarity, the computation of the fourth entity probability distribution of a test document, as shown in Figure 2, can comprise the following steps:
201: based on the second instance probability distribution of the second word and the second context similarity of the second word, obtain the probability that the named entity type of the second word belongs to named entity class label c, where named entity class label c belongs to the named entity recognition class tag set C and indicates one kind of named entity type. Specifically, this can be based on the following formula:
p(c|w, u, u′, s, T) = [Σ_{t∈T and t∝w} p(c|t)·γ(t, u′)·ω(w, t) / (Σ_{t′∈T} γ(t′, u′)·ω(w, t′) + θ)] / Z(w, u, u′, s, T)
to obtain the probability that the named entity type of the second word belongs to named entity class label c.
Here w is the second word, s is a test document, u is the target object to which test document s belongs, u′ is a non-target object, T is the entity class distribution set of the second word, p(c|t) is the second instance probability distribution, γ is a 0-1 function used to judge whether the second word w appears in a non-target object u′ followed by the target object u, ω is the second context similarity, θ is a smoothing factor, and Z represents the sum of the probabilities of every named entity class label c in the named entity recognition class tag set C given the second word w, the target object u, the non-target object u′, the test document s and the entity class distribution set T of each second word w.
As can be seen from the above formula, the probability that the named entity type of the second word belongs to named entity class label c is the probability of label c for the second word w given the second word w, the target object u to which it belongs, the non-target object u′, the test document s and the entity class distribution set T of each second word w.
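As a non-authoritative sketch of the similarity-weighted, smoothed combination in the formula above (variable names, the data layout of T, and the toy γ and ω functions are all assumptions made for illustration; the division by the normalization factor Z is omitted here because it is applied separately at the document-set level in the later steps):

```python
def label_prob(c, w, T, gamma, omega, theta=0.01):
    """Smoothed probability that the named entity type of word w belongs to
    label c within one test document.
    T     : entity class distribution set; each entry t has a word and a
            label distribution p(c|t)
    gamma : 0-1 function -- does entry t come from a followed non-target object?
    omega : context-similarity weight omega(w, t)
    theta : small smoothing factor (e.g. 0.01, as suggested in the text)
    """
    num = sum(t["p"].get(c, 0.0) * gamma(t) * omega(w, t)
              for t in T if t["word"] == w)
    den = sum(gamma(t) * omega(w, t) for t in T) + theta
    return num / den

# Toy data: two prior distributions for "Identity", one for "bank".
T = [
    {"word": "Identity", "p": {"Movie": 0.9, "Person": 0.1}},
    {"word": "Identity", "p": {"Movie": 0.6, "Person": 0.4}},
    {"word": "bank",     "p": {"Org": 1.0}},
]
gamma = lambda t: 1          # assume every entry comes from a followed object
omega = lambda w, t: 1.0     # assume uniform context similarity
print(round(label_prob("Movie", "Identity", T, gamma, omega), 4))  # 0.4983
```

The smoothing factor θ in the denominator keeps the result finite when no entry of T matches, at the cost of a slight downward bias that vanishes as θ → 0.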
202: based on the probability that the named entity type of the second word belongs to named entity class label c, obtain the sum of the probabilities over the named entity types of the second word. Specifically, this can be based on the formula:
Z(w, u, u′, S, T) = Σ_{c∈C} p(c|w, u, u′, S, T) = Σ_{c∈C} Σ_{s∈S} β(s, u′)·p(c|w, u, u′, s, T)
to obtain the sum of the probabilities over the named entity types of the second word; that is, Z represents the sum of the probabilities of every named entity class label c in the named entity recognition class tag set C given the second word w, the target object u to which it belongs, the non-target object u′, the test document set S and the entity class distribution set T of each second word w.
Here S is the test document set, and β is a 0-1 function used to judge whether a test document belongs to the non-target object u′; it can be obtained by matching the test documents against the non-target objects one by one.
203: based on the sum of the probabilities over the named entity types of the second word, obtain the named entity probability distribution of the second word over all test documents. Specifically, this can be based on the formula:
p(c|w, u, u′, S, T) = Σ_{s∈S} β(s, u′)·p(c|w, u, u′, s, T) / Z(w, u, u′, S, T)
to obtain the named entity probability distribution of the second word over all test documents. That is, given the second word w, the target object u to which it belongs, the non-target object u′, the test document set S and the entity class distribution set T of each second word w, the named entity probability distribution over named entity class label c can be expressed as follows: over every test document s in the test document set S, the sum of the probabilities of named entity class label c given the second word w, the target object u, the non-target object u′ and the entity class distribution set T of each second word w, divided by the normalization factor Z.
204: based on the named entity probability distribution and the second object similarity of the second word, obtain the sum of the probabilities of named entity class label c. Specifically, the sum of the probabilities of named entity class label c can be obtained based on the formula, where U is the set of non-target objects u′, α is a 0-1 function that judges whether a following relation exists between the target object u and the non-target object u′, the second object similarity serves as a weighting factor, and θ is a smoothing factor;
205: based on the sum of the probabilities of named entity class label c, obtain the probability distribution that the named entity type of the second word belongs to named entity class label c. Specifically, this can be based on the formula:
p(c|w) = p(c|w, u, U, S, T) = Σ_{u′∈U} Σ_{s∈S} Σ_{t∈T and t=w} p(c|w, u, u′, S, T)·p(c|w, u, u′, s, T)
to obtain the probability distribution that the named entity type of the second word belongs to named entity class label c.
206: after the probability distributions of the second word's named entity type over the different named entity class labels in the named entity recognition class tag set C have been obtained, choose the distribution with the maximum value as the fourth entity probability distribution. Specifically, this can be based on the formula:
c = argmax_{c∈C} p(c|w) = argmax_{c∈C} p(c|w, u, U, S, T)
to obtain the fourth entity probability distribution.
That is, for each second word, the probability distribution for any named entity class label in the named entity recognition class tag set C can be obtained through the above formulas; once the probability distributions of all the named entity class labels have been obtained, the distribution with the maximum probability is chosen from among them as the fourth entity probability distribution of the second word.
For example, if there are four named entity class labels in the named entity recognition class tag set C, then four probability distributions p(c|w, u, U, S, T) of the second word w can be obtained based on the above formulas, each probability distribution p(c|w, u, U, S, T) corresponding to one named entity class label, and the probability distribution p(c|w, u, U, S, T) with the maximum probability is chosen from among them as the fourth entity probability distribution.
Correspondingly, when the third entity probability distribution is calculated, the calculation uses the above formulas but is based on the Training documents, the target objects to which the Training documents belong, the non-target objects, the named entity recognition class tag set C and the entity class distribution set T. The value of the above smoothing factor θ should be small, such as θ = 0.01, so that it does not affect the calculation results of the above formulas.
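The final label selection in step 206 is a plain argmax over the per-label probabilities; as a short illustration (the probability values below are invented):

```python
def pick_label(label_probs):
    """Choose c = argmax over C of p(c|w): the label with the largest probability."""
    return max(label_probs, key=label_probs.get)

# Hypothetical p(c|w) for one second word over four class labels.
p_c_given_w = {"Person": 0.12, "Movie": 0.58, "Place": 0.05, "Org": 0.25}
print(pick_label(p_c_given_w))  # Movie
```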
106: based on the third entity probability distribution of each first word, carry out re-training on the first sequence labelling model to obtain the second sequence labelling model. That is, the first sequence labelling model is optimized based on the third entity probability distribution, so that the resulting second sequence labelling model better matches the characteristics of social networks and is therefore applicable to social networks.
The training process takes the third entity probability distribution as the observed variables of the Training documents, inputs them into the first sequence labelling model, and optimizes the parameters of the first sequence labelling model to obtain the second sequence labelling model. For example, when the first sequence labelling model is a conditional random field, the third entity probability distribution can be used, following the conditional random field training procedure, to re-optimize the initially constructed conditional random field, and the optimized conditional random field is taken as the second sequence labelling model.
107: take the fourth entity probability distribution of each second word in each test document as the observed variables of the corresponding test document, and carry out sequence labelling on the test document based on the second sequence labelling model and the observed variables of the test document, obtaining the named entity of each second word in the test document. In the embodiment of the present invention, the second sequence labelling model is an existing type of sequence labelling model, such as a conditional random field, so the test documents can be labelled with the existing conditional random field sequence labelling method; the sequence labelling process is not described further in the embodiment of the present invention.
From the above technical scheme it can be seen that, after the named entity recognition method provided by the embodiment of the present invention uses the initially constructed first sequence labelling model to obtain the first instance probability distribution of the Training documents and the second instance probability distribution of the test documents, features can be extracted from the social network information, such as the first context similarity and the first object similarity of the Training documents and the second context similarity and the second object similarity of the test documents. The second sequence labelling model trained on the basis of the first context similarity and the first object similarity of the Training documents is thus better suited to social networks, and when sequence labelling is then carried out on the test documents based on this second sequence labelling model, the named entity recognition results obtained are more accurate.
An experiment is given below to show that the named entity recognition method provided by the embodiment of the present invention is better suited to social networks. Specifically, a web crawler was used to crawl 648 target objects, obtaining a total of 300,400 Sina Weibo microblog texts from July and August 2013, of which 1,000 were randomly selected for manual annotation. XML tags were used for the annotation; annotating an entity with XML tags specifies both the entity boundary and the entity type, for example: "I think <Movie>Identity Thief</Movie>, a non-serious film, is also very thought-provoking." According to the entity types occurring in the crawled microblog texts, the types were defined as person name, organization name, place name, product, film, title and song. In total, 1,076 entities were annotated. The annotation was carried out by two people in parallel: each person manually annotated the entities occurring in the 1,000 microblog texts according to their own understanding of entity types and boundaries, the microblogs with differing annotations were removed, and 857 microblog texts with named entity class labels remained.
To prevent over-fitting, the experimental data was evaluated with ten-fold cross-validation; the results are as follows:
                         Precision    Recall    F1 value
Prior art                37.10%       11.03%    16.43%
The present invention    55.12%       23.94%    33.19%
where F1 = 2 × precision × recall / (precision + recall).
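The F1 value combines precision and recall as their harmonic mean. A minimal sketch follows (the entity counts are invented for illustration; note that when metrics are computed per fold and then averaged, as in ten-fold cross-validation, the reported F1 need not equal the F1 recomputed from the averaged precision and recall):

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and
    false-negative entity counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# e.g. 50 correctly recognized entities, 40 spurious ones, 160 missed
p, r, f1 = prf(50, 40, 160)
print(f"precision={p:.2%} recall={r:.2%} F1={f1:.2%}")
```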
Corresponding to the above method embodiment, the embodiment of the present invention also provides a named entity recognition device which, as shown in Figure 3, can comprise: a first acquiring unit 11, a second acquiring unit 12, a third acquiring unit 13, a fourth acquiring unit 14, a fifth acquiring unit 15, a training unit 16 and a test unit 17.
The first acquiring unit 11 is used to carry out sequence labelling on the Training documents and test documents based on the initially constructed first sequence labelling model, obtaining the first instance probability distribution of each first word in each Training document and the second instance probability distribution of each second word in each test document; for details, refer to the related description of step 101 in the method embodiment.
The second acquiring unit 12 is used to obtain the first context similarity of each first word in its corresponding Training document and the first object similarity between the target objects to which the Training documents corresponding to the first words belong.
The third acquiring unit 13 is used to obtain the third entity probability distribution of the corresponding first word based on the first instance probability distribution of each first word, the first context similarity of each first word and the first object similarity of each first word.
The fourth acquiring unit 14 is used to obtain the second context similarity of each second word in its corresponding test document and the second object similarity between the target objects to which the test documents corresponding to the second words belong.
The fifth acquiring unit 15 is used to obtain the fourth entity probability distribution of the corresponding second word based on the second instance probability distribution of each second word, the second context similarity of each second word and the second object similarity of each second word.
In the embodiment of the present invention, the first context similarity indicates the similarity between Training documents and the first object similarity indicates the similarity between the target objects to which the Training documents belong; likewise, the second context similarity indicates the similarity between test documents and the second object similarity indicates the similarity between the target objects to which the test documents belong. Generally, the target objects in a social network may follow one another and the documents published by both sides may also be related, so the first context similarity and the first object similarity can be extracted as features of the social network information.
In the embodiment of the present invention, the acquisition processes of the first context similarity and the second context similarity are identical, as are those of the first object similarity and the second object similarity. Since the acquisition processes of these two similarities are identical, the computation of the third entity probability distribution of the Training documents is also identical to that of the fourth entity probability distribution of the test documents; the embodiment of the present invention is therefore described with respect to the test documents, and the acquisition processes of the second context similarity and the second object similarity are introduced first.
Preferably, the fourth acquiring unit can comprise: a first obtaining subunit, a second obtaining subunit and a third obtaining subunit. Wherein,
the first obtaining subunit is used to obtain the number of second words shared by word bag u and word bag v and the total number of second words in word bag u and word bag v, where word bag u is the word set of the test document corresponding to one second word and word bag v is the word set of the test document corresponding to another second word;
the second obtaining subunit is used to take the ratio of the number of shared second words to the total number of second words as the second context similarity. That is, the second context similarity of u and v can be represented by the Jaccard similarity, which can be defined as:
Jaccard(u, v) = |u ∩ v| / |u ∪ v|
The range of the Jaccard similarity is [0, 1], and the similarity between two test documents is directly proportional to the value of the Jaccard similarity. When two test documents are completely unrelated, that is, when there is no word shared between them, Jaccard(u, v) = 0; if the two test documents are identical, Jaccard(u, v) = 1.
The third obtaining subunit is used to obtain, based on the second context similarity of each test document, the second object similarity between the target objects to which the test documents belong. The second object similarity is represented by the cosine similarity: the two test documents whose similarity is to be measured are first vectorized, and the similarity between the two vectors is then calculated with the cosine formula:
Cosine(u, v) = Σ_i v_i·u_i / (√(Σ_i v_i²) · √(Σ_i u_i²))
The range of the cosine similarity is [-1, 1], and the similarity between two vectors is proportional to the value of the cosine similarity. When the two vectors point in exactly opposite directions, Cosine(u, v) = -1; when the two vectors are mutually perpendicular, that is, when the angle between them is 90°, Cosine(u, v) = 0; when the two vectors point in the same direction, Cosine(u, v) = 1. For text vectors, however, negative components do not occur in the vector space model, so within the vector space model the range of the cosine similarity is [0, 1].
Correspondingly, the structure of the fifth acquiring unit 15, as shown in Figure 4, can comprise: a first probability obtaining subunit 151, a first probability-sum obtaining subunit 152, a second probability obtaining subunit 153, a second probability-sum obtaining subunit 154, a third probability obtaining subunit 155 and a fourth probability obtaining subunit 156.
The first probability obtaining subunit 151 is used to obtain, based on the second instance probability distribution of the second word and the second context similarity of the second word, the probability that the named entity type of the second word belongs to named entity class label c, where named entity class label c belongs to the named entity recognition class tag set C and indicates one kind of named entity type. Specifically, this can be based on the following formula:
p(c|w, u, u′, s, T) = [Σ_{t∈T and t∝w} p(c|t)·γ(t, u′)·ω(w, t) / (Σ_{t′∈T} γ(t′, u′)·ω(w, t′) + θ)] / Z(w, u, u′, s, T)
to obtain the probability that the named entity type of the second word belongs to named entity class label c.
Here w is the second word, s is a test document, u is the target object to which test document s belongs, u′ is a non-target object, T is the entity class distribution set of the second word, p(c|t) is the second instance probability distribution, γ is a 0-1 function used to judge whether the second word w appears in a non-target object u′ followed by the target object u, ω is the second context similarity, θ is a smoothing factor, and Z represents the sum of the probabilities of every named entity class label c in the named entity recognition class tag set C given the second word w, the target object u, the non-target object u′, the test document s and the entity class distribution set T of each second word w.
As can be seen from the above formula, the probability that the named entity type of the second word belongs to named entity class label c is the probability of label c for the second word w given the second word w, the target object u to which it belongs, the non-target object u′, the test document s and the entity class distribution set T of each second word w.
The first probability-sum obtaining subunit 152 is used to obtain, based on the probability that the named entity type of the second word belongs to named entity class label c, the sum of the probabilities over the named entity types of the second word. Specifically, this can be based on the formula:
Z(w, u, u′, S, T) = Σ_{c∈C} p(c|w, u, u′, S, T) = Σ_{c∈C} Σ_{s∈S} β(s, u′)·p(c|w, u, u′, s, T)
to obtain the sum of the probabilities over the named entity types of the second word; that is, Z represents the sum of the probabilities of every named entity class label c in the named entity recognition class tag set C given the second word w, the target object u to which it belongs, the non-target object u′, the test document set S and the entity class distribution set T of each second word w.
Here S is the test document set, and β is a 0-1 function used to judge whether a test document belongs to the non-target object u′; it can be obtained by matching the test documents against the non-target objects one by one.
The second probability obtaining subunit 153 is used to obtain, based on the sum of the probabilities over the named entity types of the second word, the named entity probability distribution of the second word over all test documents. Specifically, this can be based on the formula:
p(c|w, u, u′, S, T) = Σ_{s∈S} β(s, u′)·p(c|w, u, u′, s, T) / Z(w, u, u′, S, T)
to obtain the named entity probability distribution of the second word over all test documents. That is, given the second word w, the target object u to which it belongs, the non-target object u′, the test document set S and the entity class distribution set T of each second word w, the named entity probability distribution over named entity class label c can be expressed as follows: over every test document s in the test document set S, the sum of the probabilities of named entity class label c given the second word w, the target object u, the non-target object u′ and the entity class distribution set T of each second word w, divided by the normalization factor Z.
The second probability-sum obtaining subunit 154 is used to obtain, based on the named entity probability distribution and the second object similarity of the second word, the sum of the probabilities of named entity class label c. Specifically, the sum of the probabilities of named entity class label c can be obtained based on the formula, where U is the set of non-target objects u′, α is a 0-1 function that judges whether a following relation exists between the target object u and the non-target object u′, the second object similarity serves as a weighting factor, and θ is a smoothing factor;
The third probability obtaining subunit 155 is used to obtain, based on the sum of the probabilities of named entity class label c, the probability distribution that the named entity type of the second word belongs to named entity class label c. Specifically, this can be based on the formula:
p(c|w) = p(c|w, u, U, S, T) = Σ_{u′∈U} Σ_{s∈S} Σ_{t∈T and t=w} p(c|w, u, u′, S, T)·p(c|w, u, u′, s, T)
to obtain the probability distribution that the named entity type of the second word belongs to named entity class label c.
The fourth probability obtaining subunit 156 is used to choose, once the probability distributions of the second word's named entity type over the different named entity class labels in the named entity recognition class tag set C have been obtained, the distribution with the maximum value as the fourth entity probability distribution. Specifically, this can be based on the formula:
c = argmax_{c∈C} p(c|w) = argmax_{c∈C} p(c|w, u, U, S, T)
to obtain the fourth entity probability distribution.
That is, for each second word, the probability distribution for any named entity class label in the named entity recognition class tag set C can be obtained through the above formulas; once the probability distributions of all the named entity class labels have been obtained, the distribution with the maximum probability is chosen from among them as the fourth entity probability distribution of the second word.
For example, if there are four named entity class labels in the named entity recognition class tag set C, then four probability distributions p(c|w, u, U, S, T) of the second word w can be obtained based on the above formulas, each probability distribution p(c|w, u, U, S, T) corresponding to one named entity class label, and the probability distribution p(c|w, u, U, S, T) with the maximum probability is chosen from among them as the fourth entity probability distribution.
Correspondingly, when the third entity probability distribution is calculated, the calculation uses the above formulas but is based on the Training documents, the target objects to which the Training documents belong, the non-target objects, the named entity recognition class tag set C and the entity class distribution set T. The value of the above smoothing factor θ should be small, such as θ = 0.01, so that it does not affect the calculation results of the above formulas.
The training unit 16 is used to carry out re-training on the first sequence labelling model based on the third entity probability distribution of each first word, obtaining the second sequence labelling model. That is, the first sequence labelling model is optimized based on the third entity probability distribution, so that the resulting second sequence labelling model better matches the characteristics of social networks and is therefore applicable to social networks.
The training process takes the third entity probability distribution as the observed variables of the Training documents, inputs them into the first sequence labelling model, and optimizes the parameters of the first sequence labelling model to obtain the second sequence labelling model. For example, when the first sequence labelling model is a conditional random field, the third entity probability distribution can be used, following the conditional random field training procedure, to re-optimize the initially constructed conditional random field, and the optimized conditional random field is taken as the second sequence labelling model.
The test unit 17 is used to take the fourth entity probability distribution of each second word in each test document as the observed variables of the corresponding test document, and to carry out sequence labelling on the test document based on the second sequence labelling model and the observed variables of the test document, obtaining the named entity of each second word in the test document. In the embodiment of the present invention, the second sequence labelling model is an existing type of sequence labelling model, such as a conditional random field, so the test documents can be labelled with the existing conditional random field sequence labelling method; the sequence labelling process is not described further in the embodiment of the present invention.
From the above technical scheme it can be seen that, after the named entity recognition device provided by the embodiment of the present invention uses the initially constructed first sequence labelling model to obtain the first instance probability distribution of the Training documents and the second instance probability distribution of the test documents, features can be extracted from the social network information, such as the first context similarity and the first object similarity of the Training documents and the second context similarity and the second object similarity of the test documents. The second sequence labelling model trained on the basis of the first context similarity and the first object similarity of the Training documents is thus better suited to social networks, and when sequence labelling is then carried out on the test documents based on this second sequence labelling model, the named entity recognition results obtained are more accurate.
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the identical or similar parts the embodiments may be referred to one another. Since the device embodiment is substantially similar to the method embodiment, its description is relatively brief; for the relevant details, refer to the description of the method embodiment.
Finally, it should also be noted that in this document relational terms such as first and second are used only to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply that any such actual relation or order exists between these entities or operations. Moreover, the terms "comprise", "include" and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further restriction, an element qualified by the statement "comprising a ..." does not exclude the existence of other identical elements in the process, method, article or device comprising that element.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The above are only preferred embodiments of the present invention. It should be pointed out that those skilled in the art can make further improvements and modifications without departing from the principles of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A named entity recognition method, characterized in that the method comprises:
carrying out sequence labelling on Training documents and test documents based on an initially constructed first sequence labelling model, obtaining the first instance probability distribution of each first word in each Training document and the second instance probability distribution of each second word in each test document;
obtaining the first context similarity of each first word in its corresponding Training document and the first object similarity between the target objects to which the Training documents corresponding to the first words belong;
based on the said first instance probability distribution of each first word, the said first context similarity of each first word and the said first object similarity of each first word, obtaining the third entity probability distribution of the corresponding first word;
obtaining the second context similarity of each second word in its corresponding test document and the second object similarity between the target objects to which the test documents corresponding to the second words belong;
based on the said second instance probability distribution of each second word, the said second context similarity of each second word and the said second object similarity of each second word, obtaining the fourth entity probability distribution of the corresponding second word;
carrying out re-training on the said first sequence labelling model based on the third entity probability distribution of each first word, obtaining a second sequence labelling model;
taking the fourth entity probability distribution of each second word in each test document as the observed variables of the corresponding test document, and carrying out sequence labelling on the said test document based on the said second sequence labelling model and the observed variables of the said test document, obtaining the named entity of each second word in the said test document.
2. The method according to claim 1, characterized in that the said obtaining the second context similarity of each second word in its corresponding test document and the second object similarity between the target objects to which the test documents corresponding to the second words belong comprises:
obtaining the number of second words shared by word bag u and word bag v and the total number of second words in the said word bag u and word bag v, wherein word bag u is the word set of the test document corresponding to one second word and word bag v is the word set of the test document corresponding to another second word;
taking the ratio of the number of the said shared second words to the total number of the said second words as the said second context similarity;
obtaining, based on the second context similarity of each of the said test documents, the second object similarity between the target objects to which the test documents belong.
3. The method according to claim 2, characterized in that the said obtaining the fourth entity probability distribution of the corresponding second word based on the said second instance probability distribution of each second word, the said second context similarity of each second word and the said second object similarity of each second word comprises:
obtaining, based on the second instance probability distribution of the second word and the said second context similarity of the second word, the probability that the named entity type of the second word belongs to named entity class label c, wherein named entity class label c belongs to the named entity recognition class tag set C and indicates one kind of named entity type;
obtaining, based on the probability that the named entity type of the said second word belongs to named entity class label c, the sum of the probabilities over the named entity types of the said second word;
obtaining, based on the sum of the probabilities over the named entity types of the said second word, the named entity probability distribution of the said second word over all test documents;
obtaining, based on the said named entity probability distribution and the said second object similarity of the second word, the sum of the probabilities of named entity class label c;
obtaining, based on the sum of the probabilities of the said named entity class label c, the probability distribution that the named entity type of the second word belongs to named entity class label c;
after the probability distributions of the second word's named entity type over the different named entity class labels in the named entity recognition class tag set C have been obtained, choosing the distribution with the maximum value as the said fourth entity probability distribution.
4. The method according to claim 3, characterized in that obtaining the probability that the named entity type of the second word belongs to named entity class label c, based on the second entity probability distribution and the second context similarity of the second word, comprises:
based on the formula

p(c|w,u,u′,s,T) = [ Σ_{t∈T, t∝w} p(c|t)·γ(t,u′)·ω(w,t) ] / [ Σ_{t′∈T} γ(t′,u′)·ω(w,t′) + θ ] / Z(w,u,u′,s,T)

obtaining the probability that the named entity type of the second word belongs to named entity class label c, wherein w is the second word, s is a test document, u is the target object to which test document s belongs, u′ is a non-target object, T is the set of entity class distributions of the second word, p(c|t) is the second entity probability distribution, γ is a 0-1 function that judges whether the second word w appears in the non-target object u′ followed by the target object u, ω is the second context similarity, θ is a smoothing factor, and Z(w,u,u′,s,T) is the sum, over the named entity class labels c in the named entity recognition class label set C, of the probabilities given the second word w, the target object u, the non-target object u′, the test document s and the entity class distribution set T;
said obtaining the probability sum over the named entity types of the second word, based on the probability that the named entity type of the second word belongs to named entity class label c, comprises:
based on the formula

Z(w,u,u′,S,T) = Σ_{c∈C} p(c|w,u,u′,S,T) = Σ_{c∈C} Σ_{s∈S} β(s,u′)·p(c|w,u,u′,s,T)

obtaining the probability sum over the named entity types of the second word, wherein S is the set of test documents and β is a 0-1 function that judges whether a test document belongs to the non-target object u′;
said obtaining the named entity probability distribution of the second word over all test documents, based on the probability sum over the named entity types of the second word, comprises:
based on the formula

p(c|w,u,u′,S,T) = [ Σ_{s∈S} β(s,u′)·p(c|w,u,u′,s,T) ] / Z(w,u,u′,S,T)

obtaining the named entity probability distribution of the second word over all test documents.
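The claim-4 formulas can be sketched together in Python. This is an illustration under simplifying assumptions, not the patent's implementation: γ and ω are passed in as callables, `matches(t, w)` stands in for the `t ∝ w` condition, and all names and toy values are ours.

```python
def p_label(c, w, u_prime, T, p_c_t, gamma, omega, matches, theta=1e-3):
    """Unnormalized claim-4 probability that the named entity type of second
    word w belongs to label c, combining the entity-class distributions t
    weighted by the 0-1 function gamma and the second context similarity omega."""
    num = sum(p_c_t[t].get(c, 0.0) * gamma(t, u_prime) * omega(w, t)
              for t in T if matches(t, w))
    den = sum(gamma(t2, u_prime) * omega(w, t2) for t2 in T) + theta
    return num / den

def label_distribution(w, u_prime, T, C, p_c_t, gamma, omega, matches):
    """Normalize over all labels in C, i.e. divide by Z(w, u, u', s, T)."""
    scores = {c: p_label(c, w, u_prime, T, p_c_t, gamma, omega, matches)
              for c in C}
    Z = sum(scores.values()) or 1.0
    return {c: s / Z for c, s in scores.items()}

dist = label_distribution(
    "w", "u'",
    T=["t1", "t2"],
    C=["PER", "ORG", "LOC"],
    p_c_t={"t1": {"PER": 0.7, "ORG": 0.3}, "t2": {"ORG": 0.9, "LOC": 0.1}},
    gamma=lambda t, u_prime: 1,   # assume every t is visible to u'
    omega=lambda w, t: 0.5,       # constant context similarity for the sketch
    matches=lambda t, w: True,    # assume every t is associated with w
)
```

With these toy inputs the normalized distribution sums to one and ORG, being supported by both entity-class distributions, receives the largest probability.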
5. The method according to claim 4, characterized in that said obtaining the probability sum of named entity class label c, based on the named entity probability distribution and the second object similarity of the second word, comprises:
based on a formula (rendered as an image in the source publication and not reproduced here),
obtaining the probability sum of named entity class label c, wherein U is the set of non-target objects u′, α is a 0-1 function that judges whether a follow relation exists between the target object u and the non-target object u′, a symbol not reproduced here denotes the second object similarity, and θ is a smoothing factor;
said obtaining the probability distribution that the named entity type of the second word belongs to named entity class label c, based on the probability sum of named entity class label c, comprises:
based on the formula

p(c|w) = p(c|w,u,U,S,T) = Σ_{u′∈U} Σ_{s∈S} Σ_{t∈T, t=w} p(c|w,u,u′,S,T)·p(c|w,u,u′,s,T)

obtaining the probability distribution that the named entity type of the second word belongs to named entity class label c;
said selecting the probability distribution with the largest value as the 4th entity probability distribution, after the probability distributions of the named entity type of the second word over the different named entity class labels in the named entity recognition class label set C have been obtained, comprises:
based on the formula

c = argmax_{c∈C} p(c|w) = argmax_{c∈C} p(c|w,u,U,S,T)

obtaining the 4th entity probability distribution.
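The final selection in claim 5 is a plain argmax over the label set C. A one-line sketch (the function name and toy distribution are ours):

```python
def fourth_entity_label(p_c_w):
    """Claim-5 argmax: the label c in C that maximizes p(c|w)."""
    return max(p_c_w, key=p_c_w.get)

best = fourth_entity_label({"PER": 0.35, "ORG": 0.60, "LOC": 0.05})
```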
6. A named entity recognition device, characterized in that the device comprises:
a first acquiring unit, configured to carry out sequence labelling on training documents and test documents based on an initially constructed first sequence labelling model, to obtain the first entity probability distribution of each first word in each training document and the second entity probability distribution of each second word in each test document;
a second acquiring unit, configured to obtain the first context similarity of each first word in its corresponding training document and the first object similarity between the target objects to which the training documents corresponding to the first words belong;
a 3rd acquiring unit, configured to obtain the 3rd entity probability distribution of a corresponding first word, based on the first entity probability distribution, the first context similarity and the first object similarity of each first word;
a 4th acquiring unit, configured to obtain the second context similarity of each second word in its corresponding training document and the second object similarity between the target objects to which the training documents corresponding to the second words belong;
a 5th acquiring unit, configured to obtain the 4th entity probability distribution of a corresponding second word, based on the second entity probability distribution, the second context similarity and the second object similarity of each second word;
a training unit, configured to retrain the first sequence labelling model based on the 3rd entity probability distribution of each first word, to obtain a second sequence labelling model;
a test unit, configured to take the 4th entity probability distribution of each second word in each test document as the observation variable of the corresponding test document, and to carry out sequence labelling on each test document based on the second sequence labelling model and the observation variables of that test document, to obtain the named entity of each second word in the test document.
7. The device according to claim 6, characterized in that the 4th acquiring unit comprises:
a first obtaining subunit, configured to obtain the number of second words shared by word bag u and word bag v and the total number of second words in word bag u and word bag v, wherein word bag u is the word set of the training document corresponding to one second word, and word bag v is the word set of the training document corresponding to another second word;
a second obtaining subunit, configured to take the ratio of the number of shared second words to the total number of second words as the second context similarity;
a 3rd obtaining subunit, configured to obtain, based on the second context similarity of each training document, the second object similarity between the target objects to which the training documents belong.
8. The device according to claim 7, characterized in that the 5th acquiring unit comprises:
a first probability obtaining subunit, configured to obtain, based on the second entity probability distribution and the second context similarity of a second word, the probability that the named entity type of the second word belongs to named entity class label c, wherein named entity class label c belongs to the named entity recognition class label set C and indicates one type of named entity;
a first probability-sum obtaining subunit, configured to obtain, based on the probability that the named entity type of the second word belongs to named entity class label c, the probability sum over the named entity types of the second word;
a second probability obtaining subunit, configured to obtain, based on the probability sum over the named entity types of the second word, the named entity probability distribution of the second word over all test documents;
a second probability-sum obtaining subunit, configured to obtain, based on the named entity probability distribution and the second object similarity of the second word, the probability sum of named entity class label c;
a 3rd probability obtaining subunit, configured to obtain, based on the probability sum of named entity class label c, the probability distribution that the named entity type of the second word belongs to named entity class label c;
a 4th probability obtaining subunit, configured to select, once the probability distributions of the named entity type of the second word over the different named entity class labels in the named entity recognition class label set C have been obtained, the probability distribution with the largest value as the 4th entity probability distribution.
9. The device according to claim 8, characterized in that the first probability obtaining subunit is configured to obtain, based on the formula

p(c|w,u,u′,s,T) = [ Σ_{t∈T, t∝w} p(c|t)·γ(t,u′)·ω(w,t) ] / [ Σ_{t′∈T} γ(t′,u′)·ω(w,t′) + θ ] / Z(w,u,u′,s,T)

the probability that the named entity type of the second word belongs to named entity class label c, wherein w is the second word, s is a test document, u is the target object to which test document s belongs, u′ is a non-target object, T is the set of entity class distributions of the second word, p(c|t) is the second entity probability distribution, γ is a 0-1 function that judges whether the second word w appears in the non-target object u′ followed by the target object u, ω is the second context similarity, θ is a smoothing factor, and Z(w,u,u′,s,T) is the sum, over the named entity class labels c in the named entity recognition class label set C, of the probabilities given the second word w, the target object u, the non-target object u′, the test document s and the entity class distribution set T;
the first probability-sum obtaining subunit is configured to obtain, based on the formula

Z(w,u,u′,S,T) = Σ_{c∈C} p(c|w,u,u′,S,T) = Σ_{c∈C} Σ_{s∈S} β(s,u′)·p(c|w,u,u′,s,T)

the probability sum over the named entity types of the second word, wherein S is the set of test documents and β is a 0-1 function that judges whether a test document belongs to the non-target object u′;
the second probability obtaining subunit is configured to obtain, based on the formula

p(c|w,u,u′,S,T) = [ Σ_{s∈S} β(s,u′)·p(c|w,u,u′,s,T) ] / Z(w,u,u′,S,T)

the named entity probability distribution of the second word over all test documents.
10. The device according to claim 9, characterized in that the second probability-sum obtaining subunit is configured to obtain, based on a formula (rendered as an image in the source publication and not reproduced here), the probability sum of named entity class label c, wherein U is the set of non-target objects u′, α is a 0-1 function that judges whether a follow relation exists between the target object u and the non-target object u′, a symbol not reproduced here denotes the second object similarity, and θ is a smoothing factor;
the 3rd probability obtaining subunit is configured to obtain, based on the formula

p(c|w) = p(c|w,u,U,S,T) = Σ_{u′∈U} Σ_{s∈S} Σ_{t∈T, t=w} p(c|w,u,u′,S,T)·p(c|w,u,u′,s,T)

the probability distribution that the named entity type of the second word belongs to named entity class label c;
the 4th probability obtaining subunit is configured to obtain the 4th entity probability distribution based on the formula

c = argmax_{c∈C} p(c|w) = argmax_{c∈C} p(c|w,u,U,S,T).
CN201510889318.4A 2015-12-07 2015-12-07 Named entity identification method and device Active CN105550227B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510889318.4A CN105550227B (en) 2015-12-07 2015-12-07 Named entity identification method and device


Publications (2)

Publication Number Publication Date
CN105550227A true CN105550227A (en) 2016-05-04
CN105550227B CN105550227B (en) 2020-05-22

Family

ID=55829416

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510889318.4A Active CN105550227B (en) 2015-12-07 2015-12-07 Named entity identification method and device

Country Status (1)

Country Link
CN (1) CN105550227B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106503192A (en) * 2016-10-31 2017-03-15 北京百度网讯科技有限公司 Name entity recognition method and device based on artificial intelligence
CN108304933A (en) * 2018-01-29 2018-07-20 北京师范大学 A kind of complementing method and complementing device of knowledge base
CN108717410A (en) * 2018-05-17 2018-10-30 达而观信息科技(上海)有限公司 Name entity recognition method and system
CN110096695A (en) * 2018-01-30 2019-08-06 腾讯科技(深圳)有限公司 Hyperlink label method and apparatus, file classification method and device
CN110298043A (en) * 2019-07-03 2019-10-01 吉林大学 A kind of vehicle name entity recognition method and system
CN110851597A (en) * 2019-10-28 2020-02-28 青岛聚好联科技有限公司 Method and device for sentence annotation based on similar entity replacement
CN111178073A (en) * 2018-10-23 2020-05-19 北京嘀嘀无限科技发展有限公司 Text processing method and device, electronic equipment and storage medium
CN111339773A (en) * 2018-12-18 2020-06-26 富士通株式会社 Information processing method, natural language processing method, and information processing apparatus
CN112825112A (en) * 2019-11-20 2021-05-21 阿里巴巴集团控股有限公司 Data processing method and device and computer terminal

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102314417A (en) * 2011-09-22 2012-01-11 西安电子科技大学 Method for identifying Web named entity based on statistical model
JP2014119977A (en) * 2012-12-17 2014-06-30 Nippon Telegr & Teleph Corp <Ntt> Daily word extractor, method, and program
US20150186355A1 (en) * 2013-12-26 2015-07-02 International Business Machines Corporation Adaptive parser-centric text normalization
CN104899304A (en) * 2015-06-12 2015-09-09 北京京东尚科信息技术有限公司 Named entity identification method and device
CN104933152A (en) * 2015-06-24 2015-09-23 北京京东尚科信息技术有限公司 Named entity recognition method and device




Similar Documents

Publication Publication Date Title
CN105550227A (en) Named entity identification method and device
Yao et al. Sensing spatial distribution of urban land use by integrating points-of-interest and Google Word2Vec model
CN104463548B (en) A kind of acknowledgement of consignment Quantitatively Selecting method under multifactor impact
CN108229590A (en) A kind of method and apparatus for obtaining multi-tag user portrait
CN104346440A (en) Neural-network-based cross-media Hash indexing method
CN103020482A (en) Relation-based spam comment detection method
CN105205096A (en) Text modal and image modal crossing type data retrieval method
CN106485271A (en) A kind of zero sample classification method based on multi-modal dictionary learning
Tanudjaja et al. Exploring bibliometric mapping in NUS using BibExcel and VOSviewer
CN103605970A (en) Drawing architectural element identification method and system based on machine learning
CN108256914A (en) A kind of point of interest category forecasting method based on tensor resolution model
CN106934035A (en) Concept drift detection method in a kind of multi-tag data flow based on class and feature distribution
CN105740382A (en) Aspect classification method for short comment texts
CN104103011B (en) Suspicious taxpayer recognition method based on taxpayer interest incidence network
CN103324708A (en) Method of transfer learning from long text to short text
CN104881796A (en) False comment judgment system based on comment content and ID recognition
CN104598510A (en) Event trigger word recognition method and device
CN105488522A (en) Search engine user information demand satisfaction evaluation method capable of integrating multiple views and semi-supervised learning
CN102945372B (en) Classifying method based on multi-label constraint support vector machine
Qianqian et al. The China-Pakistan economic corridor: The Pakistani media attitudes perspective
CN103123685B (en) Text mode recognition method
CN103793474B (en) Knowledge management oriented user-defined knowledge classification method
CN103942224B (en) A kind of method and device for the mark rule obtaining web page release
Guo et al. Bifurcation analysis of an age-structured alcoholism model
CN107644101A (en) Information classification approach and device, information classification equipment and computer-readable medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant