CN102314417A

CN102314417A - Method for identifying Web named entity based on statistical model

Info

Publication number: CN102314417A
Application number: CN201110284429A
Authority: CN
Inventors: 王静; 刘志镜; 曲建铭; 王燕; 贺文华; 王炜华; 王纵虎; 陈东辉; 姚勇; 朱旭东; 赵辉
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2011-09-22
Filing date: 2011-09-22
Publication date: 2012-01-11

Abstract

The invention discloses a method for recognizing Web named entities based on a statistical model. The method comprises the following steps: representing the multiple features of a Web named entity with structural features and text features; combining a statistical method with a rule-based method and adopting an improved MR-GHMM (MR-Generalized Hidden Markov Model) to increase training efficiency; tagging entities with the improved GHMM, so that every named entity is tagged and entity recognition is achieved; and handling the recognition of complex Web named entities in two layers, with the tagging result of the first layer serving as the input of the second layer, which recognizes complex nested entities. Compared with the original recognition algorithm, the algorithm used in the method has higher recognition accuracy and a much lower time complexity of model training. Because the multiple features of Web named entities are represented and the entity features can be modified for different domains, named entities from different domains on the Web can be recognized.

Description

Web named entity recognition method based on statistical model

Technical field

The invention belongs to the technical field of natural language processing, relates mainly to Web information extraction, and relates in particular to Web named entity recognition. Specifically, it is a Web named entity recognition method based on a statistical model, used mainly to recognize Web named entities and thereby to acquire and preprocess Web page information.

Background technology

Web named entity recognition technology mainly obtains basic data from the information on Web pages. The content of a page can be identified from the data obtained, and subsequent applications such as information extraction, automatic question answering, and machine translation all need named entity recognition as technical support; it is also basic work in natural language processing. Now that network technology is developing rapidly and is widely used in every field, research on it is very important. In general, named entity recognition takes one or more texts to be processed and identifies the named entities that occur in them, such as person names, place names, organization names, dates and times, and numbers.

At present, English named entity recognition achieves good results. Research and development have concentrated mainly on machine learning, including hidden Markov models, maximum entropy models, and support vector machines, and some systems are in practical use. At the Seventh Message Understanding Conference (MUC-7), the best English named entity recognition system reached a recall of 95% and a precision of 92%. Chinese named entity recognition performs considerably worse by comparison. At the Second Multilingual Entity Task evaluation (MET-2), the best Chinese named entity recognition system had precisions of 66%, 89%, and 89% for person names, place names, and organization names respectively, with recalls of 92%, 91%, and 88%.

In terms of method, Chinese named entity recognition currently relies mainly on two approaches: rule-based and statistical. Rule-based methods generally recognize named entities through triggering by feature characters or feature words. Statistical methods mainly perform statistical analysis of named entities and their contexts in a large-scale corpus and build a statistical model for named entity recognition.

Early Chinese named entity recognition models comprised several sub-models, each handling one type of entity; for example, person names might be recognized with a rule-based method while place names and organization names were recognized with a statistical method. The statistical models used include hidden Markov models, probabilistic context-free grammars, decision-tree-based language models, maximum entropy language models, and conditional random field models. Various improved models appeared subsequently that handle different entity types with a unified model.

Traditional recognition methods do not consider the structural display features that the entities to be recognized exhibit on the Web, so their feature representation of Web entities is incomplete. In addition, when recognizing Web named entities, traditional methods build a different model for each entity type, and therefore cannot handle the case in which it must be decided whether a string is an independent entity or a component of another, complex entity. Building several models also greatly increases the time complexity of recognition. Finally, traditional methods need a large amount of text data during training, so the model depends too heavily on the size of the training text set, and existing named entity recognition models spend too much time on training samples.

At present, Chinese named entity recognition works fairly well for simple entities, but for complex entities, and especially for nested complex entities, recognition efficiency and accuracy are low.

The project team of the invention searched domestic and foreign patent documents and published journal articles and has not found any report or document closely related to or identical with the invention.

Summary of the invention

The invention is a named entity recognition method based on a statistical model. It mainly preprocesses Web documents and provides a basis for subsequent information extraction, machine translation, and question answering systems. The invention targets named entities on the Web and uses a statistical model to recognize them. The main problem the invention solves is that existing recognition of Chinese named entities on the Web, especially of complex entities, is not accurate enough.

The invention is described in detail below.

The invention is a Web named entity recognition method based on a statistical model, characterized in that the method comprises the following steps:

A. Preprocess the raw corpus of Web text by word segmentation and map the original text onto an abstract symbol set, which gives the text a symbolic description in preparation for subsequent machine learning;

B. Establish the corresponding structural feature and text feature representations for the named entities, build a feature library of named entities, and use the multi-feature vector representation MFVSM to extract features for each named entity on the Web page;

C. Apply a probabilistic statistical algorithm to build the MR-GHMM model, and use the generalized Baum-Welch algorithm to compute the initial state probabilities, state transition probabilities, and state emission probabilities, that is, solve the learning problem of the MR-GHMM;

D. Taking into account the multiple features of Web named entities, introduce an improved back-off model into the computation of the GHMM model; use the Viterbi algorithm to select, from all possible tag sequences, the one with the highest probability as the final tagging result; and tag every named entity, achieving Web named entity recognition suited to multiple features;

E. The MR-GHMM model handles the Web named entity recognition process in two layers. The first layer tags simple entities. The second layer recognizes complex nested entities: MR-GHMM computes their transition probabilities, the tagging result of the first layer serves as the input of the second layer, and complex nested entities are recognized on the basis of the simple entities identified by the first layer.
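Step A above can be illustrated with a minimal sketch. The source only states that segmentation uses a basic dictionary and that the text is mapped onto an abstract symbol set; the forward maximum matching strategy, the tiny `lexicon`, and the integer symbol ids below are assumptions chosen for illustration, not the patent's stated procedure.

```python
from typing import Dict, List, Set


def segment(text: str, lexicon: Set[str], max_len: int = 6) -> List[str]:
    """Forward maximum matching: take the longest lexicon entry at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def to_symbols(tokens: List[str], symbol_table: Dict[str, int]) -> List[int]:
    """Map each token to an integer id, adding unseen tokens to the table."""
    return [symbol_table.setdefault(tok, len(symbol_table)) for tok in tokens]


# Hypothetical basic dictionary for a recruitment page.
lexicon = {"上海", "分公司", "招聘", "工程师"}
tokens = segment("上海分公司招聘工程师", lexicon)
symbols = to_symbols(tokens, {})
```

Here `tokens` holds the segmented words and `symbols` the abstract symbol sequence handed to the later learning steps.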

Existing methods generally describe entity features with a single text feature, and they build a model for each individual entity type. Existing named entity recognition models also spend too much time on training samples. The invention improves the maximum-probability solution of the GHMM statistical model and thereby optimizes training efficiency. In addition, in view of the characteristics of Web entities, it represents Web named entities with multiple features, combining the structural features and the text features of an entity, which improves recognition accuracy.

The invention is further realized in that the named entity feature extraction in step B comprises the following steps:

B1. First, represent the display style of the Web named entities on the page, forming a structural feature vector;

B2. Then represent the text features of the Web named entities on the page, converting the text features into a finite feature vector;

B3. Train on the sample data and use MFVSM to build the multi-feature vector representation of each named entity on the Web page, thereby achieving feature extraction for the named entities.

The invention combines the structural features and the text features of Web text into a multi-feature representation of an entity, and can therefore express the features of entities in Web text more completely. This lays a solid foundation for subsequent entity recognition.

The invention is further realized in that building the MR-GHMM model in step C comprises the following steps:

C1. Compute the parameters of the MR-GHMM model;

C2. Train on the raw corpus according to the feature representations established in the feature library to obtain the transition probabilities of the named entities, and thereby the probability P of the model;

C3. For a given model λ, find the state transition sequence Q that maximizes P(O, Q | λ).
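For orientation, the parameter estimation in steps C1 and C2 can be related to the standard Baum-Welch re-estimation formulas for an ordinary HMM. The source does not give the generalized update rules of MR-GHMM, so the block below is only the textbook case, with α and β the forward and backward variables, a_{ij} the transition probabilities, and b_j(k) the emission probabilities.

```latex
\gamma_t(i) = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{j}\alpha_t(j)\,\beta_t(j)},
\qquad
\xi_t(i,j) = \frac{\alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)}{P(O \mid \lambda)}

\hat{\pi}_i = \gamma_1(i),
\qquad
\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1}\xi_t(i,j)}{\sum_{t=1}^{T-1}\gamma_t(i)},
\qquad
\hat{b}_j(k) = \frac{\sum_{t:\,o_t = k}\gamma_t(j)}{\sum_{t=1}^{T}\gamma_t(j)}
```

Iterating these updates does not decrease P(O | λ), which is the sense in which the learning problem of step C is solved in the standard setting.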

The hidden Markov model (HMM) is a statistical model widely used in natural language processing. Extending it to the generalized hidden Markov model (GHMM) allows more feature representations to be taken into account, which makes it better suited to recognizing complex multi-feature entities in Chinese named entity recognition, such as person names, place names, and organization names.
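Step C3, finding the state sequence Q that maximizes P(O, Q | λ), is the classical decoding problem. The following is a minimal sketch of Viterbi decoding for a plain first-order HMM; the states, probabilities, and English tokens are toy values invented for illustration, and the multi-feature extensions of GHMM are not modeled.

```python
import math
from typing import Dict, List


def viterbi(obs: List[str], states: List[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the most probable state sequence for obs under a first-order HMM."""
    def lg(p: float) -> float:
        return math.log(max(p, 1e-12))  # floor avoids log(0) for unseen events

    score = [{s: lg(start_p.get(s, 0.0)) + lg(emit_p[s].get(obs[0], 0.0))
              for s in states}]
    back: List[Dict[str, str]] = [{}]
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            # Best predecessor of state s at time t.
            prev = max(states, key=lambda r: score[t - 1][r] + lg(trans_p[r].get(s, 0.0)))
            score[t][s] = (score[t - 1][prev] + lg(trans_p[prev].get(s, 0.0))
                           + lg(emit_p[s].get(obs[t], 0.0)))
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


states = ["LOC", "O"]
start_p = {"LOC": 0.5, "O": 0.5}
trans_p = {"LOC": {"LOC": 0.2, "O": 0.8}, "O": {"LOC": 0.3, "O": 0.7}}
emit_p = {"LOC": {"Shanghai": 0.9, "hires": 0.05, "engineers": 0.05},
          "O": {"Shanghai": 0.1, "hires": 0.45, "engineers": 0.45}}
best = viterbi(["Shanghai", "hires", "engineers"], states, start_p, trans_p, emit_p)
```

Working in log space keeps the products of small probabilities from underflowing on long sequences.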

The invention is further realized in that the recognition of Web named entities in step D comprises the following steps:

D1. Tag feature words automatically with the Viterbi algorithm, that is, select from all possible tag sequences the one with the highest probability as the final tagging result;

D2. P(T^n) is computed with a probabilistic n-gram language model, which gives the probability of a sentence T^n = (t_1, t_2, ..., t_n):

T^{\#} = \underset{T}{\arg\max} \log P(T^{n} | G^{n}) = \underset{T}{\arg\max} \left( \log P(T^{n}) - \sum_{i=1}^{n} \log P(t_{i}) + \sum_{i=1}^{n} \log P(t_{i} | G^{n}) \right)

where T^n = (t_1, t_2, ..., t_n) is one possible feature-word tag sequence for G^n = (g_1, g_2, ..., g_n);

D3. The required conditional probability is computed with an improved back-off model, which is expressed as follows:

P_{bo}(h | h', g) = \begin{cases} P_{GT}(h | h', g) & \text{if } C(h, h', g) > 0 \\ \alpha(h', g) \, P_{bo}(h | h') & \text{otherwise} \end{cases}

where P_{bo}(h | h') is the probability formula of the trigram language model, h = t_{i-n+1} ... t_{i-1}, and h' = t_{i-n+2} ... t_{i-1}.
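The decomposition used in D2 can be obtained in a few lines. The source does not spell out its derivation; the version below is one standard route, and the independence assumption in the second line (the mutual information between the tag sequence and the observations splits into per-tag terms) is stated here as an assumption rather than taken from the patent.

```latex
\log P(T^{n} \mid G^{n})
  = \log P(T^{n}) + \log \frac{P(T^{n}, G^{n})}{P(T^{n})\,P(G^{n})}

\text{Assume: } \log \frac{P(T^{n}, G^{n})}{P(T^{n})\,P(G^{n})}
  = \sum_{i=1}^{n} \log \frac{P(t_{i}, G^{n})}{P(t_{i})\,P(G^{n})}
  = \sum_{i=1}^{n} \log P(t_{i} \mid G^{n}) - \sum_{i=1}^{n} \log P(t_{i})

\Rightarrow\quad
\log P(T^{n} \mid G^{n})
  = \log P(T^{n}) - \sum_{i=1}^{n} \log P(t_{i}) + \sum_{i=1}^{n} \log P(t_{i} \mid G^{n})
```

The first term is supplied by the n-gram language model of D2, and the per-tag terms are estimated from counts, which is where smoothing such as the back-off model of D3 is needed.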

By exploiting the multiple features of Web named entities, the invention introduces the GHMM model to build a probabilistic model of Web named entities. During recognition, the invention introduces an improved back-off model to reduce the computational complexity of the model and thereby improve the efficiency of Web named entity recognition.
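The back-off idea can be sketched in code. The sketch below estimates P(tag | previous tag, feature g) from counts when the full event has been seen and otherwise backs off to P(tag | previous tag) scaled by a weight α. Two simplifications are assumptions of this sketch and not the patent's model: a fixed absolute discount replaces the Good-Turing estimate P_GT, and the lower-order estimate is add-one smoothed. The class name and the toy events are hypothetical.

```python
from collections import Counter


class BackoffTagModel:
    """P(tag | prev, g), backing off to P(tag | prev) for unseen events."""

    def __init__(self, events, discount=0.5):
        events = list(events)                    # triples (prev_tag, feature g, tag)
        self.discount = discount                 # assumed stand-in for Good-Turing
        self.c3 = Counter(events)                                # C(prev, g, tag)
        self.c3_ctx = Counter((p, g) for p, g, _ in events)      # C(prev, g)
        self.c2 = Counter((p, t) for p, _, t in events)          # C(prev, tag)
        self.c2_ctx = Counter(p for p, _, _ in events)           # C(prev)
        self.tags = sorted({t for _, _, t in events})

    def lower(self, tag, prev):
        """Lower-order estimate P(tag | prev), add-one smoothed so it is never zero."""
        return (self.c2[(prev, tag)] + 1) / (self.c2_ctx[prev] + len(self.tags))

    def prob(self, tag, prev, g):
        """Back-off estimate; tag is expected to be one of self.tags."""
        total = self.c3_ctx[(prev, g)]
        if total == 0:                           # context never seen: pure back-off
            return self.lower(tag, prev)
        count = self.c3[(prev, g, tag)]
        if count > 0:                            # seen event: discounted estimate
            return (count - self.discount) / total
        seen = {t for t in self.tags if self.c3[(prev, g, t)] > 0}
        reserved = self.discount * len(seen) / total   # mass freed by discounting
        unseen_mass = sum(self.lower(t, prev) for t in self.tags if t not in seen)
        alpha = reserved / unseen_mass           # back-off weight for this context
        return alpha * self.lower(tag, prev)


events = [("START", "cap", "LOC"), ("START", "cap", "LOC"), ("START", "cap", "ORG"),
          ("LOC", "suffix", "ORG"), ("START", "low", "O")]
model = BackoffTagModel(events)
total_mass = sum(model.prob(t, "START", "cap") for t in model.tags)
```

With this choice of α, the probabilities over all tags sum to one whenever a seen context has at least one unseen tag, as in the toy data above.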

The invention is further realized in that the MR-GHMM model handles the Web named entity recognition process in two layers. The first layer tags simple entities. The second layer recognizes complex nested entities: MR-GHMM computes their transition probabilities, the tagging result of the first layer serves as the input of the second layer, and complex nested entities are recognized on the basis of the simple entities identified by the first layer.

The invention combines a statistical method with a rule-based method, performing named entity recognition with a method that combines the improved MR-GHMM and the multi-feature representation of Web named entities. The model is divided into two layers in order to recognize complex named entities.

In general, the multi-feature representation of Web named entities and the improved statistical model proposed by the invention achieve better recognition of Web named entities. Because the proposed method overcomes some shortcomings of traditional methods, it recognizes Web named entities better and also improves recognition efficiency.

Compared with the prior art, the invention has the following advantages:

(1) Traditional methods do not take the structural features of the Web into account in their feature representation. The invention considers the structural features of the Web together with the text features and describes the features of Web entities more completely, thereby improving the accuracy of Web named entity recognition.

(2) Traditional methods model each entity type separately. The invention combines a multi-layer hidden Markov model with entity feature tables, places the recognition of the various named entity types under a unified framework, and recognizes complex named entities with a two-layer model, so recognition is more efficient.

(3) During modeling, the invention introduces an improved back-off model to reduce the computational complexity of the model, and represents named entities by combining their text features and structural features, thereby improving the efficiency of Web named entity recognition.

(4) The invention combines a statistical method with a rule-based method, using a recognition method that combines the improved MR-GHMM with the multi-feature representation of Web named entities, and so recognizes named entities better.

Description of drawings:

Fig. 1 is a flow diagram of named entity recognition with MR-GHMM according to the invention;

Fig. 2 is a graph of the extraction performance (F value) of the invention for named entities with different numbers of training samples.

Embodiments:

The invention is described in detail below with reference to the accompanying drawings:

Embodiment 1:

The invention is a named entity recognition method based on a statistical model. It mainly preprocesses the Web documents on Web pages and provides a basis for subsequent information extraction, machine translation, and question answering systems.

Taking a recruitment website as an example, the invention uses a statistical model to recognize named entities in recruitment information on the Web. The named entities in recruitment information are mainly of four types: place, time, organization, and position. The experimental flow of recognition is shown in Fig. 1. The experimental data of this example, listed in Table 1, come from Zhaopin.com Web pages; recruitment pages of six categories were selected, including computing, biomedicine, construction, environmental protection, mechanical and chemical engineering, and secretarial work. Position names, recruiting organization names, work places, and recruitment times were extracted from these pages as entities, using the improved MR-GHMM recognition method.

Table 1 experiment sample collection

In view of the nested nature of position names and organization names, the invention first recognizes simple named entities on the basis of the word segmentation result, and then passes the result of the lower-layer MR-GHMM recognition to the higher-layer MR-GHMM to recognize nested named entities. In this process, the invention uses Chinese named entity recognition based on multiple feature tables. The whole recognition flow is shown in Fig. 1.

The recognition method of the invention comprises the following steps:

A. Preprocess the raw corpus of Web text by word segmentation: segment the text according to a basic dictionary, apply symbolization preprocessing to the raw corpus, and map the original text onto an abstract symbol set, which gives the text a symbolic description in preparation for subsequent machine learning.

B. Establish the corresponding structural feature and text feature representations for the named entities, build a feature library of named entities, and use the multi-feature vector representation MFVSM to extract features for each named entity on the Web page.

B1. First, represent the display style of the Web named entities on the page, forming a structural feature vector, which gives the structural features of the Web named entities.

B2. Then represent the text features of the Web named entities on the page, converting the text features into a finite feature vector, which gives the text features of the Web named entities.

B3. Train on the sample data and use MFVSM to build the multi-feature vector representation of each named entity on the Web page, thereby achieving feature extraction for the named entities.

C1. Compute the parameters of the MR-GHMM model:

D2. P(T^n) is computed with a probabilistic n-gram language model, which gives the probability of a sentence T^n = (t_1, t_2, ..., t_n):

T^{\#} = \underset{T}{\arg\max} \log P(T^{n} | G^{n}) = \underset{T}{\arg\max} \left( \log P(T^{n}) - \sum_{i=1}^{n} \log P(t_{i}) + \sum_{i=1}^{n} \log P(t_{i} | G^{n}) \right)

D3. The required conditional probability is computed with an improved back-off model, which is expressed as follows:

P_{bo}(h | h', g) = \begin{cases} P_{GT}(h | h', g) & \text{if } C(h, h', g) > 0 \\ \alpha(h', g) \, P_{bo}(h | h') & \text{otherwise} \end{cases}

where P_{bo}(h | h') is the probability formula of the trigram language model, h = t_{i-n+1} ... t_{i-1}, and h' = t_{i-n+2} ... t_{i-1}.

E. The model handles the Web named entity recognition process in two layers. The first layer tags simple entities, and its tagging result serves as the input of the second layer. The GHMM computes the transition probabilities, and the recognized place names are passed, as a class of organization-name components, to the organization-recognition GHMM, which then recognizes the organizations. The procedure is divided into the following two steps:

E1. First recognize date entities, place name entities, and simple non-nested named entities. The simple date and place name entities are recognized and tagged first: every recognized date is wrapped in the tags <DATA> and </DATA>, and every recognized place name in <LOC> and </LOC>, which yields the first-stage, that is, first-layer, text tag set;

E2. On the basis of this first-layer text tag set, and according to the feature representations of position names and organization names, tag with the second-layer model of MR-GHMM: every recognized position name is wrapped in <POS> and </POS>, and every recognized organization name in <ORG> and </ORG>, which completes the recognition of all named entities.
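The data flow of steps E1 and E2 can be sketched as follows. This is only an illustration of how the first layer's tags feed the second layer: simple lexicon and pattern lookups stand in for the two MR-GHMM layers, and the gazetteer `PLACES`, the marker words `ORG_SUFFIXES`, the date pattern, and the four-token look-ahead window are all hypothetical choices.

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]  # (text, label); the label "O" means untagged

PLACES = {"Shanghai", "Xi'an"}                       # hypothetical gazetteer
ORG_SUFFIXES = {"Co., Ltd.", "Group", "Institute"}   # hypothetical marker words
DATE_PATTERN = r"\d{4}-\d{2}-\d{2}"                  # hypothetical date format


def layer_one(words: List[str]) -> List[Token]:
    """First layer: tag simple, non-nested entities (dates and place names)."""
    out = []
    for w in words:
        if re.fullmatch(DATE_PATTERN, w):
            out.append((w, "DATA"))
        elif w in PLACES:
            out.append((w, "LOC"))
        else:
            out.append((w, "O"))
    return out


def layer_two(tokens: List[Token]) -> List[Token]:
    """Second layer: merge a LOC token and a nearby marker word into one ORG entity."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i][1] == "LOC":
            # Look at most four tokens ahead for an organization marker word.
            end = next((j for j in range(i + 1, min(i + 5, len(tokens)))
                        if tokens[j][0] in ORG_SUFFIXES), None)
            if end is not None:
                out.append((" ".join(t for t, _ in tokens[i:end + 1]), "ORG"))
                i = end + 1
                continue
        out.append(tokens[i])
        i += 1
    return out


def render(tokens: List[Token]) -> str:
    """Wrap every tagged span in its opening and closing tag."""
    return " ".join(t if lab == "O" else f"<{lab}>{t}</{lab}>" for t, lab in tokens)


words = ["Shanghai", "Manpower", "Human Resources", "Co., Ltd.",
         "interviews", "on", "2011-09-22"]
first = layer_one(words)
second = layer_two(first)
tagged = render(second)
```

The place name found by the first layer is absorbed into the nested organization name by the second layer, while the date keeps its first-layer tag.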

Named entity recognition is performed on Web pages of recruitment information. For job applicants, this gives a better and more complete view of the talent market. For institutions and researchers concerned with education, such as universities and training organizations, the market feedback obtained also offers some guidance for arranging their disciplines.

Embodiment 2:

The Web named entity recognition method based on a statistical model is the same as in Embodiment 1; the named entity feature extraction in the second step (step B) of the invention is further described as follows:

(1) The structural feature vector of a Web named entity is analyzed as follows:

Because named entities in a Web page are usually displayed with emphasis, this characteristic can be taken into account during recognition. For example, a position name shown in a large red font is displayed in a way that clearly differs from other text. This characteristic of Web named entities serves mainly to emphasize important information, and it also meets the need for user-friendly browsing.

First, the display style of the Web named entities on the page is represented, forming a feature vector.

What architectural feature referred to is exactly the display styles of Web object, and single (CSS) attribute of CSS of introducing Web is described the architectural feature of Web.Through the physical training, to obtain physical properties of the structural characteristics, as shown in particular in Table 2, such as the font style including font type

Font Size

Font Style

font weight

and font color

Text Styles

, including text-decoration

first paragraph spaces

and horizontal alignment

Background Styles

including the background color

Background Image background-repeat

Background fixed

and the background positioning

Table 2 Web named entity architectural feature

Introducing structural features into the feature description of Web entities reflects more concretely how Web entities differ from traditional plain text, and provides a more effective feature description for Web named entity recognition.
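As an illustration of how such structural features might be collected, the sketch below parses an inline CSS declaration string into the three attribute groups named above. The mapping from the patent's attribute names to concrete CSS properties (for example, first-line indent to `text-indent` and fixed background to `background-attachment`) is an assumption, and real pages would also need stylesheet and inheritance handling that is not shown here.

```python
from typing import Dict, List

# Assumed mapping of the three attribute groups to standard CSS property names.
FONT_KEYS = ["font-family", "font-size", "font-style", "font-weight", "color"]
TEXT_KEYS = ["text-decoration", "text-indent", "text-align"]
BACKGROUND_KEYS = ["background-color", "background-image", "background-repeat",
                   "background-attachment", "background-position"]


def parse_style(style: str) -> Dict[str, str]:
    """Split an inline style string such as 'color: red; font-size: 18px'."""
    props = {}
    for declaration in style.split(";"):
        if ":" in declaration:
            name, value = declaration.split(":", 1)
            props[name.strip().lower()] = value.strip().lower()
    return props


def structural_features(style: str) -> Dict[str, List[str]]:
    """Return one value list per group; a missing property is an empty string."""
    props = parse_style(style)
    return {
        "font": [props.get(k, "") for k in FONT_KEYS],
        "text": [props.get(k, "") for k in TEXT_KEYS],
        "background": [props.get(k, "") for k in BACKGROUND_KEYS],
    }


features = structural_features(
    "font-size: 18px; color: red; font-weight: bold; text-align: center")
```

A position name rendered in a large red bold font would then differ from surrounding body text in the `font` group, which is the kind of signal the structural feature vector is meant to capture.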

(2) The text feature vector of a Web named entity is analyzed as follows:

Text features are in fact the contextual features of an entity object; they are defined in Table 3 and Table 4. During recognition, only the probability of a word acting as a feature word and the transition probabilities between feature words are needed. The feature tag set for each type of named entity must be chosen scientifically, according to the characteristics of the entity type and with expert knowledge, and the tag set must also be adjusted continually through experiments.

The invention considers text and structural features together, so that its features are better suited to representing Web entities and recognition of Web named entities is more accurate.

Then the text features of the Web named entities are represented, converting the text features into a finite feature vector. Let g_i = <f_i, w_i>, where F^n = (f_1, f_2, ..., f_n) is the feature sequence of the words and W^n = (w_1, w_2, ..., w_n) is the word sequence. The feature sequence uses the feature tags introduced below.

In the invention, a position name is a special, complex named entity. First, a position name often contains a word denoting a type of work, such as "engineer" or "teacher". Second, the length of a position name is not fixed: some position names run to ten or even several dozen characters, while some abbreviations have only two characters, for example "sales and after-sales service engineer" and "skilled worker"; the uncertain length and boundaries make position names harder to recognize. In addition, position names often have place names nested in them, as in "Shanghai branch manager", and such nesting also affects recognition. Recognizing position names is therefore just as complex as recognizing organization names.

The MR-GHMM model handles the Web named entity recognition process in two layers. The first layer tags simple entities; for example, "engineer", "teacher", and "Shanghai" are tagged as simple entities. When the second layer recognizes complex nested entities, these tagging results serve as its input, and complex nested entities such as "sales and after-sales service engineer" are recognized on the basis of the simple entities identified by the first layer.

In recruitment information, the industry field is usually filled in broadly; if accurate information can be extracted from the job posting itself, it can supplement that field and make the information on applicable specialties more precise.

Table 3 gives the content features of position names. In terms of internal features, a position name consists mainly of an occupation name and a job title, that is, of a general prefix of a feature word and a general suffix of a feature word. For example, in "environmental protection technician", "environmental protection" is the general prefix and "technician" is the general suffix. Other common position-name feature words include engineer, designer, manager, and staff member.

Table 3. Position name feature table

Feature tag	Meaning	Example
P_d	Following-context trigger feature	Secretary / several openings
P_u	Preceding-context trigger feature	Recruiting / financial assistant; Sincerely hiring / real estate agent; 4 / security guards
P_c	Connective trigger feature	Recruitment assistant and employee relations manager
P_e	Other trigger feature
P_s	General suffix of feature word	Financial executive, QA manager, process engineer, sales director
P_ss	Special suffix of feature word	Cashier, entertainment customer service, senior purchasing
P_p	General prefix of feature word	Embedded systems engineer, environmental protection technician, real estate agent
P_sp	Special prefix of feature word	Shanghai branch certification consultant
P_l	Feature word of position name	Manager, staff member, engineer, customer service
P_se	Abbreviation	Foreman, webmaster, nurse, office clerk
P_el	Other non-position-name component

Position names also include a special subset, formed with a special suffix or a special prefix of a feature word. For example, "purchasing" in "senior purchasing" is a special suffix, and "Shanghai branch" in "Shanghai branch certification consultant" is a special prefix. There are also abbreviations such as "foreman", "webmaster", and "nurse". In addition, other simple named entity nouns can be nested in a position name, such as "language" in "Java language development engineer".
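The pairing g_i = <f_i, w_i> of a feature tag with a word can be sketched with a small lookup. The tag names follow Table 3, but the word lists, the lower-casing, and the precedence of the checks are hypothetical; in the method itself the tags are assigned through the trained model rather than fixed dictionaries.

```python
from typing import List, Tuple

# Hypothetical word lists for three of the Table 3 tags.
TRIGGER_BEFORE = {"recruiting", "hiring"}                        # P_u
PREFIX_WORDS = {"environmental protection", "embedded systems"}  # P_p
SUFFIX_WORDS = {"technician", "engineer", "manager"}             # P_s


def tag_features(words: List[str]) -> List[Tuple[str, str]]:
    """Return the sequence of pairs (f_i, w_i) for a segmented phrase."""
    pairs = []
    for w in words:
        key = w.lower()
        if key in TRIGGER_BEFORE:
            f = "P_u"      # preceding-context trigger feature
        elif key in PREFIX_WORDS:
            f = "P_p"      # general prefix of feature word
        elif key in SUFFIX_WORDS:
            f = "P_s"      # general suffix of feature word
        else:
            f = "P_el"     # other non-position-name component
        pairs.append((f, w))
    return pairs


pairs = tag_features(["Recruiting", "environmental protection", "technician"])
```

The resulting tag sequence (P_u, P_p, P_s) is the kind of feature sequence F^n over which tag probabilities and transitions are estimated.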

For such cases, the statistical method and the rule-based method are combined, and a recognition method that combines the improved MR-GHMM with the multi-feature representation of Web named entities performs named entity recognition better. The invention uses statistical frequencies to decide whether a string is an independent entity or a component of another, complex entity, and then counts and recognizes it accordingly.

Table 4. Organization name feature table

Feature tag	Meaning	Example
O_d	Following-context trigger feature	Google / written examination notice
O_u	Preceding-context trigger feature	Attend / Microsoft Research Asia / interview time
O_c	Connective trigger feature	Xi'an Film Studio and Changchun Film Studio
O_e	Other trigger feature
O_s	General suffix of feature word	Shanghai Manpower Human Resources Co., Ltd.
O_ss	Special suffix of feature word	General Office of the Central Committee, drug inspection institute
O_p	General prefix of feature word	Shanghai Manpower Human Resources Co., Ltd.
O_sp	Special prefix of feature word	Founder Group
O_l	Feature word of organization name	Limited company, enterprise, group
O_se	Abbreviation	Peking University (abbreviated), Shanghai Jiao Tong University (abbreviated)
O_el	Other non-organization-name component

Table 4 analyzes the features of organization names. A typical organization name begins with a place name and ends with an organization marker word. For example, in "Shanghai Manpower Human Resources Co., Ltd.", "Shanghai" is the place name and the name ends with the organization marker word "Co., Ltd.". Other common organization marker words include group, bureau, department, institute, and research institute. Names of this kind can be recognized from the prefix word and a lexicon of organization marker words. Special organization names include those that begin with something other than a place name.

The model handles the Web named entity recognition process in two layers. The first layer tags simple entities, such as "group", "company", and "Shanghai". The second layer takes the tagging result as its input and recognizes complex nested entities, such as "Shanghai Manpower Human Resources Co., Ltd.", on the basis of the simple entities identified by the first layer.

In terms of the tags in Table 4, the place name that begins such an organization name ("Shanghai" in "Shanghai Manpower Human Resources Co., Ltd.") is defined as the general prefix of the feature word, and the organization marker word that ends it ("Co., Ltd.") is defined as the general suffix of the feature word. Other common organization-name feature words include group, bureau, department, institute, and research institute.

Organization names also include a special subset, formed with a special suffix or a special prefix of a feature word. For example, "General Office" in "General Office of the Central Committee" is a special suffix, and "Founder" in "Founder Group" is a special prefix, because the name begins with something other than a place name. There are also abbreviations, such as those of Peking University and Shanghai Jiao Tong University.

Training on the sample data and using MFVSM then yields the multi-feature vector library for each functional block of the Web page.

The invention represents the features of different objects with the structure and content features of Web text. A multidimensional feature vector F_{i} represents the features of an object, where F_{i} consists of a structure feature vector F_{i}^{s} and a content feature vector F_{i}^{c}, as follows:

F_{i} = [F_{i}^{c}, F_{i}^{s}]

F_{i}^{c} = [f_{1}^{c} (f_{11}^{c}, f_{12}^{c}), f_{2}^{c} (f_{21}^{c}, f_{22}^{c}), f_{3}^{c} (f_{31}^{c}, f_{32}^{c}), f_{4}^{c} (f_{41}^{c}, f_{42}^{c}), f_{5}^{c} (f_{51}^{c}, f_{52}^{c}), \ldots]

F_{i}^{s} = [f_{1}^{s} (f_{11}^{s}, f_{12}^{s}, f_{13}^{s}, f_{14}^{s}, f_{15}^{s}, f_{16}^{s}, f_{17}^{s}), f_{2}^{s} (f_{21}^{s}, f_{22}^{s}, f_{23}^{s}), f_{3}^{s} (f_{31}^{s}, f_{32}^{s}, f_{33}^{s}, f_{34}^{s}, f_{35}^{s}), f_{4}^{s}, \ldots]

Each feature vector can be represented with the vector space model (VSM), a method for representing text features. The features of a Web object can therefore be expressed as a multi-feature vector space model (MFVSM).

It is precisely because the invention adopts this multi-feature representation that the features of Web entities can be described more accurately, which makes recognition more accurate and faster.

Embodiment 3:

The Web named entity recognition method based on a statistical model is the same as in Embodiments 1-2. This embodiment experimentally compares the effect of using multiple features against a single feature with the MR-GHMM method:

The evaluation criteria for recognition are as follows:

When comparing recognition results for different entities, the invention evaluates with recall and precision, and also with a measure that considers both at once, namely the weighted harmonic mean F of recall and precision.

(1) Precision equals the number of correct answers produced by the system divided by the total number of answers produced by the system.

(2) Recall equals the number of correct answers produced by the system divided by the number of all possible answers in the text (both those the system found and those it should not have missed).

(3) F = \frac{(β^{2} + 1) PR}{β^{2} P + R}

(β is usually set to 1)
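As a quick check of criteria (1) to (3), the following Python sketch computes precision, recall, and F from raw counts. The function and argument names are illustrative and not taken from the patent; returning 0 for F when both P and R are 0 is an added convention, and the counts in the example are made up.

```python
def precision_recall_f(correct, produced, possible, beta=1.0):
    """Return (P, R, F) from answer counts, following criteria (1)-(3)."""
    p = correct / produced if produced else 0.0
    r = correct / possible if possible else 0.0
    if p == 0.0 and r == 0.0:
        return p, r, 0.0  # convention: avoid dividing by zero
    f = (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)
    return p, r, f


# Made-up counts: 8 correct answers out of 10 produced, 16 possible.
p, r, f = precision_recall_f(correct=8, produced=10, possible=16)
```

With β = 1 the measure reduces to 2PR / (P + R), so it is high only when precision and recall are both high.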

The feature selection and weight choices of the invention are illustrated experimentally below.

The first experiment compares the effect of multiple features against a single feature with the MR-GHMM method. Define α^{c} as the weight of F_{i}^{c} and α^{s} as the weight of F_{i}^{s}. The experimental results are shown in Table 5:

Table 5: F values of MR-GHMM for attribute sets with different weights

No.	Attribute weights	F (%)
1	α^{c} = 1, α^{s} = 0	66.1
2	α^{c} = 0, α^{s} = 1	67.3
3	α^{c} = 0.2, α^{s} = 0.8	72.4
4	α^{c} = 0.4, α^{s} = 0.6	81.6
5	α^{c} = 0.5, α^{s} = 0.5	78.7
6	α^{c} = 0.9, α^{s} = 0.1	76.5

The experiment shows that with weight settings 3, 4, 5, and 6, named entity recognition scores clearly higher than with settings 1 and 2. This indicates that the proposed multi-feature representation MFVSM of structure and content helps named entity recognition more than a single feature does.

The invention adopts a multi-feature representation of text and structure, which is better suited to describing the features of Web named entities and improves the precision of Web named entity recognition.

Embodiment 4:

The Web named entity recognition method based on a statistical model is the same as in Embodiments 1-3. This embodiment further describes how the MR-GHMM model is built in step 3 of the invention:

The invention adopts a unified strategy, an MR-GHMM based on feature-word tagging, to recognize all kinds of named entities. The basic idea of feature-word tagging is to define a separate feature-word tag set for each kind of named entity according to its composition and wording. The Viterbi algorithm tags the segmentation result with feature words, and simple Chinese named entities are then recognized automatically from the feature-word sequence.

The parameters of the MR-GHMM model are computed as follows:

A) N: the number of Markov chain states in the model. Denote the N states by S = {S_{1}, S_{2}, ..., S_{N}} and the state of the Markov chain at time t by q_{t}; clearly q_{t} ∈ S.

B) M: the number of possible observation values for each state. Denote the M observation values by V_{1}, ..., V_{M} and the observation at time t by o_{t}, where o_{t} ∈ {V_{1}, ..., V_{M}}.

C) π: the initial state probability vector, π = (π_{1}, ..., π_{N}), where π_{i} = P (q_{1} = S_{i}), 1 ≤ i ≤ N.

In a standard HMM, the probability of moving from the state at time t to the state at time t+1 depends only on the state at time t and not on any earlier state. Likewise, the probability of emitting an observation at time t depends only on the state at time t and not on earlier history. In practice, however, the observation output probability at any time depends not only on the system's current state but also on its state at the previous time. The invention therefore assumes that the probability of the transition from time t to time t+1 depends not only on the state at time t but also on the state at time t-1.

a_{ijk} = P (q_{t + 1} = S_{k} | q_{t} = S_{j}, q_{t - 1} = S_{i}), 1 ≤ i, j, k ≤ N - - - (1)

where a_{ijk} ≥ 0 and N is still the number of states in the model. Similarly, the probability of a feature observation vector depends not only on the system's current state but also on its state at the previous time:

b_{ij} (k_{s}) = P (o_{t, s} = V_{k} | q_{t} = S_{j}, q_{t - 1} = S_{i}), 1 ≤ i, j ≤ N, 1 ≤ k ≤ M - - - (2)

For a given model λ, the task is to find the state transition sequence Q that maximizes P (O, Q | λ).
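To make formula (1) concrete, the sketch below estimates the second-order transition probabilities a_{ijk} by relative-frequency counting over already tagged state sequences. This is an assumption made for illustration: the patent trains the model with a generalized Baum-Welch algorithm, and the function name and toy tag sequences here are hypothetical.

```python
from collections import Counter


def estimate_second_order_transitions(sequences):
    """Estimate a_ijk = P(q_{t+1}=k | q_t=j, q_{t-1}=i) by counting triples."""
    triple_counts = Counter()
    pair_counts = Counter()
    for seq in sequences:
        # Slide a window of three consecutive states over each sequence.
        for i, j, k in zip(seq, seq[1:], seq[2:]):
            triple_counts[(i, j, k)] += 1
            pair_counts[(i, j)] += 1  # counts (i, j) only where a successor exists
    return {
        (i, j, k): count / pair_counts[(i, j)]
        for (i, j, k), count in triple_counts.items()
    }


# Toy tagged sequences (hypothetical tags).
tagged = [
    ["PLACE", "O", "SUFFIX"],
    ["PLACE", "O", "O"],
    ["PLACE", "O", "SUFFIX"],
]
a = estimate_second_order_transitions(tagged)
```

For each history (i, j) the estimated values a_{ijk} are non-negative and sum to 1 over k, so they form a proper conditional distribution.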

For Web named entities, MR-GHMM must consider multiple attributes. MR-GHMM extends from a single attribute to the attribute set k_{1}, k_{2}, ..., k_{Z} and combines these attributes linearly, where α_{s} is the weight coefficient of attribute k_{s} and 0 ≤ α_{s} ≤ 1.

P (O, Q | λ) = π_{q_{1}} b_{q_{1}} (o_{1, s}) a_{q_{1} q_{2}} [Σ_{s = 1}^{Z} α_{s} \cdot b_{q_{1} q_{2}}^{s} (o_{2, s})] Π_{t = 3}^{T} (a_{q_{t - 2} q_{t - 1} q_{t}} [Σ_{s = 1}^{Z} α_{s} \cdot b_{q_{t - 1} q_{t}}^{s} (o_{t, s})]) - - - (3)
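The bracketed term in formula (3), the weighted sum of per-attribute emission probabilities, can be sketched as follows. The dictionary layout, the attribute names "content" and "structure", and the probability values are assumptions made for the example; the weights 0.4 and 0.6 are setting 4 from Table 5.

```python
def mixed_emission(alphas, emissions, prev_state, state, observation):
    """Return sum over attributes s of alpha_s * b^s_{prev_state,state}(o_{t,s})."""
    total = 0.0
    for attr, alpha in alphas.items():
        table = emissions[attr]
        # Unseen (prev_state, state, value) triples contribute probability 0.
        total += alpha * table.get((prev_state, state, observation[attr]), 0.0)
    return total


alphas = {"content": 0.4, "structure": 0.6}
emissions = {
    "content": {("PLACE", "ORG_SUFFIX", "Group"): 0.5},
    "structure": {("PLACE", "ORG_SUFFIX", "bold"): 0.25},
}
obs = {"content": "Group", "structure": "bold"}
value = mixed_emission(alphas, emissions, "PLACE", "ORG_SUFFIX", obs)
```

Setting one weight to 1 and the other to 0 recovers the single-attribute case, which corresponds to settings 1 and 2 in Table 5.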

HMM is a statistical model widely used in natural language processing. Extending it to the generalized HMM (GHMM) lets it take more feature representations into account, which makes it better suited to Chinese named entity recognition, especially the recognition of complex multi-feature entities such as person names, place names, and organization names.

Embodiment 5:

The Web named entity recognition method based on a statistical model is the same as in Embodiments 1-4. This embodiment describes in detail how step 4 of the invention recognizes Web named entities by tagging each named entity.

Tagging Chinese named entities resembles a simple part-of-speech tagging process. The invention uses the Viterbi algorithm to tag feature words automatically, that is, it selects from all possible tag sequences the one with maximum probability as the final tagging result. The theory and derivation are as follows. Let G^{n} = (g_{1}, g_{2}, ..., g_{m}) be the token sequence after word segmentation (the segmentation result before unknown-word recognition), and let T^{n} = (t_{1}, t_{2}, ..., t_{m}) be a possible feature-word tag sequence for G^{n}. n = 0 denotes the low-layer MR-GHMM and n = 1 the high-layer MR-GHMM. T^{#} is the final tagging result, i.e. the feature-word sequence with maximum probability.

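According to Bayes' formula,

P (T^{n} | G^{n}) = \frac{P (T^{n}, G^{n})}{P (G^{n})} = \frac{P (T^{n}) P (G^{n} | T^{n})}{P (G^{n})}

Since P (G^{n}) does not depend on T^{n}, it can be dropped when maximizing over T^{n}, which gives, up to an additive constant:

\log P (T^{n} | G^{n}) = \log P (T^{n}) + \log P (G^{n} | T^{n}) - - - (1)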

T^{#} = \underset{T}{\arg \max} \log P (T^{n} | G^{n}) = \underset{T}{\arg \max} (\log P (T^{n}) + \log P (G^{n} | T^{n})) - - - (2)

Assuming the conditional probabilities are independent:

P (G^{n} | T^{n}) = Π_{i = 1}^{n} P (g_{i} | t_{i})

Substituting this into formula (2) gives:

T^{#} = \underset{T}{\arg \max} \log P (T^{n} | G^{n}) = \underset{T}{\arg \max} (Σ_{i = 1}^{n} \log P (g_{i} | t_{i}) +logP (T^{n})) - - - (3)

Mutual information is then used in place of the conditional-independence assumption. Assuming the mutual information terms are independent of one another gives formula (4), and expanding it gives formula (5):

MI (T^{n}, G^{n}) = Σ_{i = 1}^{n} MI (t_{i}, G^{n}) - - - (4)

\log \frac{P (T^{n}, G^{n})}{P (T^{n}) \cdot P (G^{n})} = Σ_{i = 1}^{n} \log \frac{P (t_{i}, G^{n})}{P (t_{i}) \cdot P (G^{n})} - - - (5)

Or, written another way:

\log P (T^{n} | G^{n}) - \log P (T^{n}) = Σ_{i = 1}^{n} \log P (t_{i} | G^{n}) - Σ_{i = 1}^{n} \log P (t_{i}) - - - (6)

This gives:

\log P (T^{n} | G^{n}) = \log P (T^{n}) - Σ_{i = 1}^{n} \log P (t_{i}) + Σ_{i = 1}^{n} \log P (t_{i} | G^{n}) - - - (7)

So the task is to find:

T^{#} = \underset{T}{\arg \max} \log P (T^{n} | G^{n}) = \underset{T}{\arg \max} (\log P (T^{n}) - Σ_{i = 1}^{n} \log P (t_{i}) + Σ_{i = 1}^{n} \log P (t_{i} | G^{n})) - - - (8)

The first part of this expression can be computed with the chain rule; in an n-gram model, each tag is assumed to depend on the preceding N-1 tags. The second part is the sum of the log probabilities of all individual tags. The third part depends on the observation sequence. To avoid floating-point underflow and zero probabilities, the formula uses logarithms and a smoothing algorithm, which also speeds up computation. The solution of each part is described below.

To compute P (T^{n}) in formula (8), the invention uses a probabilistic n-gram language model. By the chain rule, the probability of a sentence T^{n} = (t_{1}, t_{2}, ..., t_{m}) is:

P (T^{n}) = P (t_{1}) Π_{i = 2}^{m} P (t_{i} | t_{1}, t_{2}, . . ., t_{i - 1}) - - - (9)

In practice, data sparsity makes it impossible to compute with this formula directly. Take job titles as an example: in "sales coordination assistant manager", "assistant manager" is the key feature word, and the two preceding nouns "sales" and "coordination" modify "assistant manager", together forming one job title. A feasible scheme is therefore to assume that P (t_{i} | t_{1}, t_{2}, ..., t_{i - 1}) depends only on a limited window of preceding words, i.e. (t_{i - N + 1}, t_{i - N + 2}, ..., t_{i - 1}). Specifically, N can be 0 (context-free grammar), 1, 2, or 3. Given the features of the entities to be extracted, a trigram model is used.
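The truncated chain rule can be sketched as a trigram log-probability in Python. The padding symbol "<s>" for the first two positions and the callable `cond_prob(tag, prev2, prev1)` are assumptions of this sketch; the patent does not specify how the sequence start is handled.

```python
import math


def trigram_log_prob(tags, cond_prob):
    """Sum log P(t_i | t_{i-2}, t_{i-1}) over the sequence, padding the start."""
    padded = ["<s>", "<s>"] + list(tags)  # hypothetical start padding
    total = 0.0
    for i in range(2, len(padded)):
        total += math.log(cond_prob(padded[i], padded[i - 2], padded[i - 1]))
    return total


# Toy model: every tag has probability 0.5 given any history.
log_p = trigram_log_prob(["sales", "coordination", "assistant manager"],
                         lambda tag, prev2, prev1: 0.5)
```

Summing logarithms instead of multiplying probabilities keeps long sequences from underflowing, which is the reason the derivation works in log space.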

To compute this conditional probability, the invention uses an improved back-off model.

The improved back-off model is expressed as follows:

P_{bo} (h | h^{'}, g) = \begin{cases} P_{GT} (h | h^{'}, g) & \text{if } C (h, h^{'}, g) > 0 \\ α (h^{'}, g) P_{bo} (h | h^{'}) & \text{otherwise} \end{cases} - - - (10)

P_{bo} (h | h^{'}) is the probability formula of the trigram language model, where h = t_{i - n + 1} ... t_{i - 1} and h^{'} = t_{i - n + 2} ... t_{i - 1}. P_{GT} (h | h^{'}, g) is the smoothed estimate of the language model. Many possible word sequences may never have been collected in the training corpus. If P_{bo} (h | h^{'}, g) were 0 for such a sequence, the probability of the whole sentence would also be 0, so the language model must be smoothed. Smoothing takes a small share of the probability mass of seen events and distributes it evenly over unseen events, i.e. events whose count is 0. The Good-Turing smoothing algorithm is used.

P_{GT} (h | h^{'}, g) = \frac{C_{GT} (h, h^{'}, g)}{C (h^{'}, g)} - - - (11)

C_{GT} (h, h^{'}, g) = (C (h, h^{'}, g) + 1) \times \frac{N (C (h, h^{'}, g) + 1)}{N (C (h, h^{'}, g))} - - - (12)

α (h^{'}, g) = \frac{β (h^{'}, g)}{\underset{h : C (h, h^{'}, g) = 0}{Σ} p (h | h^{'})} = \frac{β (h^{'}, g)}{1 - \underset{h : C (h, h^{'}, g) > 0}{Σ} p (h | h^{'})} - - - (13)

β (h^{'}, g) = 1 - \underset{h : C (h, h^{'}, g) > 0}{Σ} P_{GT} (h | h^{'}, g) - - - (14)
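Formula (12), the Good-Turing adjusted count, can be illustrated with a short Python sketch, where N(c) is the number of distinct n-grams seen exactly c times. Falling back to the raw count when N(c + 1) is 0 is an assumption of this sketch (the patent does not say how that case is handled), and the toy counts are made up.

```python
from collections import Counter


def good_turing_adjusted_counts(ngram_counts):
    """Return c* = (c + 1) * N(c + 1) / N(c) for every observed n-gram."""
    freq_of_freq = Counter(ngram_counts.values())  # N(c) for each count c
    adjusted = {}
    for ngram, c in ngram_counts.items():
        n_c = freq_of_freq[c]
        n_c_plus_1 = freq_of_freq.get(c + 1, 0)
        if n_c_plus_1:
            adjusted[ngram] = (c + 1) * n_c_plus_1 / n_c
        else:
            adjusted[ngram] = float(c)  # assumed fallback: keep the raw count
    return adjusted


# Toy counts: three n-grams seen once, two seen twice, one seen three times.
counts = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 3}
adjusted = good_turing_adjusted_counts(counts)
```

In this toy data the adjusted counts of the n-grams seen twice fall below their raw counts, and the probability mass removed in this way is what the back-off weight α redistributes to unseen events.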

The invention uses the Viterbi algorithm to tag the segmentation result with feature words (a process similar to simple part-of-speech tagging), performs simple pattern recognition on the resulting feature-word sequence, and finally achieves automatic recognition of Chinese named entities.
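For readers unfamiliar with Viterbi decoding, the following is a minimal log-space sketch for a plain first-order HMM. It is a deliberate simplification: the patent's MR-GHMM uses second-order transitions and mixes several attributes, which this sketch does not do. The states, probabilities, and dictionary layout are hypothetical.

```python
import math


def viterbi(obs, states, log_pi, log_a, log_b):
    """Return the most probable state sequence for obs (first-order HMM)."""
    # delta[s]: best log score of any path that ends in state s
    delta = {s: log_pi[s] + log_b[s][obs[0]] for s in states}
    backpointers = []
    for o in obs[1:]:
        prev = delta
        delta, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + log_a[r][s])
            delta[s] = prev[best] + log_a[best][s] + log_b[s][o]
            ptr[s] = best
        backpointers.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: delta[s])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


def to_log(table):
    """Convert a nested probability table to log probabilities."""
    return {k: {j: math.log(v) for j, v in row.items()} for k, row in table.items()}


states = ["PLACE", "SUFFIX"]
log_pi = {"PLACE": math.log(0.9), "SUFFIX": math.log(0.1)}
log_a = to_log({"PLACE": {"PLACE": 0.2, "SUFFIX": 0.8},
                "SUFFIX": {"PLACE": 0.5, "SUFFIX": 0.5}})
log_b = to_log({"PLACE": {"Shanghai": 0.9, "Group": 0.1},
                "SUFFIX": {"Shanghai": 0.1, "Group": 0.9}})
path = viterbi(["Shanghai", "Group"], states, log_pi, log_a, log_b)
```

Extending this to the second-order case would mean tracking pairs of consecutive states instead of single states, at the cost of a larger search table.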

Embodiment 6:

The Web named entity recognition method based on a statistical model is the same as in Embodiments 1-5. The invention combines statistical and rule-based methods, and the named entity recognition experiments pair the improved MR-GHMM with the multi-feature representation of Web named entities. The experiments target one specific kind of Web information extraction, recruitment information, and mainly extract job title, organization name (company name), place name, and time entities.

For recruitment Web pages covering six occupations, the MR-GHMM model proposed by the invention is used to recognize Web named entities. Recall, precision, and their weighted harmonic mean F again serve as the evaluation criteria, and the results are shown in Tables 6, 7, and 8 below.

Table 6: Precision of named entity recognition based on MR-GHMM

Table 7: Recall of named entity recognition based on MR-GHMM

Table 8: F values of named entity recognition based on MR-GHMM

Tables 6 and 7 give the precision and recall of the invention on Web named entity recognition, and Table 8 gives the corresponding weighted F values. Comparing across entity types, the MR-GHMM model recognizes relatively simple attributes such as "place" and "time" with accuracy close to 100%, but its recognition rate for "job title" and "organization name" is lower. For these entity types the contextual feature words are more complex, and nesting and abbreviation occur, so recognition accuracy is not as high. Comparing across page categories, the MR-GHMM model recognizes objects in "biomedicine" and "computer" pages better than in other categories, because the technical terms of these two categories are more standardized and therefore easier to recognize. For place names, the common place names are generally all present in the feature table after training, so extraction accuracy is relatively high. For job titles and organization names, however, some organizations are usually named by abbreviations, so the feature table obtained in training may not contain organization names that appear in testing, and performance is lower.

The following experiment compares P, R, and F for three methods, HMM, CRFs, and MR-GHMM, as shown in Table 9. HMM and CRFs were each tested, and their recognition results were compared with those of the algorithm proposed by the invention.

Table 9: Comparison of the extraction results of the three extraction methods

The results show that for the two entity types "job title" and "organization name", the accuracy of the invention is clearly higher than that of the other two models. The MR-GHMM model considers both the structural features of Web named entities and their content and semantic features, so it is better suited to recognizing named entities in Web pages. For "place" and "time" entities, the MR-GHMM model improves little over the other two models: after training, common place names are generally all present in the feature table, so the extraction accuracies of the models differ little. For nested entities such as "job title" and "organization name", however, the advantage of the invention is more evident, because it takes the specific nesting characteristics of complex entities into account and recognizes them with a multi-layer model.

Fig. 2 shows the average recall, precision, and weighted harmonic mean F of the improved MR-GHMM over the three attributes:

In Fig. 2 the horizontal axis is the size of the training sample set and the vertical axis is the F value of MR-GHMM. The curves show that, using 55% of the training samples, once the number of training samples exceeds 370 the recall R and precision P remain essentially unchanged. When processing larger data sets it is therefore unnecessary to train on a large number of samples. Reducing the number of training instances reduces the total number of state transitions and the number of symbols in the input symbol sequence, which improves the operating efficiency of the system.

By modifying the entity features for different domains through the multi-feature representation of Web named entities, the invention can be applied to recognize named entities of different domains on the Web. For example, representing product entities supports information acquisition from pages about product quotations and product sales, and representing medicine entities supports information acquisition from pages about disease treatment and drug use.

The invention is a Web named entity recognition method based on a statistical model. It represents Web named entities with multiple features drawn from structure and text; combines statistical and rule-based methods and adopts the improved MR-GHMM to make training more efficient; tags each named entity with the improved generalized hidden Markov model to achieve entity recognition; and treats complex Web named entity recognition as a two-layer process in which the tagging result of the first layer is the input to the second layer, which recognizes complex nested entities. Compared with the original recognition algorithm, recognition accuracy is higher and the time complexity of model training is greatly reduced. By modifying the entity features for different domains through the multi-feature representation, the method can recognize named entities of different domains on the Web.

Claims

1. A Web named entity recognition method based on a statistical model, characterized in that the method comprises the following steps:

C. building the MR-GHMM model with a probabilistic statistical algorithm, and computing the model's initial state probabilities, state transition probabilities, and state emission probabilities with the generalized Baum-Welch algorithm, i.e. solving the learning problem of MR-GHMM;

2. The Web named entity recognition method based on a statistical model according to claim 1, characterized in that the named entity feature extraction in said step 1.2 comprises the following steps:

realizes the feature extraction of named entity.

3. The Web named entity recognition method based on a statistical model according to claim 1 or 2, characterized in that building the MR-GHMM model in said step 1.3 comprises the following steps:

C1. computing the parameters of the MR-GHMM model;

4. The Web named entity recognition method based on a statistical model according to claim 3, characterized in that the Web named entity recognition in said step 1.4 comprises the following steps:

T^{#} = \underset{T}{\arg \max} \log P (T^{n} | G^{n}) = \underset{T}{\arg \max} (\log P (T^{n}) - Σ_{i = 1}^{n} \log P (t_{i}) + Σ_{i = 1}^{n} \log P (t_{i} | G^{n}))

D3. computing method for

P_{bo} (h | h^{'}, g) = \begin{cases} P_{GT} (h | h^{'}, g) & \text{if } C (h, h^{'}, g) > 0 \\ α (h^{'}, g) P_{bo} (h | h^{'}) & \text{otherwise} \end{cases}

where P_{bo} (h | h^{'}) is the probability formula of the trigram language model, h = t_{i - n + 1} ... t_{i - 1}, and h^{'} = t_{i - n + 2} ... t_{i - 1}