CN102314417A - Method for identifying Web named entity based on statistical model - Google Patents

Info

Publication number
CN102314417A
CN102314417A CN201110284429A CN201110284429A CN102314417A CN 102314417 A CN102314417 A CN 102314417A CN 201110284429 A CN201110284429 A CN 201110284429A CN 201110284429 A CN201110284429 A CN 201110284429A CN 102314417 A CN102314417 A CN 102314417A
Authority
CN
China
Prior art keywords
named entity
web
model
entity
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201110284429A
Other languages
Chinese (zh)
Inventor
王静
刘志镜
曲建铭
王燕
贺文华
王炜华
王纵虎
陈东辉
姚勇
朱旭东
赵辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN201110284429A
Publication of CN102314417A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a method for identifying a Web named entity based on a statistical model. The method comprises the following steps of: representing multiple characteristics of the Web named entity with structure and text characteristics; combining a statistical method with a rule method and adopting an improved MR-GHMM (MR-Generalized Hidden Markov Model) to increase the training efficiency; marking the entity with the improved GHMM, and marking each named entity to realize entity identification; and processing a Web complex named entity identifying process on two layers and performing complex nested entity identification by taking a marking result of a first layer as the input of second layer processing. Compared with an original identifying algorithm, the method has the advantages that: the identifying accuracy of an algorithm used in the method is increased, and the time complexity of model training is lowered greatly. By representing multiple characteristics of the Web named entity and modifying entity characteristics in different fields, named entities in different fields on Web can be identified.

Description

Web named entity recognition method based on statistical model
Technical field
The invention belongs to the field of natural language processing, relates mainly to Web information extraction, and relates in particular to Web named entity recognition. Specifically, it is a Web named entity recognition method based on a statistical model, used mainly to identify named entities on the Web so that web page information can be acquired and preprocessed.
Background technology
Web named entity recognition targets the information in Web pages and acquires basic data from it. The acquired data make it possible to identify the content of a page, and named entity recognition underpins many downstream applications such as information extraction, automatic question answering, and translation, which makes it a fundamental task in natural language processing. With network technology developing rapidly and being applied widely in every field, research on it is extremely important. In general, named entity recognition takes one or more texts to be processed and identifies the named entities that occur in them, such as person names, place names, organization names, dates and times, and numbers.
At present, English named entity recognition achieves good results. Research and development has concentrated on machine learning methods, including hidden Markov models, maximum entropy models, and support vector machines, and some systems are in practical use. At the Seventh Message Understanding Conference (MUC-7), the best English named entity recognition system reached 95% recall and 92% precision. By comparison, Chinese named entity recognition performs far worse. At the Second Multilingual Entity Task evaluation (MET-2), the best Chinese named entity recognition system achieved precision of 66%, 89%, and 89% on person names, place names, and organization names respectively, with recall of 92%, 91%, and 88%.
Methodologically, Chinese named entity recognition currently relies mainly on two approaches: rule-based and statistical. Rule-based methods generally recognize named entities through triggers such as feature characters or feature words. Statistical methods analyze named entities and their contexts in a large-scale corpus and build a statistical model to perform recognition.
Early Chinese named entity recognition models comprised several sub-models, each handling one type of entity; for example, person names might be recognized with a rule-based method, while place names and organization names might be recognized with a statistical method. Statistical models in use include hidden Markov models, probabilistic context-free grammars, decision-tree-based language models, maximum entropy language models, and conditional random field models. Various improved models followed, which handle different entity types within a single unified model.
Traditional recognition methods do not take into account that entities on the Web exhibit structural display features, so their feature representation of Web entities is incomplete. In addition, traditional methods build a separate model for each entity type, which leaves them unable to decide whether a given string is a standalone entity or a component of some other complex entity. Building several models also greatly increases the time complexity of recognition. Finally, traditional methods need a large amount of text data for training, so the model depends too heavily on the size of the training set, and existing named entity recognition models spend too much time on training samples.
In Chinese named entity recognition, simple entities are currently recognized fairly well, but for complex entities, and nested complex entities in particular, recognition efficiency and accuracy are low.
A search of domestic and foreign patent documents and published journal articles by the project team found no reports or documents closely related to, or the same as, the present invention.
Summary of the invention
The present invention is a named entity recognition method based on a statistical model. It preprocesses Web documents and so provides a basis for subsequent information extraction, machine translation, and question answering systems. The invention targets named entities on the Web and uses a statistical model to recognize them. The main problem it addresses is that existing recognition of Chinese named entities on the Web, and of complex entities in particular, is not accurate enough.
The present invention is described in detail below.
The present invention is a Web named entity recognition method based on a statistical model, characterized in that the method comprises the following steps:
A. Preprocess the raw corpus of Web text by word segmentation, and map the original text onto an abstract symbol set, producing a symbolic description of the text in preparation for the subsequent machine learning;
B. Build structural feature and text feature representations for named entities, build a feature library of named entities, and use the MFVSM multi-feature vector representation to extract features for each named entity on the Web page;
C. Apply probabilistic statistical algorithms to build the MR-GHMM model, using a generalized Baum-Welch algorithm to compute the initial state probabilities, state transition probabilities, and state emission probabilities, which solves the learning problem of MR-GHMM;
D. Taking the multiple features of Web named entities into account, introduce an improved back-off model into the GHMM computation, use the Viterbi algorithm to select from all possible tag sequences the one with maximum probability as the final tagging result, and tag each named entity, achieving Web named entity recognition suited to multiple features;
E. The MR-GHMM model handles the Web named entity recognition process in two layers. The first layer tags simple entities. The second layer recognizes complex nested entities: MR-GHMM computes their transition probabilities, the tagging result of the first layer serves as the input to the second layer, and complex nested entities are recognized on the basis of the simple entities identified by the first layer.
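Step A can be illustrated with a minimal sketch. The source says only that the text is segmented and mapped onto an abstract symbol set, so forward maximum matching is an assumed segmentation strategy here, and the dictionary, the symbol table, and the function names `segment` and `to_symbols` are all hypothetical.

```python
from typing import Dict, List, Set


def segment(text: str, dictionary: Set[str], max_len: int = 6) -> List[str]:
    """Forward maximum matching (an assumed strategy, not stated in the source):
    at each position take the longest dictionary word, else a single character."""
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: one character
        for size in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                match = candidate
                break
        tokens.append(match)
        i += len(match)
    return tokens


def to_symbols(tokens: List[str], symbol_table: Dict[str, str],
               default: str = "W") -> List[str]:
    """Map each token onto an abstract symbol; unknown tokens get `default`."""
    return [symbol_table.get(tok, default) for tok in tokens]


# Hypothetical toy dictionary and symbol table.
DICTIONARY = {"上海", "分公司", "经理", "招聘"}  # Shanghai, branch, manager, recruiting
SYMBOLS = {"上海": "LOC", "经理": "P_l", "招聘": "P_u"}

tokens = segment("招聘上海分公司经理", DICTIONARY)
symbols = to_symbols(tokens, SYMBOLS)
print(tokens)
print(symbols)
```

The symbol sequence, rather than the raw text, is what a later learning step would consume.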
Existing methods generally describe entities with text features alone, and build models only for individual entity types. Existing named entity recognition models also spend too much time on training samples. The present invention improves the maximum-probability solution of the GHMM statistical model and thereby makes training more efficient. In addition, to suit the characteristics of Web entities, it represents Web named entities with multiple features, combining the structural features and text features of an entity, which improves recognition accuracy.
A further realization of the present invention is that the named entity feature extraction in step B comprises the following steps:
B1. First, represent the display style of the Web named entities on the web page, forming the structural feature vector [formula image BSA00000579401700031];
B2. Next, represent the text features of the Web named entities on the web page, converting the text features into a finite feature vector [formula image BSA00000579401700032];
B3. Train on sample data and use MFVSM to form the multi-feature vector representation of each named entity on the Web page, which completes the feature extraction of the named entities.
The present invention combines the structural features and text features of Web text into a multi-feature representation of entities, which expresses the characteristics of entities in Web text more completely and lays a solid foundation for subsequent entity recognition.
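The exact form of the MFVSM vector is lost with the formula images, so the following is only a sketch of the general idea in steps B1 to B3: build one vector from display-style features and one from text features, then join them. Concatenation, the binary "differs from the page baseline" encoding, and every name below are assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical subsets; the source lists more CSS properties and feature tags.
STRUCT_KEYS = ["font-size", "font-weight", "color"]
TEXT_TAGS = ["P_u", "P_p", "P_s"]


def structural_vector(css: Dict[str, str], baseline: Dict[str, str]) -> List[int]:
    """1 where the entity's CSS value differs from the page baseline, else 0."""
    return [int(css.get(k, baseline[k]) != baseline[k]) for k in STRUCT_KEYS]


def text_vector(context_tags: List[str]) -> List[int]:
    """Count how often each feature-word tag occurs in the entity's context."""
    return [context_tags.count(t) for t in TEXT_TAGS]


def multi_feature_vector(css: Dict[str, str], baseline: Dict[str, str],
                         context_tags: List[str]) -> List[int]:
    """Assumed combination rule: structural part followed by text part."""
    return structural_vector(css, baseline) + text_vector(context_tags)


baseline = {"font-size": "12px", "font-weight": "normal", "color": "black"}
entity_css = {"font-size": "18px", "color": "red"}  # large red font, as in the position-name example
vec = multi_feature_vector(entity_css, baseline, ["P_u", "P_p", "P_s"])
print(vec)
```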
A further realization of the present invention is that building the MR-GHMM model in step C comprises the following steps:
C1. Compute the parameters of the MR-GHMM model;
C2. Using the feature representations established in the feature library, train on the raw corpus to obtain the transition probabilities of named entities, and from them the probability P of the model;
C3. For a given model λ, find the state transition sequence Q that maximizes P(O, Q | λ).
The hidden Markov model (HMM) is a statistical model widely used in natural language processing. Extending it to the generalized HMM (GHMM) lets it take more features into account, which makes it better suited to recognizing complex multi-feature entities in Chinese named entity recognition, such as person names, place names, and organization names.
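The decoding problem in C3, finding the state sequence Q that maximizes P(O, Q | λ), is the classic Viterbi problem. The sketch below solves it for a plain first-order HMM in log space; it does not model the GHMM or MR-GHMM extensions, and the states, observations, and probabilities are hypothetical toy values.

```python
import math
from typing import Dict, List, Tuple


def _log(p: float) -> float:
    """Log that tolerates zero probabilities."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(obs: List[str], states: List[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]]) -> Tuple[float, List[str]]:
    """Return (log P(O, Q* | lambda), Q*) for a plain HMM lambda."""
    best = [{s: _log(start_p[s]) + _log(emit_p[s].get(obs[0], 0.0)) for s in states}]
    back: List[Dict[str, str]] = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # Best predecessor state for s at time t.
            prev = max(states, key=lambda r: best[t - 1][r] + _log(trans_p[r][s]))
            best[t][s] = (best[t - 1][prev] + _log(trans_p[prev][s])
                          + _log(emit_p[s].get(obs[t], 0.0)))
            back[t][s] = prev
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):  # follow back-pointers
        path.append(back[t][path[-1]])
    path.reverse()
    return best[-1][last], path


# Hypothetical toy model: LOC = place name, O = other.
states = ["LOC", "O"]
start_p = {"LOC": 0.5, "O": 0.5}
trans_p = {"LOC": {"LOC": 0.2, "O": 0.8}, "O": {"LOC": 0.3, "O": 0.7}}
emit_p = {"LOC": {"Shanghai": 0.9, "company": 0.1},
          "O": {"Shanghai": 0.1, "company": 0.9}}

score, path = viterbi(["Shanghai", "company"], states, start_p, trans_p, emit_p)
print(path, math.exp(score))
```

Working in log space avoids numerical underflow on long sequences, which matters once the observation sequence is a whole web page rather than two tokens.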
A further realization of the present invention is that the Web named entity recognition in step D comprises the following steps:
D1. Tag feature words automatically with the Viterbi algorithm, that is, select from all possible tag sequences the one with maximum probability as the final tagging result;
D2. Compute $P(T^n)$ with a probabilistic n-gram language model, which gives the probability of a sentence $T^n = (t_1, t_2, \ldots, t_n)$:
$$T^{\#} = \arg\max_{T} \log P(T^n \mid G^n) = \arg\max_{T} \left( \log P(T^n) - \sum_{i=1}^{n} \log P(t_i) + \sum_{i=1}^{n} \log P(t_i \mid G^n) \right)$$
where $T^n = (t_1, t_2, \ldots, t_n)$ is one possible feature-word tag sequence for $G^n = (g_1, g_2, \ldots, g_n)$;
D3. Compute [formula image BSA00000579401700042] with an improved back-off model, expressed as follows:
$$P_{bo}(h \mid h', g) = \begin{cases} P_{GT}(h \mid h', g) & \text{if } C(h, h', g) > 0 \\ \alpha(h', g)\, P_{bo}(h \mid h') & \text{otherwise} \end{cases}$$
where $P_{bo}(h \mid h')$ is the probability formula of the trigram language model, $h = t_{i-n+1} \ldots t_{i-1}$, and $h' = t_{i-n+2} \ldots t_{i-1}$.
By exploiting the multiple features of Web named entities, the present invention introduces the GHMM model to build a probabilistic model of Web named entities. During recognition, it introduces an improved back-off model to reduce the computational complexity of the model, which improves the efficiency of Web named entity recognition.
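The back-off idea in D3 can be sketched on a smaller scale. The source does not define $P_{GT}$ (presumably a Good-Turing discounted estimate) or $\alpha$, so this sketch swaps in absolute discounting and backs off from a tag bigram to a tag unigram; the class name, discount value, and training sequences are hypothetical.

```python
from collections import Counter
from typing import List


class BackoffBigram:
    """Bigram tag model that backs off to unigrams for unseen pairs.

    Uses absolute discounting, not the Good-Turing estimate the source
    appears to intend; the back-off weight is chosen so probabilities sum to 1.
    """

    def __init__(self, tag_sequences: List[List[str]], discount: float = 0.5):
        self.d = discount
        self.uni = Counter()  # C(t)
        self.bi = Counter()   # C(prev, t)
        self.ctx = Counter()  # C(prev) counted as a bigram context
        for seq in tag_sequences:
            self.uni.update(seq)
            for prev, cur in zip(seq, seq[1:]):
                self.bi[(prev, cur)] += 1
                self.ctx[prev] += 1
        self.total = sum(self.uni.values())

    def p_uni(self, t: str) -> float:
        return self.uni[t] / self.total

    def prob(self, t: str, prev: str) -> float:
        if self.ctx[prev] == 0:            # unseen context: plain unigram
            return self.p_uni(t)
        if self.bi[(prev, t)] > 0:         # seen pair: discounted estimate
            return (self.bi[(prev, t)] - self.d) / self.ctx[prev]
        seen = [u for u in self.uni if self.bi[(prev, u)] > 0]
        reserved = self.d * len(seen) / self.ctx[prev]   # mass freed by discounting
        unseen_mass = 1.0 - sum(self.p_uni(u) for u in seen)
        if unseen_mass <= 0:
            return 0.0
        return reserved * self.p_uni(t) / unseen_mass    # alpha(prev) * P(t)


model = BackoffBigram([["P_p", "P_s"], ["P_p", "P_s"], ["P_p", "P_l"], ["P_u", "P_p"]])
print(model.prob("P_s", "P_p"))  # seen pair
print(model.prob("P_u", "P_p"))  # unseen pair, backed off
```

The point of backing off is that a tag pair never observed in training still receives a small non-zero probability, so the Viterbi search is not blocked by sparse counts.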
A further realization of the present invention is that the MR-GHMM model handles the Web named entity recognition process in two layers. The first layer tags simple entities. The second layer recognizes complex nested entities: MR-GHMM computes their transition probabilities, the tagging result of the first layer serves as the input to the second layer, and complex nested entities are recognized on the basis of the simple entities identified by the first layer.
The present invention combines statistical and rule-based methods: it performs named entity recognition with a method that joins the improved MR-GHMM to the multi-feature representation of Web named entities. The model is split into two layers so that complex named entities can be recognized.
Overall, the multi-feature representation of Web named entities and the improved statistical model proposed by the present invention achieve better recognition of Web named entities. Because the proposed method overcomes several shortcomings of traditional methods, it recognizes Web named entities better and also improves recognition efficiency.
Compared with the prior art, the present invention has the following advantages:
(1) Traditional methods ignore the structural features of the Web when representing features. The present invention considers the structural features of the Web together with text features, describing Web entities more completely and thereby improving the accuracy of Web named entity recognition.
(2) Traditional methods model each entity type separately. The present invention combines a multi-layer hidden Markov model with multi-entity feature tables, places the recognition of different kinds of named entities under a unified framework, and uses a two-layer model to recognize complex named entities, so recognition is more efficient.
(3) During modeling, the present invention introduces an improved back-off model to reduce the computational complexity of the model, and represents named entities by combining the text and structural features of the entity, which improves the efficiency of Web named entity recognition.
(4) The present invention combines statistical and rule-based methods, joining the improved MR-GHMM to the multi-feature representation of Web named entities, and so recognizes named entities better.
Description of drawings:
Fig. 1 is a flow chart of named entity recognition using MR-GHMM in the present invention;
Fig. 2 is a graph of named entity extraction performance (F value) for different numbers of training samples.
Embodiment:
The present invention is described in detail below with reference to the accompanying drawings:
Embodiment 1:
The present invention is a named entity recognition method based on a statistical model. It preprocesses Web documents on web pages and so provides a basis for subsequent information extraction, machine translation, and question answering systems.
Taking a recruitment website as an example, the present invention applies a statistical model to recognize named entities in recruitment information on the Web. The named entities in recruitment information are mainly of four types: place, time, organization, and position. The experimental recognition flow is shown in Fig. 1. The experimental data in this example (Table 1) come from pages of Zhaopin.com; web pages in six recruitment categories were selected: computer, biomedicine, construction, environmental protection, machinery and chemical engineering, and secretarial. From these pages, the entities extracted are the position name, the recruiting organization name, the work location, and the recruitment time. Entities are recognized with the improved MR-GHMM recognition method.
Table 1: Experimental sample set
To handle the nesting found in position names and organization names, the present invention first identifies simple named entities on the basis of the word segmentation result, then passes the recognition result of the lower-level MR-GHMM to the higher-level MR-GHMM to recognize nested named entities. In this process, the invention adopts Chinese named entity recognition based on multiple feature tables. The whole recognition flow is shown in Fig. 1.
The recognition method of the present invention comprises the following steps:
A. Preprocess the raw corpus of Web text by word segmentation: segment it according to a basic dictionary, apply symbolic preprocessing to the raw corpus, and map the original text onto an abstract symbol set, producing a symbolic description of the text in preparation for the subsequent machine learning.
B. Build structural feature and text feature representations for named entities, build a feature library of named entities, and use the MFVSM multi-feature vector representation to extract features for each named entity on the Web page.
B1. First, represent the display style of the Web named entities on the web page, forming the structural feature vector [formula image BSA00000579401700061], which gives the structural features of the Web named entity.
B2. Next, represent the text features of the Web named entities on the web page, converting the text features into a finite feature vector [formula image BSA00000579401700062], which gives the text features of the Web named entity.
B3. Train on sample data and use MFVSM to form the multi-feature vector representation of each named entity on the Web page, [formula image BSA00000579401700063], which completes the feature extraction of the named entities.
C. Apply probabilistic statistical algorithms to build the MR-GHMM model, using a generalized Baum-Welch algorithm to compute the initial state probabilities, state transition probabilities, and state emission probabilities, which solves the learning problem of MR-GHMM;
C1. Compute the parameters of the MR-GHMM model;
C2. Using the feature representations established in the feature library, train on the raw corpus to obtain the transition probabilities of named entities, and from them the probability P of the model;
C3. For a given model λ, find the state transition sequence Q that maximizes P(O, Q | λ).
D. Taking the multiple features of Web named entities into account, introduce an improved back-off model into the GHMM computation, use the Viterbi algorithm to select from all possible tag sequences the one with maximum probability as the final tagging result, and tag each named entity, achieving Web named entity recognition suited to multiple features;
D1. Tag feature words automatically with the Viterbi algorithm, that is, select from all possible tag sequences the one with maximum probability as the final tagging result;
D2. Compute $P(T^n)$ with a probabilistic language model, which gives the probability of a sentence $T^n = (t_1, t_2, \ldots, t_n)$:
$$T^{\#} = \arg\max_{T} \log P(T^n \mid G^n) = \arg\max_{T} \left( \log P(T^n) - \sum_{i=1}^{n} \log P(t_i) + \sum_{i=1}^{n} \log P(t_i \mid G^n) \right)$$
where $T^n = (t_1, t_2, \ldots, t_n)$ is one possible feature-word tag sequence for $G^n = (g_1, g_2, \ldots, g_n)$;
D3. Compute [formula image BSA00000579401700071] with an improved back-off model, expressed as follows:
$$P_{bo}(h \mid h', g) = \begin{cases} P_{GT}(h \mid h', g) & \text{if } C(h, h', g) > 0 \\ \alpha(h', g)\, P_{bo}(h \mid h') & \text{otherwise} \end{cases}$$
where $P_{bo}(h \mid h')$ is the probability formula of the trigram language model, $h = t_{i-n+1} \ldots t_{i-1}$, and $h' = t_{i-n+2} \ldots t_{i-1}$.
E. The model handles the Web named entity recognition process in two layers. The first layer tags simple entities, and its tagging result serves as the input to the second layer. GHMM computes the transition probabilities, and the recognized place names are passed, as a class of organization-name component, to the organization-recognition GHMM, which then identifies the organizations. This is divided into the following two steps:
E1. First identify date entities, place name entities, and simple non-nested named entities. The simple date and place name entities are identified and tagged first: every recognized date is wrapped in <DATA> and </DATA>, and every place name in <LOC> and </LOC>, giving the text tag set of the first stage, that is, of the first layer;
E2. On the basis of this first-layer text tag set, and according to the feature representations of position names and organization names, tag with the second-layer model of MR-GHMM: every recognized position name is wrapped in <POS> and </POS>, and every organization name in <ORG> and </ORG>, which completes the recognition of all named entities.
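The two-layer flow of steps E1 and E2 can be sketched as follows. Both layers use simple dictionary and pattern rules here as stand-ins for the first-layer and second-layer MR-GHMM taggers, which the source does not specify in enough detail to reproduce; the gazetteer, the suffix list, the date pattern, and the function names are hypothetical. The tag names follow the source.

```python
import re
from typing import List

LOCATIONS = {"Shanghai", "Xi'an"}             # hypothetical place-name gazetteer
ORG_SUFFIXES = {"Ltd", "Group", "Institute"}  # hypothetical organization mark words
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # assumed date format


def first_layer(tokens: List[str]) -> List[str]:
    """Layer 1: wrap simple date and place-name entities in tags."""
    tagged = []
    for tok in tokens:
        if DATE_RE.match(tok):
            tagged.append(f"<DATA>{tok}</DATA>")
        elif tok in LOCATIONS:
            tagged.append(f"<LOC>{tok}</LOC>")
        else:
            tagged.append(tok)
    return tagged


def second_layer(tagged: List[str], max_span: int = 6) -> List[str]:
    """Layer 2: build nested organization names on top of layer-1 output.

    Rule stand-in: a <LOC> token followed within `max_span` tokens by an
    organization mark word is merged into one <ORG> entity.
    """
    out, i = [], 0
    while i < len(tagged):
        end = None
        if tagged[i].startswith("<LOC>"):
            for j in range(i + 1, min(i + max_span, len(tagged))):
                if tagged[j] in ORG_SUFFIXES:
                    end = j
                    break
        if end is None:
            out.append(tagged[i])
            i += 1
        else:
            out.append("<ORG>" + " ".join(tagged[i:end + 1]) + "</ORG>")
            i = end + 1
    return out


tokens = ["Shanghai", "Manpower", "Co.", "Ltd", "interview", "2011-09-23"]
result = second_layer(first_layer(tokens))
print(result)
```

Keeping the inner `<LOC>` tag inside the `<ORG>` span preserves the nesting, which is the reason for running the layers in sequence rather than tagging everything in one pass.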
Named entity recognition is performed on recruitment web pages. For job applicants, it gives a better and more comprehensive view of the job market. For institutions and researchers concerned with education, it provides market feedback, which offers some guidance to universities and training institutions in arranging their disciplines.
Embodiment 2:
The statistical-model-based Web named entity recognition method is the same as in Embodiment 1. The named entity feature extraction in the second step (step B) of the present invention is further described as follows:
(1) The structural feature vector of Web named entities is analyzed as follows:
Named entities on a web page are usually displayed with emphasis, so recognition can take this into account. For example, a position name shown in a large red font is displayed in a way that clearly differs from other text. This characteristic of Web named entities serves mainly to stress important information, and it also meets users' need for convenient browsing.
First, represent the display style of the Web named entities on the web page, forming the feature vector [formula image BSA00000579401700074].
Structural features are the display styles of Web objects; Cascading Style Sheets (CSS) properties of the Web are introduced to describe them. Training on entities yields the structural feature attributes of entities, listed in Table 2 (the symbol images BSA00000579401700082 to BSA000005794017000816 that denote each attribute are not reproduced):
  • Font styles: font family, font size, font style, font weight, and font color;
  • Text styles: text decoration, first-line indent, and horizontal alignment;
  • Background styles: background color, background image, background repeat, background attachment (fixed), and background position.
Table 2: Structural features of Web named entities [table image BSA000005794017000817]
Introducing structural features into the feature description of Web entities captures more concretely what distinguishes Web entities from traditional plain text, and gives Web named entity recognition a more effective feature description.
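As a rough illustration of collecting such structural features, the sketch below parses an inline CSS `style` string and keeps the font, text, and background properties named above. The mapping from the source's attribute names to standard CSS property names (for example first-line indent to `text-indent`, background fixed to `background-attachment`) is an interpretation, and the function names are hypothetical.

```python
from typing import Dict

# Standard CSS property names assumed to correspond to the attributes of Table 2.
STRUCTURAL_PROPERTIES = [
    "font-family", "font-size", "font-style", "font-weight", "color",
    "text-decoration", "text-indent", "text-align",
    "background-color", "background-image", "background-repeat",
    "background-attachment", "background-position",
]


def parse_style(style: str) -> Dict[str, str]:
    """Split an inline style string into a {property: value} dict."""
    props = {}
    for declaration in style.split(";"):
        if ":" in declaration:
            name, value = declaration.split(":", 1)
            props[name.strip().lower()] = value.strip().lower()
    return props


def structural_features(style: str) -> Dict[str, str]:
    """Keep only the structural properties; missing ones map to ''."""
    props = parse_style(style)
    return {name: props.get(name, "") for name in STRUCTURAL_PROPERTIES}


features = structural_features("font-size: 18px; color: RED; font-weight: bold")
print(features["font-size"], features["color"], features["font-weight"])
```

A real page would also need styles inherited from stylesheets and parent elements, which this sketch ignores.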
(2) The text feature vector [formula image BSA000005794017000818] of Web named entities is analyzed as follows:
Text features are in fact the contextual features of the entity object; they are defined in Table 3 and Table 4. The recognition process needs only the probability that a word acts as a feature word and the transition probabilities between feature words. The feature tag set for each kind of named entity must be chosen scientifically, according to the entity's own characteristics and with expert knowledge, and the tag set must also be adjusted continually through experiments.
The present invention considers text and structural features together, so that the features represent Web entities more suitably and Web named entities are recognized more accurately.
The Web named entity is then given a text feature representation, converting the text features into a finite feature vector [formula image BSA000005794017000819]. Let $g_i = \langle f_i, w_i \rangle$, where $F^n = (f_1, f_2, \ldots, f_n)$ is the feature sequence of the words and $W^n = (w_1, w_2, \ldots, w_n)$ is the word sequence. For the feature sequence $F^n$, the feature tags introduced below are used.
In the present invention, the position name is a special, complex kind of named entity. First, a position name often contains a word denoting a type of work, such as "engineer" or "teacher". Second, the length of a position name is not fixed: some position names run to ten or even several dozen characters, while some abbreviations have only two characters, and the difficulty of determining length and boundaries makes position names harder to recognize, as in "sales and after-sales service engineer" and "skilled worker". In addition, position names often have place names nested in them, as in "Shanghai branch manager", and this nesting also affects position name recognition. Recognizing position names is thus just as complex as recognizing organization names.
The MR-GHMM model handles the Web named entity recognition process in two layers. The first layer tags simple entities; for example, "engineer", "teacher", and "Shanghai" are tagged as simple entities. The second layer recognizes complex nested entities: it takes first-layer tagging results such as "engineer", "teacher", and "Shanghai" as input and, on the basis of the simple entities identified by the first layer, recognizes complex nested entities such as "sales and after-sales service engineer".
In recruitment information, the content entered under industry is usually broad. If position information can be extracted accurately, it can supplement the industry field and make the information about suitable specialties more precise.
Table 3 gives the content features of position names. In terms of internal features, a position name consists mainly of an occupation name and a professional title, that is, of a general prefix of the feature word and a general suffix of the feature word. For example, in "environmental protection technician", "environmental protection" is the general prefix of the feature word and "technician" is the general suffix. Other common position name feature words include engineer, designer, manager, and employee.
Table 3: Position name feature table
| Feature word tag | Meaning | Example |
|---|---|---|
| P_d | Following-context trigger feature | secretary / several openings |
| P_u | Preceding-context trigger feature | recruiting / financial assistant; now hiring / real-estate agent; 4 / security guards |
| P_c | Connecting trigger feature | recruitment assistant and employee relations manager |
| P_e | Other trigger feature | |
| P_s | General suffix of the feature word | financial executive, QA manager, process engineer, sales director |
| P_ss | Special suffix of the feature word | cashier, game customer service, senior purchasing |
| P_p | General prefix of the feature word | embedded systems engineer, environmental protection technician, real-estate agent |
| P_sp | Special prefix of the feature word | Shanghai branch office authentication consultant |
| P_l | Feature word of the position name | manager, clerk, engineer, customer service |
| P_se | Abbreviation | gaffer, webmaster, nurse, back-office staff |
| P_el | Other component that is not part of a position name | |
Position names also include a special subset, marked by the special suffix and the special prefix of the feature word. For example, "purchasing" is the special suffix of the feature word in "senior purchasing", and "Shanghai branch office" is the special prefix of the feature word in "Shanghai branch office authentication consultant". There are also abbreviations such as "gaffer", "webmaster", and "nurse". In addition, a position name may have other simple named entity nouns nested in it, such as "language" in "Java language development engineer".
To handle these cases, the present invention combines statistical and rule-based methods, joining the improved MR-GHMM to the multi-feature representation of Web named entities, and so recognizes named entities better. It uses these statistical frequencies to distinguish, count, and recognize whether a given string is a standalone entity or a component of some other complex entity.
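The pairing of each word with a feature tag, $g_i = \langle f_i, w_i \rangle$, can be sketched with the Table 3 tags. The lexicon below is a hypothetical toy, and the grouping rule (a maximal run of prefix, suffix, feature-word, or abbreviation tags forms a candidate position name) is a simplification standing in for the statistical model, not the patent's method.

```python
from typing import List, Tuple

# Hypothetical toy lexicon mapping words to Table 3 feature tags.
LEXICON = {
    "recruiting": "P_u",  # preceding-context trigger
    "embedded": "P_p",    # general prefix
    "system": "P_p",
    "engineer": "P_s",    # general suffix
}
NAME_TAGS = {"P_p", "P_sp", "P_l", "P_s", "P_ss", "P_se"}


def tag_features(tokens: List[str]) -> List[Tuple[str, str]]:
    """Build <feature, word> pairs; unknown words get P_el (non-name component)."""
    return [(LEXICON.get(tok.lower(), "P_el"), tok) for tok in tokens]


def position_candidates(pairs: List[Tuple[str, str]]) -> List[str]:
    """Join each maximal run of name-component tags into one candidate."""
    spans, current = [], []
    for tag, word in pairs:
        if tag in NAME_TAGS:
            current.append(word)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


pairs = tag_features(["Recruiting", "embedded", "system", "engineer", "now"])
candidates = position_candidates(pairs)
print(pairs)
print(candidates)
```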
Table 4: Organization name feature table
| Feature word tag | Meaning | Example |
|---|---|---|
| O_d | Following-context trigger feature | Google / written test notice |
| O_u | Preceding-context trigger feature | attend / Microsoft Research Asia / interview time |
| O_c | Connecting trigger feature | Xi'an Film Studio and Changchun Film Studio |
| O_e | Other trigger feature | |
| O_s | General suffix of the feature word | Shanghai Manpower Human Resources Co., Ltd. |
| O_ss | Special suffix of the feature word | General Office of the Central Committee, drug inspection office |
| O_p | General prefix of the feature word | Shanghai Manpower Human Resources Co., Ltd. |
| O_sp | Special prefix of the feature word | Founder Group |
| O_l | Feature word of the organization name | limited company, enterprise, group |
| O_se | Abbreviation | Peking University, Shanghai Jiao Tong University |
| O_el | Other component that is not part of an organization name | |
Table 4 analyses the features of organization names. A typical organization name begins with a place name and ends with an organization marker word. For example, in "Shanghai Manpower Human Resources Co., Ltd.", "Shanghai" is the place name and "Co., Ltd." is the closing organization marker word. Other common organization marker words include: group, office, department, institute and research institute. Names of this kind can be recognized from their prefix word and a lexicon of organization marker words. Special organization names include those that begin with something other than a place name.
The model treats Web named entity recognition as a two-layer process. The first layer tags simple entities, such as "group", "company" and "Shanghai". The second layer recognizes complex nested entities: it takes the tagging result of the first layer as input and, building on the simple entities already identified, recognizes complex nested entities such as "Shanghai Manpower Human Resources Co., Ltd.".
As Table 4 shows, an organization name mainly begins with a place name and ends with an organization marker word. For example, in "Shanghai Manpower Human Resources Co., Ltd.", "Shanghai" is the place name and "Co., Ltd." is the closing organization marker word. A place name in this position is defined as the general prefix of the feature word, and the organization marker word as the general suffix of the feature word. Other common organization feature words include: group, office, department, institute and research institute.
Organization names also include a special subset, marked by the special suffix and the special prefix of the feature word. For example, "General Office" in "General Office of the Central Committee" is a special suffix of the feature word, and "Founder" in "Founder Group" is a special prefix of the feature word, because the name begins with something other than a place name. There are also abbreviations such as "Beijing University" and "Shanghai Communications University".
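The two-layer idea described above can be illustrated with a deliberately simplified, rule-only sketch. It is not the MR-GHMM itself: the lexicons `PLACE_PREFIXES` and `ORG_SUFFIXES` and both function names are hypothetical, and a real system would obtain the tags from the trained model rather than from fixed word lists. Only the tag names O_p, O_s and O_el are taken from Table 4.

```python
# Hypothetical lexicons for illustration only; the patent learns its feature
# database from sample data instead of using fixed word lists.
PLACE_PREFIXES = {"Shanghai", "Beijing", "Xi'an"}
ORG_SUFFIXES = {"Co., Ltd.", "Group", "Office", "Institute"}


def tag_simple(tokens):
    """Layer 1: tag each token as general prefix (O_p), general suffix (O_s) or other (O_el)."""
    tags = []
    for tok in tokens:
        if tok in PLACE_PREFIXES:
            tags.append("O_p")
        elif tok in ORG_SUFFIXES:
            tags.append("O_s")
        else:
            tags.append("O_el")
    return tags


def merge_nested(tokens, tags):
    """Layer 2: join each span running from an O_p tag to the next O_s tag into one name."""
    entities, start = [], None
    for i, tag in enumerate(tags):
        if tag == "O_p" and start is None:
            start = i  # remember where the candidate organization name begins
        elif tag == "O_s" and start is not None:
            entities.append(" ".join(tokens[start:i + 1]))
            start = None
    return entities
```

Calling `tag_simple` on a segmented sentence and passing its output to `merge_nested` mirrors how the first layer's tagging result becomes the input of the second layer.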
Training on sample data, MFVSM is used to obtain the multi-feature-vector feature database for each functional block of the Web page.
The present invention represents the features of different objects by the structure and the content of the Web text. Each object is represented by a multidimensional feature vector $F_i$, composed of a content feature vector $F_i^c$ and a structural feature vector $F_i^s$, as follows:
$$F_i = [F_i^c, F_i^s]$$
$$F_i^c = [\, f_1^c(f_{11}^c, f_{12}^c),\ f_2^c(f_{21}^c, f_{22}^c),\ f_3^c(f_{31}^c, f_{32}^c),\ f_4^c(f_{41}^c, f_{42}^c),\ f_5^c(f_{51}^c, f_{52}^c),\ \ldots \,]$$
$$F_i^s = [\, f_1^s(f_{11}^s, f_{12}^s, f_{13}^s, f_{14}^s, f_{15}^s, f_{16}^s, f_{17}^s),\ f_2^s(f_{21}^s, f_{22}^s, f_{23}^s),\ f_3^s(f_{31}^s, f_{32}^s, f_{33}^s, f_{34}^s, f_{35}^s),\ f_4 \ldots \,]$$
Each feature vector can be represented with the vector space model (VSM), a method for representing text features. The features of a Web object can therefore be expressed as a multi-feature vector space model (MFVSM).
Because the present invention adopts this multi-feature representation, it describes the features of Web entities more accurately, which makes recognition both more accurate and faster.
Embodiment 3:
The Web named entity recognition method based on a statistical model is the same as in Embodiments 1-2. This embodiment compares, by experiment, the effect of the MR-GHMM method with multiple features against that with a single feature:
The evaluation criteria for recognition performance are as follows.
When comparing recognition results for different entities, the present invention uses recall and precision as evaluation criteria and combines the two in a single score: the F value, a weighted harmonic mean of recall and precision.
(1) Precision P equals the number of correct answers the system produces divided by the total number of answers the system produces.
(2) Recall R equals the number of correct answers the system produces divided by the number of all possible answers in the text (both those the system found and those it missed).
(3) $F = \dfrac{(\beta^2 + 1)\,P R}{\beta^2 P + R}$ (β is usually set to 1)
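A minimal sketch of the three measures follows. The function name `f_measure` and its argument names are assumptions introduced here for illustration; the zero-division guards are likewise an added convention, not something the source specifies.

```python
def f_measure(correct, produced, possible, beta=1.0):
    """Return (precision, recall, F) from raw answer counts.

    correct:  number of correct answers the system produced
    produced: number of all answers the system produced
    possible: number of all possible answers in the text
    """
    precision = correct / produced if produced else 0.0
    recall = correct / possible if possible else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0  # avoid dividing by zero below
    b2 = beta * beta
    f = (b2 + 1) * precision * recall / (b2 * precision + recall)
    return precision, recall, f
```

With β = 1 the score reduces to 2PR / (P + R), so a system with high precision but low recall is not rewarded as it would be by a simple average.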
The selection of features and weights in the present invention is illustrated by the following experiments.
First, the multi-feature and single-feature variants of the MR-GHMM method are compared. Define $\alpha_c$ as the weight of $F_i^c$ and $\alpha_s$ as the weight of $F_i^s$. The experimental results are shown in Table 5:
Table 5. F value of MR-GHMM under different attribute weight settings
No. | Attribute weights | F (%)
1 | α_c = 1, α_s = 0 | 66.1
2 | α_c = 0, α_s = 1 | 67.3
3 | α_c = 0.2, α_s = 0.8 | 72.4
4 | α_c = 0.4, α_s = 0.6 | 81.6
5 | α_c = 0.5, α_s = 0.5 | 78.7
6 | α_c = 0.9, α_s = 0.1 | 76.5
The experiment shows that with weight settings 3, 4, 5 and 6 the F value of named entity recognition is clearly higher than with settings 1 and 2. This indicates that the multi-feature representation MFVSM proposed by the present invention, which combines structure and content, helps named entity recognition more than a single feature does.
The present invention adopts a multi-feature representation of text and structure, which is better suited to describing the features of Web named entities and improves the precision of Web named entity recognition.
Embodiment 4:
The Web named entity recognition method based on a statistical model is the same as in Embodiments 1-3. This embodiment further describes how the MR-GHMM model is established in step 3:
The present invention adopts a unified strategy: an MR-GHMM based on feature-word tagging recognizes all classes of named entities. The basic idea of feature-word tagging is to define a separate feature-word tag set for each class of named entity according to its composition and wording. The Viterbi algorithm is used to tag the segmentation result with feature words, and simple Chinese named entities are then recognized automatically on the basis of the feature-word sequence.
The parameters of the MR-GHMM model are computed. They include:
A) N: the number of states of the Markov chain in the model. Denote the N states by $S = \{S_1, S_2, \ldots, S_N\}$ and the state of the Markov chain at time t by $q_t$; clearly $q_t \in S$.
B) M: the number of possible observation values for each state. Denote the M observation values by $V_1, \ldots, V_M$ and the value observed at time t by $o_t$, where $o_t \in \{V_1, \ldots, V_M\}$.
C) π: the initial state probability vector, $\pi = (\pi_1, \ldots, \pi_N)$, where $\pi_i = P(q_1 = S_i)$, $1 \le i \le N$.
Under the first-order Markov assumption, the probability of moving from the state at time t to the state at time t+1 depends only on the state at time t and not on the state at any earlier time; likewise, the probability of the observation emitted at time t depends only on the state at time t and not on the earlier history. In practice, however, the observation probability at any moment depends not only on the current state of the system but also on its state at the previous moment. The present invention therefore assumes that the probability of moving from the state at time t to the state at time t+1 depends not only on the state at time t but also on the state at time t-1.
$$a_{ijk} = P(q_{t+1} = S_k \mid q_t = S_j,\ q_{t-1} = S_i), \quad 1 \le i, j, k \le N \tag{1}$$
where $\sum_{k=1}^{N} a_{ijk} = 1$, $a_{ijk} \ge 0$, and N is again the number of states in the model. Likewise, the probability of an observation feature vector depends not only on the current state of the system but also on its state at the previous moment:
$$b_{ij}(k_s) = P(o_{t,s} = V_k \mid q_t = S_j,\ q_{t-1} = S_i), \quad 1 \le i, j \le N,\ 1 \le k \le M \tag{2}$$
For a given model λ, the task is to find the state transition sequence Q that maximizes $P(O, Q \mid \lambda)$.
For Web named entities, MR-GHMM must take several attributes into account. MR-GHMM extends the single attribute to an attribute set $k_1, k_2, \ldots, k_Z$ and combines these attributes linearly:
$$b_{ij}(k) = \sum_{s=1}^{Z} \alpha_s\, b_{ij}(k_s)$$
where $\alpha_s$ is the weight coefficient of attribute $k_s$, $\sum_{s=1}^{Z} \alpha_s = 1$ and $0 \le \alpha_s \le 1$.
$$P(O, Q \mid \lambda) = \pi_{q_1}\, b_{q_1}(o_{1,s})\, a_{q_1 q_2} \left[ \sum_{s=1}^{Z} \alpha_s \cdot b_{q_1 q_2}^{s}(o_{2,s}) \right] \prod_{t=3}^{T} \left( a_{q_{t-2} q_{t-1} q_t} \left[ \sum_{s=1}^{Z} \alpha_s \cdot b_{q_{t-1} q_t}^{s}(o_{t,s}) \right] \right) \tag{3}$$
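To make the structure of formula (3) concrete, the following sketch evaluates the joint probability of one state sequence and one multi-attribute observation sequence. All names (`mixed_emission`, `joint_probability`, `b_first`, `a_first`, `a_second`, `b_second`) and the nested-list layout are assumptions made for illustration. The sketch also assumes that the emission at t = 1 is mixed over attributes in the same way as the later ones, which the source formula does not spell out.

```python
def mixed_emission(per_attribute, alpha, observed):
    """Return sum_s alpha[s] * b_s(o_s) for one time step.

    per_attribute[s][v] is the emission probability of symbol v for attribute s.
    """
    return sum(a * table[o] for a, table, o in zip(alpha, per_attribute, observed))


def joint_probability(q, o, pi, b_first, a_first, a_second, b_second, alpha):
    """Evaluate P(O, Q | lambda) following the structure of formula (3).

    q: state indices q_1..q_T; o: one tuple of attribute symbols per time step.
    b_first[i], a_first[i][j]: first-order terms used at t = 1 and t = 2.
    a_second[i][j][k], b_second[i][j]: second-order transition and emission terms.
    Assumption: the t = 1 emission is mixed over attributes like the later ones.
    """
    p = pi[q[0]] * mixed_emission(b_first[q[0]], alpha, o[0])
    if len(q) > 1:
        p *= a_first[q[0]][q[1]] * mixed_emission(b_second[q[0]][q[1]], alpha, o[1])
    for t in range(2, len(q)):
        p *= a_second[q[t - 2]][q[t - 1]][q[t]]
        p *= mixed_emission(b_second[q[t - 1]][q[t]], alpha, o[t])
    return p
```

For long sequences a practical implementation would sum logarithms instead of multiplying raw probabilities, in line with the underflow concern raised in Embodiment 5.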
HMM is a statistical model widely used in natural language processing. Extending it to a generalized HMM (GHMM) allows more feature representations to be taken into account, which makes it better suited to Chinese named entity recognition, especially the recognition of complex multi-feature entities such as person names, place names and organization names.
Embodiment 5:
The Web named entity recognition method based on a statistical model is the same as in Embodiments 1-4. This embodiment describes in detail step 4 of the present invention, in which Web named entities are recognized by tagging each named entity.
Tagging Chinese named entities resembles a simple part-of-speech tagging process. The present invention uses the Viterbi algorithm to tag feature words automatically, that is, it selects from all possible tag sequences the one with the highest probability as the final tagging result. The theory and derivation are as follows. Suppose the token sequence after word segmentation (the segmentation result before unknown-word recognition) is $G_n = (g_1, g_2, \ldots, g_m)$, and $T_n = (t_1, t_2, \ldots, t_m)$ is a possible feature-word tag sequence for $G_n$. n = 0 denotes the low-layer MR-GHMM and n = 1 the high-layer MR-GHMM. $T^{\#}$ is the final tagging result, i.e. the feature-word sequence with the highest probability.
According to Bayes' formula, $P(T_n \mid G_n) = \dfrac{P(T_n, G_n)}{P(G_n)}$. Since $P(G_n)$ is constant for a given token sequence, it can be ignored, and taking logarithms gives:
$$\log P(T_n \mid G_n) = \log P(T_n) + \log P(G_n \mid T_n) \tag{1}$$
$$T^{\#} = \arg\max_{T} \log P(T_n \mid G_n) = \arg\max_{T} \big( \log P(T_n) + \log P(G_n \mid T_n) \big) \tag{2}$$
Assume the conditional probabilities are independent: $P(G_n \mid T_n) = \prod_{i=1}^{m} P(g_i \mid t_i)$.
Substituting into formula (2) gives:
$$T^{\#} = \arg\max_{T} \log P(T_n \mid G_n) = \arg\max_{T} \Big( \sum_{i=1}^{m} \log P(g_i \mid t_i) + \log P(T_n) \Big) \tag{3}$$
Alternatively, the conditional-independence assumption can be replaced with mutual information. Assuming the mutual information terms are mutually independent, we obtain:
$$MI(T_n, G_n) = \sum_{i=1}^{m} MI(t_i, G_n) \tag{4}$$
$$\log \frac{P(T_n, G_n)}{P(T_n) \cdot P(G_n)} = \sum_{i=1}^{m} \log \frac{P(t_i, G_n)}{P(t_i) \cdot P(G_n)} \tag{5}$$
or, written differently:
$$\log P(T_n \mid G_n) - \log P(T_n) = \sum_{i=1}^{m} \log P(t_i \mid G_n) - \sum_{i=1}^{m} \log P(t_i) \tag{6}$$
which gives:
$$\log P(T_n \mid G_n) = \log P(T_n) - \sum_{i=1}^{m} \log P(t_i) + \sum_{i=1}^{m} \log P(t_i \mid G_n) \tag{7}$$
So the task is to find:
$$T^{\#} = \arg\max_{T} \log P(T_n \mid G_n) = \arg\max_{T} \Big( \log P(T_n) - \sum_{i=1}^{m} \log P(t_i) + \sum_{i=1}^{m} \log P(t_i \mid G_n) \Big) \tag{8}$$
The first part of this expression can be computed with the chain rule; in an n-gram model, each tag is assumed to depend on the preceding N-1 tags. The second part is the sum of the log-probabilities of the individual tags. The third part relates the tags to the observation sequence. To prevent floating-point underflow and avoid zero probabilities, the formula uses logarithms and a smoothing algorithm, which also speeds up computation. The computation of each part is described below.
To compute $P(T_n)$ in formula (8), the present invention uses a probabilistic natural language processing technique, the n-gram language model. By the chain rule, the probability of a sentence $T_n = (t_1, t_2, \ldots, t_m)$ is:
$$P(T_n) = P(t_1) \prod_{i=2}^{m} P(t_i \mid t_1, t_2, \ldots, t_{i-1}) \tag{9}$$
In practice this formula cannot be evaluated directly because of data sparsity. Take position names as an example: in "sales coordination assistant manager", "assistant manager" is the key feature word, and the two preceding nouns "sales" and "coordination" modify "assistant manager"; together they form one position name. A feasible scheme is therefore to assume that $P(t_i \mid t_1, t_2, \ldots, t_{i-1})$ depends only on the preceding N-1 tags, $(t_{i-N+1}, t_{i-N+2}, \ldots, t_{i-1})$. Possible choices include N = 0 (context-free), N = 1, N = 2 and N = 3. Given the characteristics of the entities to be extracted, a trigram model is used.
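As a worked illustration of the trigram choice, formula (9) can be written out under the assumption that each tag depends only on the two tags before it. The four-tag instance below is purely illustrative; the handling of the first two positions, which have fewer than two predecessors, is a common convention rather than something the source states.

```latex
% Trigram approximation of formula (9), assuming m >= 3
P(T_n) \approx P(t_1)\, P(t_2 \mid t_1) \prod_{i=3}^{m} P(t_i \mid t_{i-2}, t_{i-1})

% Illustrative instance for a four-tag sequence
P(t_1, t_2, t_3, t_4) \approx P(t_1)\, P(t_2 \mid t_1)\, P(t_3 \mid t_1, t_2)\, P(t_4 \mid t_2, t_3)
```

In the four-tag instance only the last factor differs from the exact chain rule: the dependence of $t_4$ on $t_1$ is dropped, which is what keeps the number of parameters manageable when data are sparse.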
To compute these conditional probabilities, the present invention uses an improved back-off model.
The improved back-off model is expressed as follows:
$$P_{bo}(h \mid h', g) = \begin{cases} P_{GT}(h \mid h', g) & \text{if } C(h, h', g) > 0 \\ \alpha(h', g)\, P_{bo}(h \mid h') & \text{otherwise} \end{cases} \tag{10}$$
$P_{bo}(h \mid h')$ is the probability formula of the trigram language model, where $h = t_{i-n+1} \ldots t_{i-1}$ and $h' = t_{i-n+2} \ldots t_{i-1}$. $P_{GT}(h \mid h', g)$ is the language model after smoothing. Many possible word sequences may never appear in the training corpus. If $P_{bo}(h \mid h', g)$ is 0 for one of them, the probability of the whole sentence is also 0, so the language model must be smoothed. Smoothing takes a small share of the probability mass of known events and distributes it evenly among unknown events, that is, events whose occurrence count is 0. The Good-Turing smoothing algorithm is used.
$$P_{GT}(h \mid h', g) = \frac{C(h, h', g)}{C(h', g)} \tag{11}$$
$$C_{GT}(h, h', g) = \big( C(h, h', g) + 1 \big) \times \frac{N\big( C(h, h', g) + 1 \big)}{N\big( C(h, h', g) \big)} \tag{12}$$
$$\alpha(h', g) = \frac{\beta(h', g)}{\sum_{h:\, C(h', h, g) = 0} p(h \mid h')} = \frac{\beta(h', g)}{1 - \sum_{h:\, C(h', h, g) > 0} p(h \mid h')} \tag{13}$$
$$\beta(h', g) = 1 - \sum_{h:\, C(h', h, g) > 0} p_{GT}(h \mid h', g) \tag{14}$$
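The count adjustment of formula (12) can be sketched as follows, reading N(c) as the number of distinct n-grams observed exactly c times. The function name `good_turing_counts` and the dictionary layout are assumptions; the fallback to the raw count when N(c+1) is zero is also an added convention, since the source does not say how that case is handled.

```python
from collections import Counter


def good_turing_counts(ngram_counts):
    """Return adjusted counts c* = (c + 1) * N(c + 1) / N(c) for every n-gram.

    ngram_counts maps each n-gram to its raw count c; N(c) is the number of
    distinct n-grams whose raw count is exactly c.
    """
    freq_of_freq = Counter(ngram_counts.values())
    adjusted = {}
    for ngram, c in ngram_counts.items():
        n_c, n_c1 = freq_of_freq[c], freq_of_freq[c + 1]
        # Assumption: keep the raw count when no n-gram was seen c + 1 times.
        adjusted[ngram] = (c + 1) * n_c1 / n_c if n_c1 else float(c)
    return adjusted
```

The adjusted counts of frequently seen n-grams shrink slightly, and the probability mass freed in this way is what the back-off weight in formulas (13) and (14) redistributes to unseen events.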
The present invention uses the Viterbi algorithm to tag the segmentation result with feature words (a process similar to simple part-of-speech tagging), performs simple pattern recognition on the resulting feature-word sequence, and thereby recognizes Chinese named entities automatically.
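The decoding step can be sketched with a standard first-order Viterbi search in log space. This is a simplification: the model described in this document is second-order and multi-attribute, and unseen tokens would need the smoothing discussed above, whereas this sketch simply expects every token to be present in the emission table. The function names and the dictionary layout are assumptions.

```python
import math


def viterbi(observations, states, log_pi, log_a, log_b):
    """Return the most probable state sequence for a first-order HMM (log space)."""
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: log_pi[s] + log_b[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: best[t - 1][r] + log_a[r][s])
            best[t][s] = best[t - 1][prev] + log_a[prev][s] + log_b[s][observations[t]]
            back[t][s] = prev
    # Trace the best path backwards from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


def to_log(table):
    """Convert a (possibly nested) dict of probabilities to log-probabilities."""
    return {k: to_log(v) if isinstance(v, dict) else math.log(v) for k, v in table.items()}
```

Working with sums of logarithms rather than products of probabilities is what prevents floating-point underflow on long token sequences.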
Embodiment 6:
The Web named entity recognition method based on a statistical model is the same as in Embodiments 1-5. The present invention combines statistical and rule-based methods, using the improved MR-GHMM together with the multi-feature representation of Web named entities, to carry out named entity recognition experiments. Named entity recognition here targets information extraction from specific Web pages. For recruitment information, this mainly means extracting position names, organization names (company names), place names and time entities.
For recruitment Web pages covering six kinds of occupation, the MR-GHMM model proposed by the present invention is used to recognize Web named entities. Recall, precision and the F value that combines them are again used as evaluation criteria. The experimental results are shown in Tables 6, 7 and 8 below.
Table 6. Precision of named entity recognition based on MR-GHMM
Figure BSA00000579401700164
Table 7. Recall of named entity recognition based on MR-GHMM
Figure BSA00000579401700165
Table 8. F value of named entity recognition based on MR-GHMM
Figure BSA00000579401700171
Tables 6 and 7 give the experimental precision and recall of the present invention for Web named entity recognition, and Table 8 gives the F value computed from them. The results show the following. Comparing across entity types, the MR-GHMM model recognizes relatively simple attributes such as "place" and "time" with accuracy close to 100%, whereas the recognition rate for "position name" and "organization name" is lower. For these entity types the contextual feature words are more complex, and nesting and abbreviations occur, so recognition accuracy is not as high. Comparing across page categories, the MR-GHMM model achieves a higher recognition rate on "biomedicine" and "computer" pages than on the other categories, because the technical terms in these two categories are more standardized and therefore easier to recognize. For place-name entities, most place names are already in the feature table after the training stage, so extraction accuracy is comparatively high. For position names and organization names, however, some organizations are usually referred to by abbreviations, so the feature table obtained in training may not contain the organization names met in the test stage, and performance is lower.
The following experiment compares the P, R and F of three methods, HMM, CRFs and MR-GHMM, as shown in Table 9. HMM and CRFs were each tested and compared with the recognition results of the algorithm proposed by the present invention.
Table 9. Comparison of extraction results for the three extraction methods
Figure BSA00000579401700172
The experimental results show that for "position name" and "organization name" entities, the accuracy of the present invention is clearly higher than that of the other two models. Because the MR-GHMM model takes into account both the structural features of Web named entities and their content and semantic features, it is better suited to recognizing named entities in Web pages. For "place" and "time" entities, the MR-GHMM model improves extraction accuracy only slightly over the other two models: most place names are already in the feature table after the training stage, so all three models extract them with similar accuracy. For nested entities such as "position name" and "organization name", the advantage of the present invention is more pronounced, because it takes the specific nesting characteristics of complex entities into account and recognizes them with a multi-layer model.
The average recall, precision and F value achieved by the improved MR-GHMM over the three attributes are shown in Fig. 2:
In Fig. 2, the horizontal axis is the size of the training sample set and the vertical axis is the F value of MR-GHMM. The curves show that, likewise using 55% of the samples for training, once the number of training samples exceeds 370, recall R and precision P remain essentially unchanged. When processing larger data sets, therefore, it is not necessary to train on a large number of samples. Using fewer training instances reduces the total number of state transitions and the number of symbols in the input symbol sequence, which improves the operating efficiency of the system.
By modifying the entity features for a given domain within the multi-feature representation of Web named entities, the present invention can be applied to recognizing named entities in different domains on the Web. For example, a representation of product entities supports extracting information from Web pages on product quotations and product sales; a representation of medicine entities supports extracting information from Web pages on disease treatment and the use of medicines.
The present invention is a Web named entity recognition method based on a statistical model. It represents Web named entities with multiple features drawn from structure and text; combines statistical and rule-based methods, using the improved MR-GHMM to make training more efficient; tags entities with the improved hidden Markov model, tagging each named entity to achieve entity recognition; and treats complex Web named entity recognition as a two-layer process in which the tagging result of the first layer serves as input to the second layer, which recognizes complex nested entities. Compared with the original recognition algorithm, the recognition accuracy of this algorithm is higher and the time complexity of model training is significantly lower. By modifying the entity features for a given domain within the multi-feature representation, the method can be applied to recognizing named entities in different domains on the Web.

Claims (4)

1. A Web named entity recognition method based on a statistical model, characterized in that the method comprises the following steps:
A. Preprocess the raw corpus of Web text by word segmentation and map the original text onto an abstract symbol set, giving a symbolic description of the text in preparation for the subsequent machine learning;
B. Establish the corresponding structural-feature and text-feature representations for named entities and build a feature database of named entities; use the MFVSM multi-feature-vector representation to extract features for each named entity of the Web page;
C. Apply a probabilistic statistical algorithm to establish the MR-GHMM model, and use the generalized Baum-Welch algorithm to compute the initial state probabilities, state transition probabilities and state emission probabilities of the model, thereby solving the learning problem of MR-GHMM;
D. Combining the multiple features of Web named entities, introduce an improved back-off model into the computation of the GHMM model, use the Viterbi algorithm to select from all possible tag sequences the one with the highest probability as the final tagging result, and tag each named entity, thereby achieving Web named entity recognition suited to multiple features;
E. The MR-GHMM model treats Web named entity recognition as a two-layer process: the first layer tags simple entities; the second layer recognizes complex nested entities, using MR-GHMM to compute the transition probabilities, taking the tagging result of the first layer as its input and recognizing complex nested entities on the basis of the simple entities identified by the first layer.
2. The Web named entity recognition method based on a statistical model according to claim 1, characterized in that the named entity feature extraction in said step B comprises the following steps:
B1. First represent the display style of the Web named entities of the Web page, forming the structural feature vector $F_i^s$;
B2. Then represent the text features of the Web named entities of the Web page, converting the text features into a finite feature vector $F_i^c$;
B3. Train on the sample data and use MFVSM to form the multi-feature-vector representation of each named entity of the Web page, $F_i = [F_i^c, F_i^s]$, thereby achieving feature extraction for the named entities.
3. The Web named entity recognition method based on a statistical model according to claim 1 or 2, characterized in that establishing the MR-GHMM model in said step C comprises the following steps:
C1. Compute the parameters of the MR-GHMM model;
C2. According to the feature representations established in the feature database, train on the original corpus to obtain the transition probabilities of the named entities, and thereby the probability P of the model;
C3. For a given model λ, find the state transition sequence Q that maximizes $P(O, Q \mid \lambda)$.
4. The Web named entity recognition method based on a statistical model according to claim 3, characterized in that the Web named entity recognition in said step D comprises the following steps:
D1. Tag feature words automatically with the Viterbi algorithm, that is, select from all possible tag sequences the one with the highest probability as the final tagging result;
D2. Compute $P(T_n)$ using natural language processing based on probability statistics, computing the probability of a sentence $T_n = (t_1, t_2, \ldots, t_m)$:
$$T^{\#} = \arg\max_{T} \log P(T_n \mid G_n) = \arg\max_{T} \Big( \log P(T_n) - \sum_{i=1}^{m} \log P(t_i) + \sum_{i=1}^{m} \log P(t_i \mid G_n) \Big)$$
where $T_n = (t_1, t_2, \ldots, t_m)$ is a possible feature-word tag sequence for $G_n = (g_1, g_2, \ldots, g_m)$;
D3. Compute the conditional probabilities with an improved back-off model, expressed as follows:
$$P_{bo}(h \mid h', g) = \begin{cases} P_{GT}(h \mid h', g) & \text{if } C(h, h', g) > 0 \\ \alpha(h', g)\, P_{bo}(h \mid h') & \text{otherwise} \end{cases}$$
where $P_{bo}(h \mid h')$ is the probability formula of the trigram language model, with $h = t_{i-n+1} \ldots t_{i-1}$ and $h' = t_{i-n+2} \ldots t_{i-1}$.
CN201110284429A 2011-09-22 2011-09-22 Method for identifying Web named entity based on statistical model Pending CN102314417A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110284429A CN102314417A (en) 2011-09-22 2011-09-22 Method for identifying Web named entity based on statistical model

Publications (1)

Publication Number Publication Date
CN102314417A true CN102314417A (en) 2012-01-11

Family

ID=45427600

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110284429A Pending CN102314417A (en) 2011-09-22 2011-09-22 Method for identifying Web named entity based on statistical model

Country Status (1)

Country Link
CN (1) CN102314417A (en)

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609853A (en) * 2012-02-27 2012-07-25 蒋永 System and method for intelligently identifying names and models of commodities
CN103377186A (en) * 2012-04-26 2013-10-30 富士通株式会社 Web service integration device, method and equipment based on identity of named entity
CN104615589A (en) * 2015-02-15 2015-05-13 百度在线网络技术(北京)有限公司 Named-entity recognition model training method and named-entity recognition method and device
CN105045913A (en) * 2015-08-14 2015-11-11 北京工业大学 Text classification method based on WordNet and latent semantic analysis
CN105426464A (en) * 2015-11-13 2016-03-23 北大方正集团有限公司 Method and device for identifying named entities
CN105550227A (en) * 2015-12-07 2016-05-04 中国建设银行股份有限公司 Named entity identification method and device
CN103257983B (en) * 2012-09-10 2016-06-15 苏州大学 Deep Web entity identification method based on uniqueness constraint
CN106202054A (en) * 2016-07-25 2016-12-07 哈尔滨工业大学 A kind of name entity recognition method learnt based on the degree of depth towards medical field
CN106355628A (en) * 2015-07-16 2017-01-25 中国石油化工股份有限公司 Image-text knowledge point marking method and device and image-text mark correcting method and system
CN106407183A (en) * 2016-09-28 2017-02-15 医渡云(北京)技术有限公司 Method and device for generating medical named entity recognition system
CN106598950A (en) * 2016-12-23 2017-04-26 东北大学 Method for recognizing named entity based on mixing stacking model
CN106649272A (en) * 2016-12-23 2017-05-10 东北大学 Named entity recognizing method based on mixed model
WO2017097166A1 (en) * 2015-12-11 2017-06-15 北京国双科技有限公司 Domain named entity recognition method and apparatus
CN106997342A (en) * 2017-03-27 2017-08-01 上海奔影网络科技有限公司 Intension recognizing method and device based on many wheel interactions
CN107870966A (en) * 2017-08-11 2018-04-03 成都萌想科技有限责任公司 A kind of recruitment general regulations data pick-up method based on semantic model
CN107943786A (en) * 2017-11-16 2018-04-20 广州市万隆证券咨询顾问有限公司 A kind of Chinese name entity recognition method and system
WO2018086519A1 (en) * 2016-11-08 2018-05-17 北京国双科技有限公司 Method and device for identifying specific text information
CN108509423A (en) * 2018-04-04 2018-09-07 福州大学 A kind of acceptance of the bid webpage name entity abstracting method based on second order HMM
CN109522553A (en) * 2018-11-09 2019-03-26 龙马智芯(珠海横琴)科技有限公司 Name recognition methods and the device of entity
CN110210023A (en) * 2019-05-23 2019-09-06 竹间智能科技(上海)有限公司 A kind of calculation method of practical and effective name Entity recognition
CN110866402A (en) * 2019-11-18 2020-03-06 北京香侬慧语科技有限责任公司 Named entity identification method and device, storage medium and electronic equipment
WO2020132851A1 (en) * 2018-12-25 2020-07-02 Microsoft Technology Licensing, Llc Date extractor
CN111767733A (en) * 2020-06-11 2020-10-13 安徽旅贲科技有限公司 Document security classification discrimination method based on statistical word segmentation
WO2021082366A1 (en) * 2019-10-28 2021-05-06 南京师范大学 Interactive and iterative learning-based intelligent construction method for geographical name tagging corpus
CN112818667A (en) * 2021-01-29 2021-05-18 上海寻梦信息技术有限公司 Address correction method, system, device and storage medium
CN113033207A (en) * 2021-04-07 2021-06-25 东北大学 Biomedical nested type entity identification method based on layer-by-layer perception mechanism

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
FEIFAN LIU ETC.: "Product Named Entity Recognition Based on Hierarchical Hidden Markov Model", 《PROC. OF ACL FOURTH SIGHAN WORKSHOP WITH IJCNLP》 *
GUODONG ZHOU ETC.: "Named Entity Recognition using an HMM-based Chunk Tagger", 《PROCEEDINGS OF THE 40TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL)》 *
JING WANG ETC.: "A probabilistic model with multi-dimensional features for object extraction", 《FRONT.COMPUT.SCI.》 *
JINLIN CHEN ETC.: "Detecting Web Content Function Using Generalized Hidden Markov Model", 《PROCEEDINGS OF THE 5TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS(ICMLA"06)》 *
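张剑: "Research on English Named Entity Recognition Based on CRF", 《Master's Thesis, Harbin Institute of Technology》 *
王静: "Research and System Design of GHMM-Based Web Text Information Extraction Technology", 《INFORMATION SCIENCE AND TECHNOLOGY》 *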

Cited By (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609853A (en) * 2012-02-27 2012-07-25 蒋永 System and method for intelligently identifying names and models of commodities
CN103377186A (en) * 2012-04-26 2013-10-30 富士通株式会社 Web service integration device, method and equipment based on named entity recognition
CN103377186B (en) * 2012-04-26 2016-03-16 富士通株式会社 Web service integration device, method and equipment based on named entity recognition
CN103257983B (en) * 2012-09-10 2016-06-15 苏州大学 Deep Web entity identification method based on uniqueness constraint
CN104615589A (en) * 2015-02-15 2015-05-13 百度在线网络技术(北京)有限公司 Named-entity recognition model training method and named-entity recognition method and device
CN106355628B (en) * 2015-07-16 2019-07-05 中国石油化工股份有限公司 Image-text knowledge point marking method and device and image-text mark correcting method and system
CN106355628A (en) * 2015-07-16 2017-01-25 中国石油化工股份有限公司 Image-text knowledge point marking method and device and image-text mark correcting method and system
CN105045913A (en) * 2015-08-14 2015-11-11 北京工业大学 Text classification method based on WordNet and latent semantic analysis
CN105045913B (en) * 2015-08-14 2018-08-28 北京工业大学 Text classification method based on WordNet and latent semantic analysis
CN105426464A (en) * 2015-11-13 2016-03-23 北大方正集团有限公司 Method and device for identifying named entities
CN105426464B (en) * 2015-11-13 2019-03-29 北大方正集团有限公司 Method and device for identifying named entities
CN105550227A (en) * 2015-12-07 2016-05-04 中国建设银行股份有限公司 Named entity identification method and device
CN106874256A (en) * 2015-12-11 2017-06-20 北京国双科技有限公司 Method and device for recognizing domain named entities
WO2017097166A1 (en) * 2015-12-11 2017-06-15 北京国双科技有限公司 Domain named entity recognition method and apparatus
CN106202054A (en) * 2016-07-25 2016-12-07 哈尔滨工业大学 Deep learning-based named entity recognition method for the medical field
CN106202054B (en) * 2016-07-25 2018-12-14 哈尔滨工业大学 Deep learning-based named entity recognition method for the medical field
CN106407183B (en) * 2016-09-28 2019-06-28 医渡云(北京)技术有限公司 Method and device for generating medical named entity recognition system
CN106407183A (en) * 2016-09-28 2017-02-15 医渡云(北京)技术有限公司 Method and device for generating medical named entity recognition system
WO2018086519A1 (en) * 2016-11-08 2018-05-17 北京国双科技有限公司 Method and device for identifying specific text information
US11010554B2 (en) 2016-11-08 2021-05-18 Beijing Gridsum Technology Co., Ltd. Method and device for identifying specific text information
CN106649272B (en) * 2016-12-23 2019-06-25 东北大学 Named entity recognizing method based on mixed model
CN106649272A (en) * 2016-12-23 2017-05-10 东北大学 Named entity recognizing method based on mixed model
CN106598950A (en) * 2016-12-23 2017-04-26 东北大学 Method for recognizing named entity based on mixing stacking model
CN106997342A (en) * 2017-03-27 2017-08-01 上海奔影网络科技有限公司 Intent recognition method and device based on multi-round interaction
CN107870966A (en) * 2017-08-11 2018-04-03 成都萌想科技有限责任公司 Semantic model-based data extraction method for recruitment general regulations
CN107943786A (en) * 2017-11-16 2018-04-20 广州市万隆证券咨询顾问有限公司 Chinese named entity recognition method and system
CN107943786B (en) * 2017-11-16 2021-12-07 广州市万隆证券咨询顾问有限公司 Chinese named entity recognition method and system
CN108509423A (en) * 2018-04-04 2018-09-07 福州大学 Second-order HMM-based winning bid web page named entity extraction method
CN109522553A (en) * 2018-11-09 2019-03-26 龙马智芯(珠海横琴)科技有限公司 Named entity recognition method and device
WO2020132851A1 (en) * 2018-12-25 2020-07-02 Microsoft Technology Licensing, Llc Date extractor
US11321529B2 (en) 2018-12-25 2022-05-03 Microsoft Technology Licensing, Llc Date and date-range extractor
CN110210023A (en) * 2019-05-23 2019-09-06 竹间智能科技(上海)有限公司 Practical and effective calculation method for named entity recognition
WO2021082366A1 (en) * 2019-10-28 2021-05-06 南京师范大学 Interactive and iterative learning-based intelligent construction method for geographical name tagging corpus
CN110866402A (en) * 2019-11-18 2020-03-06 北京香侬慧语科技有限责任公司 Named entity identification method and device, storage medium and electronic equipment
CN110866402B (en) * 2019-11-18 2023-11-28 北京香侬慧语科技有限责任公司 Named entity identification method and device, storage medium and electronic equipment
CN111767733A (en) * 2020-06-11 2020-10-13 安徽旅贲科技有限公司 Document security classification discrimination method based on statistical word segmentation
CN112818667A (en) * 2021-01-29 2021-05-18 上海寻梦信息技术有限公司 Address correction method, system, device and storage medium
CN113033207A (en) * 2021-04-07 2021-06-25 东北大学 Biomedical nested type entity identification method based on layer-by-layer perception mechanism
CN113033207B (en) * 2021-04-07 2023-08-29 东北大学 Biomedical nested type entity identification method based on layer-by-layer perception mechanism

Similar Documents

Publication Publication Date Title
CN102314417A (en) Method for identifying Web named entity based on statistical model
CN110427623B (en) Semi-structured document knowledge extraction method and device, electronic equipment and storage medium
CN110633409B (en) Automobile news event extraction method integrating rules and deep learning
CN111966917B (en) Event detection and summarization method based on pre-training language model
CN109753660B (en) LSTM-based winning bid web page named entity extraction method
CN111488726B (en) Unstructured text extraction multitasking joint training method based on pointer network
CN104391942B (en) Short text feature expansion method based on semantic graph
US20230195773A1 (en) Text classification method, apparatus and computer-readable storage medium
CN104834747B (en) Short text classification method based on convolutional neural networks
CN109635280A (en) Event extraction method based on marking
CN103207855A (en) Fine-grained sentiment analysis system and method specific to product comment information
CN107315738A (en) Method for evaluating the innovation degree of text information
CN112836046A (en) Entity recognition method for policy and regulation texts in the four-insurances-and-one-fund domain
CN104484380A (en) Personalized search method and personalized search device
CN106407113A (en) Bug positioning method based on Stack Overflow and commit libraries
CN111710428B (en) Biomedical text representation method for modeling global and local context interaction
CN109492230A (en) Method for extracting key information from insurance contracts based on a text region-of-interest convolutional neural network
Nguyen et al. Vlsp shared task: Named entity recognition
CN114091450B (en) Judicial domain relation extraction method and system based on graph convolution network
CN108763192B (en) Entity relation extraction method and device for text processing
CN112131453A (en) Method, device and storage medium for detecting network bad short text based on BERT
CN112699685A (en) Named entity recognition method based on label-guided word fusion
Khine et al. Applying deep learning approach to targeted aspect-based sentiment analysis for restaurant domain
CN114722810A (en) Real estate customer portrait method and system based on information extraction and multi-attribute decision
CN109189848A (en) Knowledge data extraction method, system, computer equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120111