CN107330011B

CN107330011B - Method and device for named entity recognition based on multi-strategy fusion

Info

Publication number: CN107330011B
Application number: CN201710447439.2A
Authority: CN
Inventors: 赵红红; 王萌萌; 晋耀红; 蒋宏飞; 杨凯程; 董铭慆
Original assignee: China Science And Technology (beijing) Co Ltd; Beijing Shenzhou Taiyue Software Co Ltd
Current assignee: Dingfu Intelligent Technology Co., Ltd
Priority date: 2017-06-14
Filing date: 2017-06-14
Publication date: 2019-03-26
Anticipated expiration: 2037-06-14
Also published as: CN107330011A

Abstract

This application discloses a method and a device for named entity recognition based on multi-strategy fusion. A first recognition model is used to recognize the named entities in a corpus, yielding a first recognition result. In the method provided by this application, the first recognition model can update and expand the corpus library, so that newly emerging named entities in the corpus can be recognized and the first recognition result has a higher accuracy. A method that fuses multiple recognition models is then used to recognize the named entities in the corpus, yielding a second recognition result. The first recognition result and the second recognition result are fused to obtain a third recognition result. A semantic mining system then assigns roles to the third recognition result and outputs named entities with roles. In this way, named entities are recognized reliably, and roles are assigned to them, even when data volumes are massive, entity types are diverse and new words keep emerging.

Description

Method and device for named entity recognition based on multi-strategy fusion

Technical field

This application relates to the field of natural language processing, and in particular to a method and a device for named entity recognition based on multi-strategy fusion.

Background art

Named entities are person names, organization names, place names and all other entities identified by a name. They are the basic information elements of a text and an important carrier of information, and they form the basis for correctly understanding and processing textual information. Chinese named entity recognition is one of the basic tasks in natural language processing. Its main task is to identify and classify the named entities and meaningful numeral-quantifier phrases that occur in a text, mainly person names, place names, organization names, time expressions, dates, numerical expressions and the like.

In natural language processing research, named entity recognition plays an important role in application fields such as information retrieval, information extraction, machine translation and text classification. It can significantly improve the performance of application systems for information retrieval, summary extraction, information extraction, machine translation and text classification, and it lays the foundation for automatically acquiring knowledge from text. The accuracy and recall of named entity recognition directly determine the performance of the whole language understanding process, including syntactic analysis and semantic analysis.

Over the past decade, scholars at home and abroad have extensively discussed and studied techniques for recognizing entities in text. However, with the rapid development of the Internet, large volumes of informal, multi-domain text data keep growing, which places new demands on the accuracy and recall of named entity recognition. In addition, the market also needs roles to be assigned to the recognized named entities. Therefore, whether to meet market demand or to improve accuracy and recall, named entity recognition methods need further improvement.

Commonly used named entity recognition methods currently fall into two broad classes: methods based on rules and knowledge, and methods based on statistics. Rule- and knowledge-based methods were the earliest to be used. They are simple and convenient, but they require a large amount of manual observation and port poorly. Statistics-based methods treat named entity recognition as a classification problem and apply classifiers such as support vector machines and Bayesian models. Named entity recognition can also be treated as a sequence labelling problem, with a sequence labelling model obtained through machine learning methods such as hidden Markov chains, maximum entropy Markov chains and conditional random fields. However, these methods either struggle to handle named entity recognition on today's large volumes of informal, multi-domain, rapidly changing text, or their accuracy and recall are low.

For example, Chinese patent CN201610943210.3 discloses a named entity recognition method and device based on artificial intelligence. That method performs named entity recognition on the text to be recognized using both a conditional random field model and a function model generated from retrieval logs within a preset time period. Its defect lies in the preset entity vocabulary function model used in the second recognition: the model first obtains all candidate named entity words in the text by means such as dictionaries and rule matching, and then judges the confidence of each candidate as a named entity word. Rule-based methods tend to depend on the specific language, domain and text format; compiling the rules is time-consuming and error-prone and requires experienced linguists; and dictionary coverage is relatively low. That method therefore struggles to handle named entity recognition on today's large volumes of informal, multi-domain, rapidly changing text.

As another example, Chinese patent CN201510889318.4 discloses a named entity recognition method suited to social networks. That method uses an initially constructed first sequence labelling model to obtain a first entity probability distribution for the training documents and a second entity probability distribution for the test documents, then extracts similarity features from social network information, trains a second sequence labelling model based on those features, and labels the test documents with the second model to obtain the named entity recognition result. The accuracy and recall of that method are ultimately low: the F value of recognition is only 33.19%.

Therefore, there is a need for a named entity recognition method and a named entity recognition device that can cope with the new situation of massive data, diverse entity types and continually emerging new words, that achieve higher recall and accuracy, and that can also assign roles to the recognized named entities.

Summary of the invention

This application provides a method and a device for named entity recognition based on multi-strategy fusion, to solve the problem that, when data is massive, entity types are diverse and new words keep emerging, the accuracy and recall of named entity recognition are low and roles cannot be assigned to named entities.

In a first aspect, this application provides a named entity recognition method based on multi-strategy fusion, the method comprising:

obtaining a corpus;

recognizing the named entities in the corpus using a first recognition model to obtain a first recognition result;

recognizing the named entities in the corpus using a second recognition model to obtain a second recognition result;

fusing the first recognition result and the second recognition result to obtain a third recognition result.

Optionally, the first recognition model is a conditional random field model.

Optionally, before the step of recognizing the named entities in the corpus using the first recognition model to obtain the first recognition result, the method further comprises:

establishing a corpus library;

performing part-of-speech tagging and sequence labelling on the corpus in the corpus library;

training with a CRF toolkit, using the labelled corpus as training data, to obtain the first recognition model.

Optionally, the step of recognizing the named entities in the corpus using the second recognition model to obtain the second recognition result comprises:

recognizing the corpus using at least two recognition models, each recognition model producing one sub-recognition result, to generate a sub-recognition result list;

judging whether the recognition results in the sub-recognition result list meet an output condition, and outputting the second recognition result if so;

the output condition being that, in the sub-recognition result list, the number of occurrences of an identical named entity reaches a preset value, where the preset value is the mode of the at least two recognition models.

Optionally, the step of recognizing the named entities in the corpus using the first recognition model to obtain the first recognition result comprises:

judging whether the recognition results in the sub-recognition result list meet the output condition, and outputting the first recognition result if so;

the second recognition model is a conditional random field model;

before the step of recognizing the corpus using the second recognition model to obtain the second recognition result, the method further comprises:

establishing a corpus library;

training with a CRF toolkit, using the labelled corpus as training data, to obtain the second recognition model.

The step of fusing the first recognition result and the second recognition result to obtain the third recognition result comprises:

judging whether the first recognition result and the second recognition result meet a fusion condition, fusing them if so, and outputting the fused result, namely the third recognition result.

Optionally, the fusion condition is that the first recognition result and the second recognition result contain identical named entities.

Optionally, after the third recognition result is obtained, the method further comprises: assigning roles to the third recognition result using a semantic mining system to generate named entities with roles.

Optionally, the role assignment consists of using the semantic mining system to label each named entity in the third recognition result with a role and to output each named entity with its role.

Optionally, the semantic mining system comprises regular expressions and text.

In a second aspect, this application also provides a named entity recognition device based on multi-strategy fusion, the device comprising:

a corpus obtaining unit, configured to obtain a corpus;

a first recognition unit, configured to recognize the named entities in the corpus using a first recognition model to obtain a first recognition result;

a second recognition unit, configured to recognize the named entities in the corpus using a second recognition model to obtain a second recognition result;

a recognition result fusion unit, configured to fuse the first recognition result and the second recognition result to obtain a third recognition result.

Optionally, the first recognition model is a conditional random field model.

Optionally, the first recognition unit further comprises a model training unit, the model training unit being configured to:

establish a corpus library;

Optionally, the second recognition unit comprises the following subunits:

a multi-strategy recognition unit, configured to recognize the named entities in the corpus using at least two recognition models, each recognition model producing one sub-recognition result, to generate a sub-recognition result list;

a recognition result output unit, configured to judge whether the recognition results in the sub-recognition result list meet an output condition, and to output the second recognition result if so.

Optionally, the output condition is that, in the sub-recognition result list, the number of occurrences of an identical named entity reaches a preset value, where the preset value is the mode of the at least two recognition models.

Optionally, the first recognition unit comprises the following subunit:

a recognition result output unit, configured to judge whether the recognition results in the sub-recognition result list meet the output condition, and to output the first recognition result if so;

Optionally, the second recognition model is a conditional random field model;

the second recognition unit further comprises a model training unit, the model training unit being configured to:

establish a corpus library;

Optionally, the recognition result fusion unit is configured to judge whether the first recognition result and the second recognition result meet a fusion condition, to fuse them if so, and to output the fused result, namely the third recognition result.

Optionally, fusion means adding, on the basis of the first recognition result, the named entities newly added in the second recognition result;

Optionally, the fusion condition is that the second recognition result contains named entities that are new relative to the first recognition result.

Optionally, the device further comprises a role assignment unit, configured to assign roles to the third recognition result using a semantic mining system to generate named entities with roles.

Optionally, the role assignment unit is configured to use the semantic mining system to label each named entity in the third recognition result with a role and to output each named entity with its role.

Optionally, the semantic mining system comprises regular expressions and text.

Brief description of the drawings

To explain the technical solutions of this application more clearly, the drawings needed in the embodiments are briefly introduced below. Evidently, a person of ordinary skill in the art can derive other drawings from these drawings without creative effort.

Fig. 1 is a flowchart of a named entity recognition method based on multi-strategy fusion provided by an embodiment of this application;

Fig. 2 is a flowchart of the method for the conditional random field model provided by an embodiment of this application;

Fig. 3 is a schematic structural diagram of the named entity recognition device provided by an embodiment of this application;

Fig. 4 is a schematic structural diagram of the computer system 400 provided by an embodiment of this application;

Fig. 5 is a line chart of the accuracy, recall and F value results of Experimental Example 1;

Fig. 6 is a line chart of the accuracy, recall and F value results of Experimental Example 2.

Detailed description of the embodiments

This application is described in detail below; its features and advantages will become clearer from this description.

The word "exemplary" is used herein to mean "serving as an example, embodiment or illustration". Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred over or superior to other embodiments. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless otherwise indicated.

This application is described below.

According to a first aspect of this application, a named entity recognition method based on multi-strategy fusion is provided. A first recognition model is used to recognize the named entities in a corpus, yielding a first recognition result. In the method provided by this application, the first recognition model can update and expand the corpus library, so that newly emerging named entities in the corpus can be recognized and the first recognition result has a higher accuracy. A method that fuses multiple recognition models is then used to recognize the named entities in the corpus, yielding a second recognition result, and the first and second recognition results are fused to obtain a third recognition result. Named entities are thereby recognized reliably even when data volumes are massive, entity types are diverse and new words keep emerging. Optionally, a semantic mining system then assigns roles to the third recognition result and outputs named entities with roles, so that roles are assigned to the recognized named entities.

Specifically, as shown in Fig. 1, the named entity recognition method comprises:

S101, obtaining a corpus;

S102, recognizing the named entities in the corpus using a first recognition model to obtain a first recognition result;

S103, recognizing the named entities in the corpus using a second recognition model to obtain a second recognition result;

S104, fusing the first recognition result and the second recognition result to obtain a third recognition result;

Optionally, the method further comprises S105, assigning roles to the third recognition result using a semantic mining system to generate named entities with roles.

In this application, the corpus refers to text used for training or recognition.

In a preferred embodiment of this application, the first recognition model is a conditional random field model, that is, a CRF (Conditional Random Fields) model. A CRF models the global probability: when normalizing, it considers the distribution of the data globally rather than normalizing only locally, which avoids the label bias problem.

In this application, as shown in Fig. 2, when a CRF model is selected as the first recognition model, the following steps are performed before the corpus is recognized using the first recognition model to obtain the first recognition result:

S301, establishing a corpus library;

S302, performing part-of-speech tagging and sequence labelling on the corpus in the corpus library;

S303, training with a CRF toolkit, using the labelled corpus as training data, to obtain the first recognition model.

In this application, the corpus library refers to the collection of corpus texts in which named entities are to be recognized. For example, in a person-name recognition method for the public security system, the corpus library is a collection of written records; in a named entity recognition method for the medical system, the corpus library is a collection of medical cases; where no specific domain applies, a collection of texts obtained from the web by a crawler can also be used.

In this application, establishing the corpus library includes corpus import, that is, importing corpus texts into the above corpus library.

In this application, the corpus in the corpus library is first processed into a format that the CRF toolkit can read; that is, part-of-speech tagging and sequence labelling are performed on the corpus to obtain training text strings and test text strings. The labelled training text strings serve as training data, and the labelled test text strings serve as test data.

In this application, when the CRF model is trained, specific features of the training data are obtained according to feature templates, and training is then performed on the specific features together with the part-of-speech tagging and sequence labelling results to obtain the CRF model. The specific features include context features, part-of-speech features and the like.
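As a rough illustration of what a feature template yields, the following minimal sketch builds context and part-of-speech features for one token. The function name, the feature names, the window size and the tag labels ("n", "nr", "v") are assumptions made for illustration; they are not the templates or the toolkit syntax used in this application.

```python
from typing import Dict, List, Tuple

def token_features(sentence: List[Tuple[str, str]], i: int, window: int = 1) -> Dict[str, str]:
    """Build context and part-of-speech features for token i.

    `sentence` is a list of (word, pos_tag) pairs. Feature names and the
    window size are illustrative assumptions.
    """
    features: Dict[str, str] = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(sentence):
            word, pos = sentence[j]
        else:
            word, pos = "<PAD>", "<PAD>"  # marker for positions outside the sentence
        features[f"word[{offset}]"] = word  # context feature
        features[f"pos[{offset}]"] = pos    # part-of-speech feature
    return features

# Hypothetical tagged sentence; the tag labels are placeholders.
sentence = [("victim", "n"), ("Ni Chengang", "nr"), ("reported", "v")]
features = token_features(sentence, 1)
```

Each token's feature dictionary, together with its sequence label, would then form one training instance for the CRF toolkit.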

In this application, after the CRF model is trained, the training result is tested with the test data. When the F value of the recognition result is below 0.8, training data and test data are obtained again and training continues; after training, the newly obtained test data are used for testing. While the F value remains below 0.8, these steps are repeated until the F value of the training result reaches 0.8 or above, at which point training stops and the first recognition model is obtained.

In this embodiment, the named entities recognized by the first recognition model are marked with first position information.

In this embodiment, the step of recognizing the named entities in the corpus using the second recognition model to obtain the second recognition result comprises:

In this application, the at least two recognition models include word segmentation models and named entity recognition models.

In this application, the word segmentation models include an nGram segmentation model (first-order Markov chain), an HMM segmentation model (hidden Markov model), and a segmentation model with a new word discovery function.

In this application, the named entity recognition models include a named entity recognition model based on maximum entropy and a named entity recognition model based on a structured perceptron.
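The stopping test above can be sketched as follows, assuming that the F value denotes the balanced F1 measure computed over exactly matching entities; this application does not define the F value further. The entity layout (text, start, end) and the sample offsets are hypothetical.

```python
from typing import Set, Tuple

Entity = Tuple[str, int, int]  # (text, start, end); this layout is an assumption

def f_value(predicted: Set[Entity], gold: Set[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact-match comparison."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

F_THRESHOLD = 0.8  # threshold stated in the description

# Hypothetical test data: one entity is found correctly, one is wrong.
gold = {("Ni Chengang", 7, 18), ("Qinghe", 40, 46)}
predicted = {("Ni Chengang", 7, 18), ("Oak", 47, 50)}
precision, recall, f1 = f_value(predicted, gold)
keep_training = f1 < F_THRESHOLD  # True: obtain new data and train again
```

In this sketch, training would continue with newly obtained data for as long as `keep_training` is true.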

In this application, the nGram segmentation model first gathers nGram statistics and then, according to those statistics, segments the corpus in which named entities are to be recognized. This method covers all possibilities but also increases the number of index entries. For example, 2-gram segmentation splits the phrase rendered here as "coming into search engine" into the overlapping units "coming into", "into searching", "search for", "index" and "engine".
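The enumeration of overlapping 2-gram units can be sketched as below. The Chinese string is an assumed reconstruction of the example phrase, inferred from the five translated units; the sketch only lists candidate units and does not include the statistics that the nGram segmentation model uses to choose among them.

```python
from typing import List

def char_ngrams(text: str, n: int = 2) -> List[str]:
    """Return all overlapping character n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Assumed original-language form of the example phrase (six characters).
bigrams = char_ngrams("走进搜索引擎")
```

A six-character phrase yields five overlapping units, which illustrates why this kind of segmentation increases the number of index entries.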

In this application, the HMM segmentation model obtains the parameters of the HMM from a labelled segmentation training set and then decodes the corpus in which named entities are to be recognized using the Viterbi algorithm, obtaining the segmentation result. This model is based on the output independence assumption and does not consider contextual features.
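A minimal sketch of Viterbi decoding for character-level segmentation follows, assuming the common four-state tagging B/M/E/S (begin, middle, end, single-character word). The state set and all probabilities are hand-picked toy values for illustration; in this application the parameters are estimated from a labelled segmentation training set.

```python
import math
from typing import Dict, List

STATES = "BMES"  # begin, middle, end of a word; single-character word

def _log(p: float) -> float:
    return math.log(p) if p > 0 else float("-inf")

def viterbi(obs: str,
            start: Dict[str, float],
            trans: Dict[str, Dict[str, float]],
            emit: Dict[str, Dict[str, float]],
            unseen: float = 1e-6) -> List[str]:
    """Return the most probable B/M/E/S tag sequence for the characters of `obs`."""
    if not obs:
        return []
    # score[t][s]: best log-probability of a tag path ending in state s at position t
    score = [{s: _log(start.get(s, 0.0)) + _log(emit[s].get(obs[0], unseen)) for s in STATES}]
    back: List[Dict[str, str]] = []
    for ch in obs[1:]:
        prev, cur, ptr = score[-1], {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + _log(trans[p].get(s, 0.0)))
            cur[s] = prev[best] + _log(trans[best].get(s, 0.0)) + _log(emit[s].get(ch, unseen))
            ptr[s] = best
        score.append(cur)
        back.append(ptr)
    # A sentence cannot stop inside a word, so the last tag must be E or S.
    last = max("ES", key=lambda s: score[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Toy parameters chosen by hand for illustration only.
start = {"B": 0.6, "S": 0.4}
trans = {
    "B": {"M": 0.3, "E": 0.7},
    "M": {"M": 0.4, "E": 0.6},
    "E": {"B": 0.5, "S": 0.5},
    "S": {"B": 0.5, "S": 0.5},
}
emit = {s: {} for s in STATES}  # empty tables: every character gets the `unseen` probability
tags = viterbi("abcd", start, trans, emit)
```

Because each emission depends only on the current state, the sketch also reflects the output independence assumption mentioned above: no feature of neighbouring characters enters the score.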

In this application, the segmentation model with a new word discovery function discovers named entities in the corpus through rule-based or statistical recognition models, but it depends rather heavily on the training corpus.

In this application, the named entity recognition model based on maximum entropy selects, among all models that satisfy the constraints, the model with maximum information entropy. By setting constraints, the model's fit to unknown data and its fit to known data can be adjusted, and the model also handles parameter smoothing in statistical models naturally. However, its computational cost in time and space is relatively large, and its data sparsity problem is relatively severe.

In this application, feature extraction in the named entity recognition model based on a structured perceptron takes the global structured output into account, so that the model can perform global structured learning.

In this embodiment, the named entities recognized by each of the at least two recognition models are marked with second position information.

In this application, the output condition is that, in the sub-recognition result list, the number of occurrences of an identical named entity reaches a preset value. Whether the named entities recognized by the different recognition models are identical is judged by whether their second position information is identical. The preset value is the mode of the at least two recognition models.

Therefore, fusing the recognition results of the above models compensates for the inherent shortcomings of each individual model, so that the recognition result is optimal.
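The output condition can be sketched as a vote over the sub-recognition result list. The sketch assumes that position information is a (start, end) character span and reads the condition as "an entity is output once the number of models reporting it reaches the preset value"; the model names, the offsets and the preset value of 2 are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Entity = Tuple[str, int, int]  # (text, start, end); the span stands in for position information

def vote(sub_results: Dict[str, List[Entity]], preset: int) -> List[Entity]:
    """Keep the entities reported by at least `preset` recognition models."""
    counts: Counter = Counter()
    for entities in sub_results.values():
        counts.update(set(entities))  # each model votes at most once per entity
    kept = [entity for entity, n in counts.items() if n >= preset]
    return sorted(kept, key=lambda e: e[1])

# Hypothetical sub-recognition result list from three models.
sub_results = {
    "precise_segmentation": [("Ni Chen", 7, 14)],
    "structured_perceptron": [("Ni Chengang", 7, 18)],
    "new_word_discovery": [("Ni Chengang", 7, 18)],
}
second_result = vote(sub_results, preset=2)
```

Two entities count as identical here only when text and span both match, so a shorter candidate with the same start offset does not share votes with the longer one.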

In this application, the mode is determined by the F value of experimental results. As shown in Experimental Example 1 of this application, when the precise segmentation algorithm (which combines a language model, sequence labelling and a hidden Markov model), the segmentation algorithm with a new word discovery function, and the structured perceptron named entity recognition algorithm are used, a mode of 3 gives the best result.

The applicant has found that using the output condition to decide whether to output a recognition result from the sub-recognition result list removes misrecognized results, such as wrong recognitions, to the greatest extent, thereby improving the recall of the final recognition result.

The applicant has found that recognizing the corpus with at least two recognition models identifies named entities more accurately: multiple weak recognition models are combined into one strong recognition model that supplements the base result, which in turn improves the recognition result.

In another preferred embodiment of this application, the step of recognizing the named entities in the corpus using the first recognition model to obtain the first recognition result may also be:

In this embodiment, the named entities recognized by each of the at least two recognition models are marked with first position information.

The output condition is that, in the sub-recognition result list, the number of occurrences of an identical named entity reaches a preset value. Whether the named entities recognized by the different recognition models are identical is judged by whether their first position information is identical. The preset value is the mode of the at least two recognition models.

In this embodiment, the second recognition model is a conditional random field model.

In this embodiment, the second recognition result is marked with second position information.

In this application, the step of fusing the first recognition result and the second recognition result to obtain the third recognition result comprises:

judging whether the first recognition result and the second recognition result meet a fusion condition, fusing them if so, and outputting the fused result, namely the third recognition result.

The applicant has found that fusing the first recognition result with the second recognition result amounts to removing the named entities duplicated between the two, which avoids data redundancy and in turn improves the accuracy and recall of recognition.

In this application, fusion means adding, on the basis of the first recognition result, the named entities newly added in the second recognition result.

In this application, the fusion condition is that the second recognition result contains named entities that are new relative to the first recognition result.

In a preferred embodiment of this application, whether the second position information and the first position information are identical is judged; if they differ, the named entity is judged to be a newly added named entity in the second recognition result.
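The fusion step can be sketched as below, assuming again that position information is a (start, end) character span. An entity of the second recognition result is treated as newly added when its span does not occur in the first recognition result; the function name and the sample offsets are hypothetical.

```python
from typing import List, Tuple

Entity = Tuple[str, int, int]  # (text, start, end); the span stands in for position information

def fuse(first: List[Entity], second: List[Entity]) -> List[Entity]:
    """Add to `first` the entities of `second` whose positions are new."""
    known = {(start, end) for _, start, end in first}
    fused = list(first)
    for text, start, end in second:
        if (start, end) not in known:  # fusion condition: a newly added entity
            fused.append((text, start, end))
            known.add((start, end))
    return fused

# Hypothetical recognition results for one input sentence.
first_result = [("Ni Chengang", 7, 18)]
second_result = [("Ni Chengang", 7, 18), ("Qinghe", 40, 46)]
third_result = fuse(first_result, second_result)
```

Entities found by both models appear once in the fused result, which corresponds to the removal of duplicates described above.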

Optionally, the semantic mining system labels each named entity in the third recognition result with a role and outputs each named entity with its role.

In this application, the semantic mining system can not only assign roles but also judge a named entity recognition result to determine whether it is indeed a named entity.

The semantic mining system comprises regular expressions and text.

To aid understanding of the named entity recognition method based on multi-strategy fusion described in this application, a specific embodiment is given below.

A corpus library is established.

Part-of-speech tagging and sequence labelling are performed on the corpus in the corpus library, that is, on each clause in the corpus library. In sequence labelling, the words corresponding to a named entity are labelled with B, M and E, and the remaining words are labelled with S, yielding training text strings. Suppose one training text string is "Upon inspection by the police, an identity card of Xu Sanguan was found in the satchel"; its labelling result is shown in Table 1.

Table 1: Example of text string labelling

The labelling results of a large number of training text strings are used as training data, and training is performed with CRF.

Suppose the user input corpus currently received is "Victim Ni Chengang called the police, claiming that he found his mobile phone missing at Qinghe Oak Bay". Using the CRF model obtained in the preceding steps to perform named entity recognition on this input yields the named entity "Ni Chengang".

The CRF result is then supplemented and corrected by ensemble learning over several methods. For instance, precise segmentation recognizes the named entity in the above example as "Ni Chen", the structured perceptron recognizes it as "Ni Chengang", and the model with the new word discovery function recognizes it as "Ni Chengang". Taking the mode of the results of these methods determines that the named entity recognition result is "Ni Chengang" rather than "Ni Chen".

Through the regular expressions or text present in the semantic mining system, such as "victim reports to the police", it can be determined on the one hand that "Ni Chengang" is a correct named entity recognition result, and on the other hand that its role is "victim".
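The B/M/E/S labelling described above can be sketched as follows. The tokenization of the example sentence, the inclusive span format and the handling of a one-token entity (tagged "B" here) are assumptions; the description does not cover these details, and the contents of Table 1 are not reproduced here.

```python
from typing import List, Tuple

def bmes_tags(tokens: List[str], entity_spans: List[Tuple[int, int]]) -> List[str]:
    """Tag tokens inside an entity span with B/M/E and all other tokens with S.

    Each span is (first_index, last_index), inclusive. A one-token entity is
    tagged "B", which is an assumption of this sketch.
    """
    tags = ["S"] * len(tokens)
    for first, last in entity_spans:
        tags[first] = "B"
        for i in range(first + 1, last):
            tags[i] = "M"
        if last > first:
            tags[last] = "E"
    return tags

# Hypothetical tokenization of part of the example training text string.
tokens = ["satchel", "contains", "Xu", "San", "guan", "identity card"]
tags = bmes_tags(tokens, [(2, 4)])
```

The resulting token and tag pairs, together with part-of-speech tags, would make up one labelled training text string.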
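Role assignment by regular expression can be sketched as below, using an English rendering of the example sentence. The rule table, the pattern and the function name are hypothetical; the actual expressions of the semantic mining system are not given in this application.

```python
import re
from typing import Dict, Optional

# Hypothetical rule table: role name -> pattern template with an {entity} slot.
ROLE_RULES: Dict[str, str] = {
    "victim": r"\bvictim\s+{entity}\b.*\bcalled the police\b",
}

def assign_role(text: str, entity: str) -> Optional[str]:
    """Return the first role whose pattern matches around `entity`, else None."""
    for role, template in ROLE_RULES.items():
        pattern = template.format(entity=re.escape(entity))
        if re.search(pattern, text, flags=re.IGNORECASE):
            return role
    return None

text = ("Victim Ni Chengang called the police, claiming that he found "
        "his mobile phone missing at Qinghe Oak Bay")
role = assign_role(text, "Ni Chengang")
```

A match serves two purposes in this sketch: it supports the candidate as a genuine named entity and it supplies the role. A return value of None leaves the candidate unconfirmed by these rules.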

According to a second aspect of this application, as shown in Fig. 3, a named entity recognition device based on multi-strategy fusion is also provided, the device comprising:

a corpus obtaining unit 201, configured to obtain a corpus;

a first recognition unit 202, configured to recognize the named entities in the corpus using a first recognition model to obtain a first recognition result;

a second recognition unit 203, configured to recognize the named entities in the corpus using a second recognition model to obtain a second recognition result;

a recognition result fusion unit 204, configured to fuse the first recognition result and the second recognition result to obtain a third recognition result;

Optionally, the device further comprises a role assignment unit 205, configured to assign roles to the third recognition result using a semantic mining system to generate named entities with roles.

In an optional embodiment of this application, the first recognition model is a conditional random field model.

establishing a corpus library;

Optionally, the second recognition unit comprises the following subunit:

a recognition result output unit, configured to judge whether the recognition results in the sub-recognition result list meet the output condition, and to output the second recognition result if so;

In another optional embodiment of this application, the first recognition unit comprises the following subunit:

Optionally, the second recognition model is a conditional random field model;

establishing a corpus library;

Optionally, the fusion condition is that the second recognition result and the first recognition result contain identical named entities.

Optionally, the semantic mining system comprises regular expressions and text.

Fig. 4 shows a block diagram of a computer system 400 on which the embodiments can be implemented. The computer system 400 includes a processor 410, a storage medium 420, a system memory 430, a monitor 440, a keyboard 450, a mouse 460, a network interface 420 and a video adapter 480. These components are coupled through a system bus 490.

The storage medium 420 (for example, a hard disk) stores a number of programs, including an operating system, application programs and other program modules. A user can enter commands and information into the computer system 400 through input devices such as the keyboard 450, a touchpad (not shown) and the mouse 460. The monitor 440 is used to display text and graphical information.

The operating system runs on the processor 410 and is used to coordinate and provide control of the various components in the personal computer system 400 in Fig. 4. In addition, computer programs can be used on the computer system 400 to implement the various embodiments described above.

It will be recognized that the hardware components shown in Fig. 4 are for illustrative purposes only, and the actual components may vary depending on the computing device deployed to implement the application.

In addition, the computer system 400 may be, for example, a desktop computer, a server computer, a laptop computer or a wireless device such as a mobile phone, a personal digital assistant (PDA) or a handheld computer.

The embodiments provide an effective method for extracting named entities from a given document collection. The embodiments solve the problem of extracting entities of any type from generally organized web pages at minimal cost. The proposed weighted named entity graph can encode the complex relationships between each named entity and the types of other entities, so propagating seed confidence over the graph can compensate for the lack of web-scale redundancy and can support effective organization-scale extraction. Furthermore, confidence propagation on the named entity graph can be transformed into efficient matrix computation, which supports efficient extraction over large-scale collections.

It will be recognized that embodiments within the scope of the application can be embodied in the form of a computer program product that includes computer-executable instructions, such as program code, which can run in any suitable computing environment in conjunction with any suitable operating system, for example Microsoft Windows, Linux or UNIX. Embodiments within the scope of the application may also include a program product comprising a computer-readable medium for carrying or storing computer-executable instructions or data structures. Such a computer-readable medium can be any available medium that can be accessed by a general-purpose or special-purpose computer. For example, such a computer-readable medium may include RAM, ROM, EPROM, EEPROM, CD-ROM, magnetic disk storage or other storage devices, or any other medium that can be used to carry or store desired program code in the form of computer-executable instructions and that can be accessed by a general-purpose or special-purpose computer.

Experimental example

Experimental example 1: Influence of the mode value on the F value in the second identification

In this experimental example, different preset values used in the second identification step lead to significantly different final named entity recognition results; this experimental example investigates the influence of the preset value on the named entity recognition result.

The preset value is the mode of the at least two identification models;

The named entity recognition result is measured by the F value; the higher the F value, the more reliable the recognition result, wherein

Accuracy rate (P) = number of correctly recognized named entities / number of named entities recognized by the machine,

Recall rate (R) = number of correctly recognized named entities / number of named entities in the test corpus,

F value = 2*P*R / (P+R).
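The three measures above can be computed directly from the set of entities recognized by the machine and the set of entities annotated in the test corpus. The following is a minimal sketch under stated assumptions: the function name `precision_recall_f`, the representation of an entity as a (text, type) pair and the sample data are all hypothetical and are not taken from the application.

```python
def precision_recall_f(predicted, gold):
    """Return (P, R, F) for machine-recognized entities versus test-corpus entities.

    Both arguments are iterables of hashable entities; here each entity is a
    hypothetical (text, type) pair.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # correctly recognized named entities
    p = correct / len(predicted) if predicted else 0.0  # accuracy rate (P)
    r = correct / len(gold) if gold else 0.0            # recall rate (R)
    f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0     # F value
    return p, r, f


# Hypothetical sample data: 4 annotated entities, 3 recognized, 2 of them correct.
gold = {("Beijing", "LOC"), ("Alice", "PER"), ("Acme Corp", "ORG"), ("2017", "TIME")}
predicted = {("Beijing", "LOC"), ("Alice", "PER"), ("Acme", "ORG")}

p, r, f = precision_recall_f(predicted, gold)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

With this sample, P is 2/3 and R is 2/4, so the F value is 4/7. The guards against empty sets are a design choice of the sketch so that an empty recognition result yields zeros rather than a division error.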

In this experimental example, the identification models used in the second identification include a precise word segmentation algorithm, a word segmentation algorithm with a new word discovery function, and a structured perceptron named entity recognition algorithm, wherein

Precise word segmentation is a word segmentation algorithm that combines a language model, sequence labeling and a hidden Markov model; preferably, coarse segmentation is first performed using an N-gram model and a hidden Markov model, and a CRF is then used for fine segmentation;

The word segmentation algorithm with a new word discovery function finds new words in the text by means of rule-based or statistical identification models;

The structured perceptron is used to solve sequence labeling problems.

The results of this experimental example are shown in Fig. 5 and Table 1.

Table 1: Influence of the preset value on the named entity recognition result

In Fig. 5, polyline A shows the recall rate for each preset value; polyline B shows the F value for each preset value; polyline C shows the accuracy rate for each preset value.

As can be seen from Fig. 5 and Table 1, in this experimental example the F value reaches its maximum when the mode value is 3.
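The role of the preset value in the second identification can be illustrated with a small voting sketch: each identification model contributes one sub-recognition result, and an entity is output only when the number of sub-recognition results containing it reaches the preset value. This is an illustration under assumptions, not the implementation of the application: the function name `second_identification`, the three sample sub-recognition results, and the reading of the "mode" as the number of models that agree on an entity are all hypothetical.

```python
from collections import Counter


def second_identification(sub_results, preset_value):
    """Return the entities found by at least `preset_value` identification models.

    `sub_results` is the sub-recognition result list: one collection of
    entities per identification model (hypothetical representation).
    """
    counts = Counter()
    for result in sub_results:
        counts.update(set(result))  # each model counts at most once per entity
    return {entity for entity, n in counts.items() if n >= preset_value}


# Hypothetical sub-recognition results from three identification models.
model_a = ["Beijing", "Alice", "Acme"]
model_b = ["Beijing", "Alice"]
model_c = ["Beijing", "Acme", "Bob"]
sub_recognition_results = [model_a, model_b, model_c]

loose = second_identification(sub_recognition_results, preset_value=2)
strict = second_identification(sub_recognition_results, preset_value=3)
print(sorted(loose), sorted(strict))
```

A lower preset value admits more entities (favoring recall rate), while a higher one keeps only entities that more models agree on (favoring accuracy rate), which is consistent with the F value depending on the preset value in this experimental example.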

Experimental example 2: Named entity recognition results when each identification model is used alone

This experimental example tests the named entity recognition results obtained when a single identification model is used alone, in order to compare the reliability of two kinds of named entity recognition methods: single identification model and multi-identification-model fusion.

The identification models used in this experimental example are the CRF identification model used in the preliminary identification and, from the second identification, the precise word segmentation algorithm, the word segmentation algorithm with a new word discovery function and the structured perceptron named entity recognition algorithm; the results are shown in Fig. 6 and Table 2.

Table 2: Reliability of the single-identification-model named entity recognition methods

In Fig. 6, polyline A shows the recall rate for each recognition method; polyline B shows the F value for each recognition method; polyline C shows the accuracy rate for each recognition method.

As can be seen from Fig. 6 and Table 2, compared with the named entity recognition method of multi-identification-model fusion (i.e. the multi-strategy fusion named entity recognition method; the result of experimental example 1 with a mode value of 3), the single-identification-model named entity recognition methods have lower F values; that is, the named entity recognition results obtained with the multi-identification-model fusion method provided by the application are more stable and reliable.

Experimental example 3: Named entity recognition results of each identification model in the method of the application

This experimental example uses the method provided by the application and calculates the accuracy rate, recall rate and F value of the first recognition result, the second recognition result and the third recognition result respectively; the results are shown in Table 3 below.

Table 3: Named entity recognition results of each identification model in the method of the application

As shown in Table 3, with the method provided by the application, the third recognition result obtained on the basis of the first recognition result and the second recognition result shows a relatively large improvement in accuracy rate, recall rate and F value; that is, the method provided by the application can cope with new situations such as massive data scale, diversified entity types and the continual emergence of new words, and has a higher recall rate and accuracy rate.
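How the third recognition result is built from the first two can be sketched as follows, following the definition of fusion given in the claims: the named entities newly present in the second recognition result are added on the basis of the first recognition result. The function name `fuse` and the sample entities are hypothetical, and returning the first recognition result unchanged when the second contributes nothing new is an assumption of this sketch, since the application does not state what is output in that case.

```python
def fuse(first_result, second_result):
    """Return the third recognition result as an ordered list of entities.

    Entities of the first recognition result are kept in order; entities that
    appear only in the second recognition result are appended once each.
    """
    fused = list(first_result)
    seen = set(first_result)
    for entity in second_result:
        if entity not in seen:  # entity is new relative to the first result
            fused.append(entity)
            seen.add(entity)
    return fused


# Hypothetical recognition results.
first_recognition_result = ["Beijing", "Alice"]
second_recognition_result = ["Alice", "Acme", "Acme"]

third_recognition_result = fuse(first_recognition_result, second_recognition_result)
print(third_recognition_result)
```

Because the second recognition result can only add entities, the fused result never loses anything found by the first identification model, which matches the description of the second identification as a supplement to the first recognition result.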

The multi-strategy fusion named entity recognition method and recognition device provided by the application have the following beneficial effects:

(1) The scheme provided by the application can, through the preliminary identification step, perform named entity recognition on new data or new domains, thereby meeting the demand for named entity recognition when data are massive, entity types are diversified and new words emerge continually;

(2) The second identification step fuses the named entity recognition methods of multiple identification models, combining several weak identification models into one strong identification model that supplements the first recognition result, thereby improving the accuracy rate and recall rate of the recognition result;

(3) Role marking is performed, using the semantic digging system, on the named entities obtained by the second identification, so as to obtain named entities with assigned roles;

(4) The method provided by the application can be easily migrated to new data and new domains;

(5) The method provided by the application has a high accuracy rate and recall rate, with an F value of 0.8 or above.

The application has been described in detail above in combination with specific embodiments and illustrative examples, but these descriptions should not be understood as limiting the application. Those skilled in the art will understand that, without departing from the spirit and scope of the application, various equivalent substitutions, modifications or improvements may be made to the technical solution of the application and its embodiments, all of which fall within the scope of the application. The protection scope of the application is determined by the appended claims.

Claims

1. A multi-strategy fusion named entity recognition method, characterized by comprising:

Obtaining a corpus;

Fusing the first recognition result and the second recognition result to obtain a third recognition result;

The step of identifying the named entities in the corpus using the second identification model to obtain the second recognition result comprises:

Identifying the named entities in the corpus using at least two identification models, each identification model yielding one sub-recognition result, and generating a sub-recognition result list;

Judging whether the recognition results in the sub-recognition result list meet an output condition, and outputting the second recognition result if so;

The output condition is that, in the sub-recognition result list, the number of occurrences of an identical named entity reaches a preset value, wherein the preset value is the mode of the at least two identification models;

The at least two identification models include word segmentation models and named entity recognition models, wherein the word segmentation models include an N-gram word segmentation model, an HMM word segmentation model and a word segmentation model with a new word discovery function, and the named entity recognition models include a maximum-entropy-based named entity recognition model and a structured-perceptron-based named entity recognition model;

The step of fusing the first recognition result and the second recognition result to obtain the third recognition result comprises:

Judging whether the first recognition result and the second recognition result meet a fusion condition, fusing them if so, and outputting the fused result, i.e. the third recognition result;

The fusion refers to adding, on the basis of the first recognition result, the named entities newly added in the second recognition result;

The fusion condition is that the second recognition result contains named entities that are new relative to the first recognition result;

After obtaining the third recognition result, the method further comprises:

Performing role assignment on the third recognition result using a semantic digging system to generate named entities with roles, wherein

The role assignment is to perform role marking on each named entity in the third recognition result using the semantic digging system, and to output each named entity with its role;

The semantic digging system includes regular expressions and text.

2. The recognition method according to claim 1, characterized in that

The first identification model is a conditional random field model;

Before the step of identifying the named entities in the corpus using the first identification model to obtain the first recognition result, the method further comprises:

Establishing a corpus;

Using the annotated corpus as training data, training with a CRF toolkit to obtain the first identification model.

3. A multi-strategy fusion named entity recognition device, characterized in that the named entity recognition device comprises:

A corpus acquiring unit, for obtaining a corpus;

A first recognition unit, for identifying the named entities in the corpus using a first identification model to obtain a first recognition result;

A second recognition unit, for identifying the named entities in the corpus using a second identification model to obtain a second recognition result;

A recognition result fusion unit, for fusing the first recognition result and the second recognition result to obtain a third recognition result;

The second recognition unit comprises the following subunits:

A multi-strategy recognition unit, for identifying the named entities in the corpus using at least two identification models, each identification model yielding one sub-recognition result, and generating a sub-recognition result list;

A recognition result output unit, for judging whether the recognition results in the sub-recognition result list meet an output condition, and outputting the second recognition result if so;

The recognition result fusion unit is used to judge whether the first recognition result and the second recognition result meet a fusion condition, to fuse them if so, and to output the fused result, i.e. the third recognition result;

The named entity recognition device further comprises a role assignment unit, for performing role assignment on the third recognition result using a semantic digging system to generate named entities with roles, wherein

The role assignment unit is used to perform role marking on each named entity in the third recognition result using the semantic digging system, and to output each named entity with its role;

The semantic digging system includes regular expressions and text.

4. The recognition device according to claim 3, characterized in that

The first identification model is a conditional random field model;

The first recognition unit further comprises a model training unit, the model training unit being used for:

Establishing a corpus;

Using the annotated corpus as training data, training with a CRF toolkit to obtain the first identification model.