CN107330011A

CN107330011A - Multi-strategy fusion method and device for named entity recognition

Info

Publication number: CN107330011A
Application number: CN201710447439.2A
Authority: CN
Inventors: 赵红红; 王萌萌; 晋耀红; 蒋宏飞; 杨凯程
Original assignee: China Science And Technology (beijing) Co Ltd; Beijing Shenzhou Taiyue Software Co Ltd
Current assignee: Dingfu Intelligent Technology Co., Ltd
Priority date: 2017-06-14
Filing date: 2017-06-14
Publication date: 2017-11-07
Anticipated expiration: 2037-06-14
Also published as: CN107330011B

Abstract

This application discloses a multi-strategy fusion method and device for named entity recognition. A first recognition model recognizes the named entities in the acquired corpus text to obtain a first recognition result. In the method provided by this application, the corpus used by the first recognition model can be updated and expanded, so that newly emerging named entities in the corpus text can be recognized and the first recognition result has higher precision. A fusion of multiple recognition models then recognizes the named entities in the corpus text to obtain a second recognition result. The first and second recognition results are fused into a third recognition result, and a semantic mining system assigns roles to the third recognition result and outputs named entities with roles. Named entities are thus recognized reliably even when data volumes are massive, entity types are diverse, and new words keep emerging, and roles are assigned to the recognized named entities.

Description

Multi-strategy fusion method and device for named entity recognition

Technical field

This application relates to the field of natural language processing, and in particular to a multi-strategy fusion method and device for named entity recognition.

Background

Named entities are person names, organization names, place names, and all other entities identified by a name. They are the basic information elements of a text, important carriers of information, and the foundation for correctly understanding and processing textual information. Chinese named entity recognition is one of the basic tasks in natural language processing. Its main task is to identify and classify the named entities and meaningful numeral-classifier phrases that occur in a text, mainly including person names, place names, organization names, temporal expressions, dates, and numerical expressions.

In natural language processing research, named entity recognition plays an important role in application fields such as information retrieval, information extraction, machine translation, and text classification. It can significantly improve the performance of application systems for information retrieval, abstract extraction, information extraction, machine translation, and text classification, and it lays the foundation for automatically acquiring knowledge from text. The precision and recall of named entity recognition directly determine the performance of the whole language understanding process, including syntactic analysis and semantic analysis.

Over the past decade, scholars in China and abroad have studied entity recognition in text extensively and in depth. With the rapid development of the Internet, however, large amounts of informal, multi-domain text data keep accumulating, which places new demands on the precision and recall of named entity recognition. In addition, the market also needs roles to be assigned to the recognized named entities. Therefore, whether to meet market demand or to improve the precision and recall of recognition, named entity recognition methods need further improvement.

Commonly used named entity recognition methods currently fall into two major classes: methods based on rules and knowledge, and methods based on statistics. Rule- and knowledge-based methods were the earliest to be used. They are simple and convenient, but they require a large amount of manual observation and port poorly. Statistical methods either treat named entity recognition as a classification problem and apply classifiers such as support vector machines or Bayesian models, or treat it as a sequence labeling problem and learn a sequence labeling model with machine learning methods such as hidden Markov models, maximum entropy Markov models, and conditional random fields. However, these methods either struggle to recognize named entities in today's large volumes of informal, multi-domain, rapidly changing text, or have low precision and recall.

For example, Chinese patent CN201610943210.3 discloses a named entity recognition method and device based on artificial intelligence. The method performs named entity recognition on the text to be recognized using both a conditional random field model and a function model generated from the retrieval logs within a preset time period. The drawback of this scheme lies in the preset entity vocabulary function model used in its second recognition: it first obtains all candidate named entity words in the text to be recognized by methods such as dictionary lookup and rule matching, and then judges the confidence that each candidate is a named entity word. Rule-based methods tend to depend on the specific language, domain, and text format; compiling the rules is time-consuming and error-prone and requires experienced linguists; and dictionary coverage is relatively low. This method therefore struggles to perform named entity recognition on today's large volumes of informal, multi-domain, rapidly changing text.

As another example, Chinese patent CN201510889318.4 discloses a named entity recognition method suitable for social networks. The method uses an initially constructed first sequence labeling model to obtain a first entity probability distribution of the training documents and a second entity probability distribution of the test documents, extracts similarity features from social network information, trains a second sequence labeling model based on the similarity features, and then labels the test documents with the second sequence labeling model to obtain the named entity recognition result. The precision and recall of this method are ultimately low; its recognition F value is only 33.19%.

Therefore, there is an urgent need for a named entity recognition method and device that can cope with the new situation of massive data, diverse entity types, and constantly emerging new words, that has higher recall and precision, and that can also assign roles to the recognized named entities.

Summary of the invention

This application provides a multi-strategy fusion method and device for named entity recognition, to solve the problems that, when data volumes are massive, entity types are diverse, and new words keep emerging, the precision and recall of named entity recognition are low and roles cannot be assigned to named entities.

In a first aspect, this application provides a multi-strategy fusion method for named entity recognition. The recognition method includes:

Acquiring corpus text;

Recognizing the named entities in the corpus text using a first recognition model to obtain a first recognition result;

Recognizing the named entities in the corpus text using a second recognition model to obtain a second recognition result;

Fusing the first recognition result and the second recognition result to obtain a third recognition result.

Optionally, the first recognition model is a conditional random field model.

Optionally, before the step of recognizing the named entities in the corpus text using the first recognition model to obtain the first recognition result, the method further includes:

Building a corpus;

Performing part-of-speech tagging and sequence labeling on the corpus text in the corpus;

Using the labeled corpus text as training data, training with a CRF toolkit to obtain the first recognition model.

Optionally, the step of recognizing the named entities in the corpus text using the second recognition model to obtain the second recognition result includes:

Recognizing the corpus text using at least two recognition models, each of which produces one sub-recognition result, and generating a sub-recognition result list;

Judging whether the recognition results in the sub-recognition result list meet an output condition, and outputting the second recognition result if they do;

The output condition is that, in the sub-recognition result list, the number of identical named entities reaches a preset value, where the preset value is the mode of the at least two recognition models.

Optionally, the step of recognizing the named entities in the corpus text using the first recognition model to obtain the first recognition result includes:

Judging whether the recognition results in the sub-recognition result list meet the output condition, and outputting the first recognition result if they do;

The second recognition model is a conditional random field model;

Before the step of recognizing the corpus text using the second recognition model to obtain the second recognition result, the method further includes:

Building a corpus;

Using the labeled corpus text as training data, training with a CRF toolkit to obtain the second recognition model.

The step of fusing the first recognition result and the second recognition result to obtain the third recognition result includes:

Judging whether the first recognition result and the second recognition result meet a fusion condition, fusing them if they do, and outputting the fused result, namely the third recognition result;

Optionally, the fusion condition is that the first recognition result and the second recognition result contain an identical named entity.

Optionally, after the third recognition result is obtained, the method further includes: assigning roles to the third recognition result using a semantic mining system to generate named entities with roles.

Optionally, the role assignment uses the semantic mining system to mark a role on each named entity in the third recognition result and to output each named entity with its role.

Optionally, the semantic mining system includes regular expressions and text.

In a second aspect, this application also provides a multi-strategy fusion named entity recognition device. The named entity recognition device includes:

A corpus text acquisition unit, for acquiring corpus text;

A first recognition unit, for recognizing the named entities in the corpus text using a first recognition model to obtain a first recognition result;

A second recognition unit, for recognizing the named entities in the corpus text using a second recognition model to obtain a second recognition result;

A recognition result fusion unit, for fusing the first recognition result and the second recognition result to obtain a third recognition result.

Optionally, the first recognition model is a conditional random field model.

Optionally, the first recognition unit further includes a model training unit, which is used for:

Building a corpus;

Optionally, the second recognition unit includes the following subunits:

A multi-strategy recognition unit, for recognizing the named entities in the corpus text using at least two recognition models, each of which produces one sub-recognition result, and generating a sub-recognition result list;

A recognition result output unit, for judging whether the recognition results in the sub-recognition result list meet an output condition and outputting the second recognition result if they do.

Optionally, the output condition is that, in the sub-recognition result list, the number of identical named entities reaches a preset value, where the preset value is the mode of the at least two recognition models.

Optionally, the first recognition unit includes the following subunits:

A recognition result output unit, for judging whether the recognition results in the sub-recognition result list meet the output condition and outputting the first recognition result if they do;

Optionally, the second recognition model is a conditional random field model;

The second recognition unit further includes a model training unit, which is used for:

Building a corpus;

Optionally, the recognition result fusion unit is used for judging whether the first recognition result and the second recognition result meet a fusion condition, fusing them if they do, and outputting the fused result, namely the third recognition result.

Optionally, fusion means adding to the first recognition result the named entities that are newly added in the second recognition result;

Optionally, the fusion condition is that the second recognition result contains named entities that are new relative to the first recognition result.

Optionally, the device further includes a role assignment unit, for assigning roles to the third recognition result using a semantic mining system to generate named entities with roles.

Optionally, the role assignment unit is used for marking, with the semantic mining system, a role on each named entity in the third recognition result and outputting each named entity with its role.

Brief description of the drawings

To explain the technical solution of this application more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, a person of ordinary skill in the art can derive other drawings from these drawings without creative effort.

Fig. 1 shows a flowchart of a multi-strategy fusion named entity recognition method provided by an embodiment of this application;

Fig. 2 shows a flowchart of the method for the conditional random field model provided by an embodiment of this application;

Fig. 3 shows a structural diagram of the named entity recognition device provided by an embodiment of this application;

Fig. 4 shows a structural diagram of the computer system 400 provided by an embodiment of this application;

Fig. 5 shows line charts of the precision, recall, and F value results of Experimental Example 1;

Fig. 6 shows line charts of the precision, recall, and F value results of Experimental Example 2.

Detailed description

The application is described in detail below; its features and advantages will become clearer from these descriptions.

The word "exemplary" as used herein means "serving as an example, embodiment, or illustration". Any embodiment described herein as "exemplary" should not necessarily be construed as preferred over or superior to other embodiments. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless otherwise indicated.

The application is described below.

According to a first aspect of this application, a multi-strategy fusion method for named entity recognition is provided. A first recognition model recognizes the named entities in the acquired corpus text to obtain a first recognition result. In the method provided by this application, the corpus used by the first recognition model can be updated and expanded, so that newly emerging named entities in the corpus text can be recognized and the first recognition result has higher precision. A fusion of multiple recognition models then recognizes the named entities in the corpus text to obtain a second recognition result, and the first and second recognition results are fused into a third recognition result, so that named entities are recognized reliably when data volumes are massive, entity types are diverse, and new words keep emerging. Optionally, a semantic mining system then assigns roles to the third recognition result and outputs named entities with roles, so that roles are assigned to the recognized named entities.

Specifically, as shown in Fig. 1, the named entity recognition method includes:

S101: acquire corpus text;

S102: recognize the named entities in the corpus text using the first recognition model to obtain the first recognition result;

S103: recognize the named entities in the corpus text using the second recognition model to obtain the second recognition result;

S104: fuse the first recognition result and the second recognition result to obtain the third recognition result;

Optionally, the method further includes S105: assign roles to the third recognition result using the semantic mining system to generate named entities with roles.
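As a rough orientation, steps S101 to S105 can be sketched as a small Python function. This is a minimal sketch rather than the applicant's implementation: the function `recognize`, its parameter names, and the toy lambda stand-ins for the models are all hypothetical.

```python
def recognize(text, first_model, second_model, fuse, assign_roles=None):
    """Run the S102-S105 steps on one piece of corpus text (S101 is the input)."""
    first = first_model(text)        # S102: first recognition result
    second = second_model(text)      # S103: second recognition result
    third = fuse(first, second)      # S104: third recognition result
    if assign_roles is None:         # S105 is optional
        return third
    return assign_roles(text, third)


# Toy stand-ins (hypothetical) so the sketch can be run end to end.
result = recognize(
    "victim Ni Chengang reported a missing phone in Qinghe",
    first_model=lambda t: ["Ni Chengang"],
    second_model=lambda t: ["Ni Chengang", "Qinghe"],
    fuse=lambda a, b: a + [e for e in b if e not in a],
    assign_roles=lambda t, ents: {
        e: ("victim" if f"victim {e}" in t else None) for e in ents
    },
)
print(result)
```

Passing the models and the fusion rule as callables keeps the sketch independent of any particular CRF or segmentation library.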

In this application, corpus text refers to text used for training or recognition.

In a preferred embodiment of this application, the first recognition model is a conditional random field model, i.e. a CRF (Conditional Random Fields) model. It computes a global probability and normalizes over the global distribution of the data rather than only locally, thereby avoiding the label bias problem.
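For reference, the global normalization mentioned above is usually written as follows for a linear-chain CRF. The notation (feature functions f_k, weights λ_k, label sequence y of length T for input x) is the standard textbook form and is an assumption here; this application does not give a formula.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)
```

Because Z(x) sums over every possible label sequence y', the normalization is performed once per sequence rather than once per position, which is what distinguishes the model from locally normalized models such as maximum entropy Markov models.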

In this application, as shown in Fig. 2, when a CRF model is selected as the first recognition model, the following steps are also performed before the corpus text is recognized using the first recognition model to obtain the first recognition result:

S301: build a corpus;

S302: perform part-of-speech tagging and sequence labeling on the corpus text in the corpus;

S303: using the labeled corpus text as training data, train with a CRF toolkit to obtain the first recognition model.

In this application, the corpus refers to a collection of corpus text in which named entities are to be recognized. For example, in a person name recognition method for a public security system, the corpus is a collection of notes; in a named entity recognition method for a medical system, the corpus is a collection of case records; a corpus not tied to a specific domain can also be a collection of text that a crawler obtains from the network.

In this application, building the corpus includes importing corpus text, that is, importing the corpus text into the corpus described above.

In this application, the corpus text in the corpus is first processed into a form that CRF can recognize; that is, part-of-speech tagging and sequence labeling are performed on the corpus text to obtain training text strings and test text strings. The labeled training text strings serve as training data, and the labeled test text strings serve as test data.

In this application, when the CRF model is trained, specific features of the training data are obtained according to feature templates, and training is then performed on the specific features together with the part-of-speech tagging and sequence labeling results to obtain the CRF model. The specific features include context features, part-of-speech features, and so on.
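The idea of deriving context and part-of-speech features from a template can be sketched as below. This is an illustrative assumption, not the feature template of this application: the window size, the feature names (`w-1`, `p+1`, and so on), the `<PAD>` marker, and the sample tokens and tags are all hypothetical.

```python
def token_features(tokens, pos_tags, i, window=1):
    """Build a feature dict for position i from the token, its POS tag, and its neighbours."""
    feats = {"w0": tokens[i], "p0": pos_tags[i]}
    for d in range(1, window + 1):
        for sign, offset in (("-", -d), ("+", d)):
            j = i + offset
            inside = 0 <= j < len(tokens)
            # Context feature (neighbouring token) and part-of-speech feature.
            feats[f"w{sign}{d}"] = tokens[j] if inside else "<PAD>"
            feats[f"p{sign}{d}"] = pos_tags[j] if inside else "<PAD>"
    return feats


def sentence_features(tokens, pos_tags, window=1):
    """Apply the template to every position of one sentence."""
    return [token_features(tokens, pos_tags, i, window) for i in range(len(tokens))]


# Hypothetical tokens and POS tags, for illustration only.
tokens = ["victim", "Ni", "Chengang", "reports"]
pos_tags = ["n", "nr", "nr", "v"]
features = sentence_features(tokens, pos_tags)
print(features[1])
```

Feature dictionaries of this shape, paired with the sequence labels, are the kind of input that common CRF toolkits accept for training.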

In this application, after the CRF model has been trained, the training result is tested with the test data. When the F value of the recognition result is below 0.8, training data and test data are reacquired and training continues; after training, the newly acquired test data are used for testing. When the F value of the recognition result is still below 0.8, the above steps are repeated until the F value of the training result reaches 0.8 or more, at which point training stops and the first recognition model is obtained.
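The train, test, and repeat loop above can be sketched as follows. Two assumptions are made loudly: the F value is taken to be the usual balanced F1 (harmonic mean of precision and recall), which this text does not define, and the names `f_value`, `train_until_good`, `train_once`, `get_data`, and the `max_rounds` safety limit are hypothetical.

```python
def f_value(predicted, gold):
    """Balanced F1 over sets of recognized entities (assumed definition of the F value)."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


def train_until_good(train_once, get_data, threshold=0.8, max_rounds=10):
    """Reacquire data and retrain until the F value on fresh test data reaches the threshold."""
    for round_no in range(1, max_rounds + 1):
        train_data, test_sentences, gold = get_data()
        model = train_once(train_data)
        score = f_value(model(test_sentences), gold)
        if score >= threshold:
            return model, score, round_no
    raise RuntimeError("F value never reached the threshold")


# Toy stand-ins: the first round misses, the second round is perfect.
gold = [("Ni Chengang", 2, 5), ("Qinghe", 10, 12)]
outputs = iter([[("Ni Chen", 2, 4)], gold])
model, score, rounds = train_until_good(
    train_once=lambda data: (lambda sentences: next(outputs)),
    get_data=lambda: ("train", "test", gold),
)
print(score, rounds)
```

In a real setting `train_once` would wrap a CRF toolkit and `get_data` would draw new labeled text from the corpus; both are left abstract here.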

In this embodiment, the named entities recognized by the first recognition model are marked with first position information.

In this embodiment, the step of recognizing the named entities in the corpus text using the second recognition model to obtain the second recognition result includes:

In this application, the at least two recognition models include word segmentation models and named entity recognition models.

In this application, the word segmentation models include an nGram segmentation model (first-order Markov chain), an HMM segmentation model (hidden Markov model), and a segmentation model with a new word discovery function.

In this application, the named entity models include a named entity recognition model based on maximum entropy and a named entity recognition model based on a structured perceptron.

In this application, the nGram segmentation model first obtains nGram statistics by counting and then segments the corpus text in which named entities are to be recognized according to those statistics. This method covers all possibilities but also increases the number of index entries. For example, 2-gram segmentation divides "coming into search engine" into five overlapping units: come into, enter to search, search for, indexing, engine.
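The 2-gram example reads more naturally on Chinese characters. The sketch below assumes that the original phrase behind the translated example was 走进搜索引擎; that is an inference from the five translated fragments and is not stated in this text. The function name `char_ngrams` is hypothetical.

```python
def char_ngrams(text, n=2):
    """Return all overlapping character n-grams of text, left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Assumed original phrase; see the note above.
phrase = "走进搜索引擎"
bigrams = char_ngrams(phrase)
print(bigrams)
```

A six-character phrase yields five overlapping bigrams, which illustrates why the method covers every candidate word but also enlarges the index.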

In this application, the HMM segmentation model obtains the HMM parameters from a labeled segmentation training set and then decodes the corpus text in which named entities are to be recognized using the Viterbi algorithm to obtain the segmentation result. The model is based on the output independence assumption and does not consider context features.
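A minimal sketch of Viterbi decoding for HMM-based segmentation follows. It assumes the common B/M/E/S character states (begin, middle, end of a word, single-character word) and uses tiny hand-set probabilities purely for illustration; in the method described above the parameters would be estimated from a labeled training set. The names `viterbi`, `tags_to_words`, and the probability floor are hypothetical.

```python
import math

STATES = "BMES"


def viterbi(obs, start_p, trans_p, emit_p, floor=1e-8):
    """Return the most probable B/M/E/S state string for the characters in obs."""
    def lg(p):
        return math.log(p if p > 0 else floor)

    # best[t][s] = (log-probability of the best path ending in state s at t, previous state)
    best = [{s: (lg(start_p.get(s, 0)) + lg(emit_p[s].get(obs[0], 0)), None) for s in STATES}]
    for ch in obs[1:]:
        column = {}
        for s in STATES:
            score, prev = max((best[-1][p][0] + lg(trans_p[p].get(s, 0)), p) for p in STATES)
            column[s] = (score + lg(emit_p[s].get(ch, 0)), prev)
        best.append(column)

    # Backtrack from the best final state.
    state = max(STATES, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return "".join(reversed(path))


def tags_to_words(text, tags):
    """Cut text after every character tagged E or S."""
    words, current = [], ""
    for ch, tag in zip(text, tags):
        current += ch
        if tag in "ES":
            words.append(current)
            current = ""
    return words + ([current] if current else [])


# Hand-set toy parameters (not learned from data).
start_p = {"B": 0.6, "S": 0.4}
trans_p = {"B": {"M": 0.3, "E": 0.7}, "M": {"M": 0.3, "E": 0.7},
           "E": {"B": 0.6, "S": 0.4}, "S": {"B": 0.6, "S": 0.4}}
emit_p = {"B": {"搜": 0.5, "引": 0.5}, "M": {}, "E": {"索": 0.5, "擎": 0.5}, "S": {}}

tags = viterbi("搜索引擎", start_p, trans_p, emit_p)
words = tags_to_words("搜索引擎", tags)
print(tags, words)
```

Each character's emission depends only on its own state, which is the output independence assumption mentioned above and the reason context features are not used.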

In this application, the segmentation model with a new word discovery function finds named entities in the corpus text through a rule-based or statistical recognition model, but it depends heavily on the training corpus.

In this application, the named entity recognition model based on maximum entropy obtains, among all models that satisfy the constraints, the model with the maximum information entropy. By setting the constraints, it can adjust how well the model adapts to unknown data and how closely it fits known data; it also solves the parameter smoothing problem of statistical models in a natural way. However, the model's computational cost in time and space is relatively high, and its data sparsity problem is rather serious.

In this application, in the named entity recognition model based on a structured perceptron, feature extraction takes the global structured output into account, so the model can perform global structured learning.

In this embodiment, the named entities recognized by the at least two recognition models are each marked with second position information.

In this application, the output condition is that, in the sub-recognition result list, the number of identical named entities reaches a preset value. Whether the named entities recognized by the different recognition models are identical is judged by whether their second position information is identical, and the preset value is the mode of the at least two recognition models.
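The output condition can be sketched as a simple vote over the sub-recognition result list. In this sketch an entity is a tuple `(text, start, end)`, two entities count as identical when the whole tuple matches, and the character offsets are hypothetical. The function name `vote` and the choice `preset=2` for three toy models are illustrative assumptions, not values taken from this application.

```python
from collections import Counter


def vote(sub_results, preset):
    """Keep the entities that at least `preset` recognition models agree on."""
    # set(result) ensures one model cannot vote twice for the same entity.
    counts = Counter(item for result in sub_results for item in set(result))
    return sorted(item for item, n in counts.items() if n >= preset)


# One sub-recognition result per model; offsets are made up for illustration.
sub_results = [
    [("Ni Chen", 3, 5)],        # precise segmentation
    [("Ni Chengang", 3, 6)],    # structured perceptron
    [("Ni Chengang", 3, 6)],    # segmentation with new word discovery
]
second_result = vote(sub_results, preset=2)
print(second_result)
```

Raising `preset` makes the output stricter: with `preset=3` none of the toy entities would be output, because the three models do not all agree.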

Fusing the recognition results obtained by the above models can therefore compensate for the inherent shortcomings of each individual model, so that the recognition result is optimal.

In this application, the mode is determined by the F value of the experimental results. As shown in Experimental Example 1 of this application, when the precise segmentation algorithm (which combines a language model, sequence labeling, and a hidden Markov model), the segmentation algorithm with a new word discovery function, and the structured perceptron named entity recognition algorithm are used, the result is best when the mode is 3.

The applicant has found that using the output condition to decide whether to output the recognition results in the sub-recognition result list removes misrecognized results, such as wrong identifications, to the greatest possible extent and thereby improves the recall of the final recognition result.

The applicant has found that recognizing the corpus text with at least two recognition models identifies named entities more accurately: multiple weak recognition models are combined into one strong recognition model that supplements the basic result and thus improves the recognition result.

In another preferred embodiment of this application, the step of recognizing the named entities in the corpus text using the first recognition model to obtain the first recognition result may also be:

In this embodiment, the named entities recognized by the at least two recognition models are each marked with first position information.

The output condition is that, in the sub-recognition result list, the number of identical named entities reaches a preset value. Whether the named entities recognized by the different recognition models are identical is judged by whether their first position information is identical, and the preset value is the mode of the at least two recognition models.

In this embodiment, the second recognition model is a conditional random field model.

In this embodiment, the second recognition result is marked with position information.

In this application, the step of fusing the first recognition result and the second recognition result to obtain the third recognition result includes:

Judging whether the first recognition result and the second recognition result meet a fusion condition, fusing them if they do, and outputting the fused result, namely the third recognition result.

The applicant has found that fusing the first recognition result with the second recognition result removes the named entities that are repeated in both results, which avoids data redundancy and thereby improves the precision and recall of recognition.

In this application, fusion means adding to the first recognition result the named entities that are newly added in the second recognition result.

In this application, the fusion condition is that the second recognition result contains named entities that are new relative to the first recognition result.

In a preferred embodiment of this application, it is judged whether the second position information and the first position information are identical; if they differ, the named entity is judged to be a newly added named entity of the second recognition result.
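Fusion by position information can be sketched as below, assuming each entity is a tuple `(text, start, end)` and that "position information" means the `(start, end)` character offsets. The function name `fuse`, the offsets, and the use of "Qinghe Oak Tree Bay" as a second entity are hypothetical choices for illustration.

```python
def fuse(first, second):
    """Add to `first` the entities of `second` whose position does not occur in `first`."""
    seen = {(start, end) for _, start, end in first}
    third = list(first)
    for text, start, end in second:
        if (start, end) not in seen:   # new relative to the first recognition result
            seen.add((start, end))
            third.append((text, start, end))
    return third


first_result = [("Ni Chengang", 3, 6)]
second_result = [("Ni Chengang", 3, 6), ("Qinghe Oak Tree Bay", 12, 17)]
third_result = fuse(first_result, second_result)
print(third_result)
```

Entities that appear in both results are kept once, which matches the statement above that fusion removes repeated named entities.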

Optionally, the semantic mining system marks a role on each named entity in the third recognition result and outputs each named entity with its role.

In this application, the semantic mining system can not only assign roles but also judge a named entity recognition result to determine whether it is a named entity.

The semantic mining system includes regular expressions and text.

To aid understanding of the multi-strategy fusion named entity recognition method described in this application, a specific embodiment is set out below.

Build a corpus.

Perform part-of-speech tagging and sequence labeling on the corpus text in the corpus, that is, on each clause of the corpus text. During sequence labeling, the characters belonging to a named entity are labeled B, M, and E, and the remaining characters are labeled S, which yields the training text strings. Suppose one training text string is "check to have in discovery satchel through people's police and three see identity card perhaps" (roughly: on inspection, the police found an identity card bearing a person's name in the satchel); its labeling result is shown in Table 1.

Table 1. Example of text string labeling
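The B/M/E/S labeling scheme described above can be sketched as follows. Table 1 itself is not reproduced in this text, so the units and the entity span below are hypothetical and are not the contents of the table. The function name `label_bmes` and the handling of single-unit entities are also assumptions.

```python
def label_bmes(units, entity_spans):
    """Label units inside entity spans with B/M/E and every other unit with S."""
    labels = ["S"] * len(units)
    for start, end in entity_spans:          # end is exclusive
        if end - start < 2:
            continue                          # single-unit entities: policy not stated in the text
        labels[start] = "B"                   # first unit of the entity
        labels[end - 1] = "E"                 # last unit of the entity
        for i in range(start + 1, end - 1):
            labels[i] = "M"                   # units in between
    return list(zip(units, labels))


# Hypothetical units standing in for the characters of a sentence.
units = ["the", "victim", "Ni", "Chen", "gang", "called", "police"]
labeled = label_bmes(units, [(2, 5)])
for unit, label in labeled:
    print(unit, label)
```

Each labeled pair, together with its part-of-speech tag, forms one row of the training data passed to the CRF toolkit.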

Use the labeling results of a large number of training text strings as training data and train with CRF.

Suppose the user input currently received is "victim Ni Chengang reported to the police that a mobile phone had gone missing at Qinghe Oak Tree Bay". Performing named entity recognition on this user input with the CRF model obtained in the preceding steps yields the named entity "Ni Chengang".

The CRF result is then supplemented and corrected by an ensemble of several methods. For instance, the precise segmentation result recognizes the named entity in the example above as "Ni Chen", the structured perceptron recognizes it as "Ni Chengang", and the model with a new word discovery function recognizes it as "Ni Chengang". Taking the mode of the recognition results of these methods determines that the named entity recognition result is "Ni Chengang" rather than "Ni Chen".

Using a regular expression or text present in the semantic mining system, such as "victim reports to the police", it can be determined on the one hand that "Ni Chengang" is a correct named entity recognition result, and on the other hand that the role is "victim".
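Role assignment with a regular expression can be sketched as below. The pattern, the English sample sentence, and the names `ROLE_PATTERNS` and `assign_roles` are hypothetical; the actual rules of the semantic mining system are not disclosed in this text.

```python
import re

# One hypothetical rule: "victim <name> reported ..." marks <name> as the victim.
ROLE_PATTERNS = {
    "victim": re.compile(r"\bvictim\s+(?P<name>.+?)\s+reported\b"),
}


def assign_roles(text, entities):
    """Return {entity: role} for recognized entities that a role pattern confirms."""
    roles = {}
    for role, pattern in ROLE_PATTERNS.items():
        for match in pattern.finditer(text):
            name = match.group("name")
            if name in entities:      # the rule also confirms the entity boundary
                roles[name] = role
    return roles


text = "victim Ni Chengang reported to the police that a mobile phone went missing"
roles = assign_roles(text, {"Ni Chengang"})
print(roles)
```

Because the pattern must match the full name between "victim" and "reported", a truncated candidate such as "Ni Chen" receives no role, which mirrors the double use of the rule described above: confirming the entity and assigning its role.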

According to a second aspect of this application, as shown in Fig. 3, a multi-strategy fusion named entity recognition device is also provided. The device includes:

A corpus text acquisition unit 201, for acquiring corpus text;

A first recognition unit 202, for recognizing the named entities in the corpus text using a first recognition model to obtain a first recognition result;

A second recognition unit 203, for recognizing the named entities in the corpus text using a second recognition model to obtain a second recognition result;

A recognition result fusion unit 204, for fusing the first recognition result and the second recognition result to obtain a third recognition result;

Optionally, the device further includes a role assignment unit 205, for assigning roles to the third recognition result using a semantic mining system to generate named entities with roles.

In an optional embodiment of this application, the first recognition model is a conditional random field model.

Building a corpus;

Optionally, the second recognition unit includes the following subunits:

A recognition result output unit, for judging whether the recognition results in the sub-recognition result list meet the output condition and outputting the second recognition result if they do;

In another optional embodiment of this application, the first recognition unit includes the following subunits:

Building a corpus;

Optionally, the fusion condition is that the second recognition result and the first recognition result contain an identical named entity.

Fig. 4 show can thereon implement embodiment computer system 400 block diagram.Computer system 400 is wrapped Include processor 410, storage medium 420, system storage 430, monitor 440, keyboard 450, mouse 460, the and of network interface 420 Video adapter 480.These parts are coupled by system bus 490.

Storage medium 420 (such as hard disk) stores multiple programs, including operating system, application program and other program moulds Block.User can input into computer system 400 order and information by input equipment, input equipment be, for example, keyboard 450, Touch pad (not shown) and mouse 460.Text and graphical information are shown using monitor 440.

Operating system is on processor 410 and for coordinating and providing in the personal computer system 400 in Fig. 6 Various parts control.Furthermore, it is possible in computer system 400 using computer program with implement it is above-mentioned it is various implement Example.

It will be appreciated that the hardware components shown in Fig. 4 are for illustrative purposes only, and the actual components may vary depending on the computing device deployed to implement the application.

In addition, the computer system 400 can be, for example, a desktop computer, a server computer, a laptop computer, or a wireless device such as a mobile phone, a personal digital assistant (PDA), or a handheld computer.

The embodiments provide an effective method for extracting named entities from a given document collection. The embodiments solve the problem of extracting entities of any type from generally organized web pages at minimal cost. The proposed weighted named entity graph can encode the complex relationships between each named entity and the types of other entities, so propagating seed confidence over the graph can compensate for the lack of web-scale redundancy and can support effective extraction at organization scale. Furthermore, confidence propagation over the named entity graph can be converted into efficient matrix computations, which supports efficient extraction on large-scale collections.

It will be appreciated that embodiments within the scope of the application can be implemented in the form of a computer program product. The computer program product includes computer-executable instructions, such as program code, which can run in any suitable computing environment in combination with a suitable operating system, for example Microsoft Windows, Linux, or UNIX. Embodiments within the scope of the application can also include a program product that includes a computer-readable medium for carrying or storing computer-executable instructions or data structures. Such a computer-readable medium can be any available medium that can be accessed by a general-purpose or special-purpose computer. For example, such a computer-readable medium can include RAM, ROM, EPROM, EEPROM, CD-ROM, magnetic disk storage or other storage devices, or any other medium that can be used to carry or store the desired program code in the form of computer-executable instructions and that can be accessed by a general-purpose or special-purpose computer.

Experimental example

Experimental Example 1: Influence of the mode value on the F value in the second recognition

When different preset values are used in the second recognition step, the final named entity recognition results differ significantly. This experimental example investigates the influence of the preset value on the named entity recognition results.

The preset value is the mode of the at least two recognition models;

The named entity recognition results are measured by the F value; the higher the F value, the more reliable the recognition result, wherein,

Precision (P) = number of correctly recognized named entities / number of named entities recognized by the machine,

Recall (R) = number of correctly recognized named entities / number of named entities in the test corpus.

F value = 2*P*R / (P + R).
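The three measures above can be computed as in the following minimal sketch. It assumes that the recognized entities and the reference entities of the test corpus are each available as a collection of entity strings and that a recognition counts as correct when the same string appears in both; the function name `precision_recall_f` and this matching rule are illustrative assumptions, not details given in the application.

```python
def precision_recall_f(recognized, gold):
    """Compute precision P, recall R and F value for one recognition result.

    recognized: entities output by the machine (assumed: iterable of strings)
    gold: entities annotated in the test corpus (assumed: iterable of strings)
    """
    recognized, gold = set(recognized), set(gold)
    correct = len(recognized & gold)  # entities that are both output and annotated
    p = correct / len(recognized) if recognized else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f
```

With 4 recognized entities, 5 annotated entities, and 3 in common, this gives P = 0.75 and R = 0.6.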

The recognition models used in the second recognition in this experimental example include a precise word segmentation algorithm, a word segmentation algorithm with new word discovery, and a structured perceptron named entity recognition algorithm, wherein,

Precise word segmentation is a segmentation algorithm that combines a language model, sequence labeling, and a hidden Markov model (HMM); preferably, coarse segmentation is first performed using N-gram and HMM, and fine segmentation is then performed using CRF;

The word segmentation algorithm with new word discovery finds new words in the text by means of a rule-based or statistical recognition model;

The structured perceptron is used to solve the sequence labeling problem.

The results of this experimental example are shown in Fig. 5 and Table 1.

Table 1: Influence of the preset value on the named entity recognition results

In Fig. 5, broken line A is the recall curve for each preset value; broken line B is the F value curve for each preset value; broken line C is the precision curve for each preset value.

As can be seen from Fig. 5 and Table 1, in this experimental example the F value reaches its maximum when the mode value is 3.
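The output condition studied in this experimental example can be sketched as a simple vote. The sketch assumes that each recognition model returns its sub-recognition result as a list of entity strings and that an entity is output when the number of models producing it reaches the preset value; the function name `vote` and the string-equality comparison of entities are illustrative assumptions.

```python
from collections import Counter


def vote(sub_results, preset_value):
    """Return the entities found by at least `preset_value` recognition models.

    sub_results: the sub-recognition result list, one entity list per model.
    """
    counts = Counter()
    for result in sub_results:
        counts.update(set(result))  # each model votes at most once per entity
    return {entity for entity, n in counts.items() if n >= preset_value}
```

With three models, a preset value of 3 keeps only the entities on which all three models agree, while a lower preset value admits more candidates and so trades precision for recall.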

Experimental Example 2: Named entity recognition results when each recognition model is used alone

This experimental example tests the named entity recognition results of each recognition model used alone, in order to compare the reliability of the single-model and the multi-model fusion named entity recognition methods.

The recognition models used in this experimental example are the CRF recognition model used in the preliminary recognition, and the precise word segmentation algorithm, the word segmentation algorithm with new word discovery, and the structured perceptron named entity recognition algorithm used in the second recognition. The results are shown in Fig. 6 and Table 2.

Table 2: Reliability of the single-model named entity recognition methods

In Fig. 6, broken line A is the recall curve for each recognition method; broken line B is the F value curve for each recognition method; broken line C is the precision curve for each recognition method.

As can be seen from Fig. 6 and Table 2, compared with the multi-model fusion named entity recognition method (i.e., the multi-strategy fusion named entity recognition method; see the result of Experimental Example 1 with a mode value of 3), the single-model named entity recognition methods have lower F values. That is, the named entity recognition results obtained with the multi-model fusion method provided by the application are more reliable and stable.

Experimental Example 3: Named entity recognition results of each recognition model in the method of the application

This experimental example uses the method provided by the application and calculates the precision, recall, and F value of the first recognition result, the second recognition result, and the third recognition result respectively. The results are shown in Table 3 below.

Table 3: Named entity recognition results of each recognition model in the method of the application

As shown in Table 3, with the method provided by the application, the third recognition result obtained on the basis of the first and second recognition results shows considerably improved precision, recall, and F value. That is, the method provided by the application can cope with new conditions such as massive data scale, diversified entity types, and a constant stream of new words, with higher recall and precision.
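The step that produces the third recognition result can be sketched as follows. The sketch assumes the definition of fusion given in the claims: the named entities newly found in the second recognition result are added to the first recognition result. Representing each result as a list of entity strings and the function name `fuse` are illustrative assumptions.

```python
def fuse(first_result, second_result):
    """Add entities that are new in the second result to the first result."""
    fused = list(first_result)
    seen = set(fused)
    for entity in second_result:
        if entity not in seen:  # fusion condition: new relative to the first result
            fused.append(entity)
            seen.add(entity)
    return fused
```

Under this reading the third recognition result is never smaller than the first, which is consistent with fusion supplementing the first recognition result.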

The multi-strategy fusion named entity recognition method and recognition device provided by the application have the following beneficial effects:

(1) Through the preliminary recognition step, the scheme provided by the application can perform named entity recognition on new data or new domains, thereby meeting the demand for named entity recognition when the data scale is massive, entity types are diversified, and new words keep emerging;

(2) The second recognition step fuses multiple named entity recognition models, combining several weak recognition models into one strong recognition model that supplements the first recognition result, thereby improving the precision and recall of the recognition result;

(3) A semantic mining system is used to label the roles of the named entities obtained by the second recognition, thereby obtaining named entities with assigned roles;

(4) The method provided by the application can easily be migrated to new data and new domains;

(5) The method provided by the application has high precision and recall, and its F value reaches more than 0.8.

The application has been described in detail above with reference to specific embodiments and illustrative examples, but these descriptions should not be construed as limiting the application. Those skilled in the art will understand that, without departing from the spirit and scope of the application, various equivalent substitutions, modifications, or improvements may be made to the technical solutions of the application and its embodiments, all of which fall within the scope of the application. The scope of protection of the application is defined by the appended claims.

Claims

1. A multi-strategy fusion named entity recognition method, characterized by comprising:

Obtaining corpus text;

2. The recognition method according to claim 1, characterized in that

The first recognition model is a conditional random field (CRF) model;

Before the step of recognizing the named entities in the corpus text using the first recognition model to obtain the first recognition result, the method further comprises:

Building a corpus;

Using the annotated corpus text as training data and training with a CRF toolkit to obtain the first recognition model.

3. The recognition method according to claim 2, characterized in that the step of recognizing the named entities in the corpus text using the second recognition model to obtain the second recognition result comprises:

Recognizing the named entities in the corpus text using at least two recognition models, each recognition model producing one sub-recognition result, and generating a sub-recognition result list;

Judging whether the recognition results in the sub-recognition result list meet an output condition, and outputting the second recognition result if the condition is met;

The output condition is that, in the sub-recognition result list, the number of identical named entities reaches a preset value, wherein the preset value is the mode of the at least two recognition models.

4. The recognition method according to claim 3, characterized in that the step of fusing the first recognition result and the second recognition result to obtain the third recognition result comprises:

Judging whether the first recognition result and the second recognition result meet a fusion condition, and if so, performing fusion and outputting the fused result, i.e., the third recognition result;

The fusion refers to adding, on the basis of the first recognition result, the named entities newly added in the second recognition result;

The fusion condition is that the second recognition result contains named entities that are newly added relative to the first recognition result.

5. The recognition method according to claim 1, characterized in that, after the third recognition result is obtained, the method further comprises:

Assigning roles to the third recognition result using a semantic mining system to generate named entities with roles, wherein,

The role assignment consists of using the semantic mining system to label the role of each named entity in the third recognition result and outputting each named entity with its role;

The semantic mining system includes regular expressions and text.

6. A multi-strategy fusion named entity recognition device, characterized in that the named entity recognition device comprises:

A corpus text acquisition unit, configured to obtain corpus text;

A first recognition unit, configured to recognize the named entities in the corpus text using a first recognition model to obtain a first recognition result;

A second recognition unit, configured to recognize the named entities in the corpus text using a second recognition model to obtain a second recognition result;

A recognition result fusion unit, configured to fuse the first recognition result and the second recognition result to obtain a third recognition result.

7. The recognition device according to claim 6, characterized in that

The first recognition model is a conditional random field (CRF) model;

The first recognition unit further includes a model training unit, and the model training unit is configured to:

Build a corpus;

Use the annotated corpus text as training data and train with a CRF toolkit to obtain the first recognition model.

8. The recognition device according to claim 7, characterized in that the second recognition unit includes the following subunits:

A multi-strategy recognition unit, configured to recognize the named entities in the corpus text using at least two recognition models, each recognition model producing one sub-recognition result, and to generate a sub-recognition result list;

A recognition result output unit, configured to judge whether the recognition results in the sub-recognition result list meet an output condition, and to output the second recognition result if the condition is met;

9. The recognition device according to claim 8, characterized in that

The recognition result fusion unit is configured to judge whether the first recognition result and the second recognition result meet a fusion condition, and if so, to perform fusion and output the fused result, i.e., the third recognition result;

10. The recognition device according to claim 6, characterized by further comprising a role assignment unit, configured to assign roles to the third recognition result using a semantic mining system and to generate named entities with roles, wherein,

The role assignment unit is configured to use the semantic mining system to label the role of each named entity in the third recognition result and to output each named entity with its role;

The semantic mining system includes regular expressions and text.