CN107330011B - Multi-strategy fusion named entity recognition method and device - Google Patents

Multi-strategy fusion named entity recognition method and device

Info

Publication number
CN107330011B
Authority
CN
China
Prior art keywords
recognition result
corpus
named entity
model
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710447439.2A
Other languages
Chinese (zh)
Other versions
CN107330011A (en
Inventor
赵红红
王萌萌
晋耀红
蒋宏飞
杨凯程
董铭慆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dingfu Intelligent Technology Co., Ltd
Original Assignee
China Science And Technology (beijing) Co Ltd
Beijing Shenzhou Taiyue Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Science And Technology (beijing) Co Ltd, Beijing Shenzhou Taiyue Software Co Ltd filed Critical China Science And Technology (beijing) Co Ltd
Priority to CN201710447439.2A priority Critical patent/CN107330011B/en
Publication of CN107330011A publication Critical patent/CN107330011A/en
Application granted granted Critical
Publication of CN107330011B publication Critical patent/CN107330011B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Abstract

This application discloses a multi-strategy fusion named entity recognition method and device. A first recognition model is used to recognize the named entities in a corpus, yielding a first recognition result. In the method provided by this application, the first recognition model can update and expand the corpus database, so that newly emerging named entities in the corpus can be recognized and the first recognition result has higher precision. A method that fuses multiple recognition models is then used to recognize the named entities in the corpus, yielding a second recognition result. The first recognition result and the second recognition result are fused to obtain a third recognition result. A semantic mining system then assigns roles to the third recognition result and outputs the named entities with roles. In this way, named entities are recognized reliably, and roles are assigned to the recognized named entities, even when data is massive, entity types are diverse and new words emerge continually.

Description

Multi-strategy fusion named entity recognition method and device
Technical field
This application relates to the field of natural language processing, and in particular to a multi-strategy fusion named entity recognition method and device.
Background
Named entities are person names, organization names, place names and all other entities identified by a name. They are the basic information elements of a text and an important carrier of information, and they are the basis for correctly understanding and processing textual information. Chinese named entity recognition is one of the basic tasks in natural language processing. Its main task is to recognize and classify the named entities and meaningful numeral-classifier phrases that occur in a text, mainly including person names, place names, organization names, temporal expressions, dates, numerical expressions and so on.
In natural language processing research, named entity recognition plays an important role in application fields such as information retrieval, information extraction, machine translation and text classification. It can significantly improve the performance of application systems for information retrieval, summary extraction, information extraction, machine translation and text classification, and it lays the foundation for automatically acquiring knowledge from text. The precision and recall of named entity recognition directly determine the performance of the whole language understanding process, including syntactic analysis and semantic analysis.
Over the past decade or so, scholars in China and abroad have extensively discussed and studied techniques for recognizing entities in text. However, with the rapid development of the Internet, large volumes of free-form, multi-domain text data keep growing, which places new demands on the precision and recall of named entity recognition. In addition, the market also needs roles to be assigned to the recognized named entities. Therefore, whether to meet market demand or to improve precision and recall, named entity recognition methods need further improvement.
Commonly used named entity recognition methods currently fall into two broad classes: rule- and knowledge-based methods, and statistics-based methods. Rule- and knowledge-based methods were the earliest to be used. They are simple and convenient, but they require a great deal of manual observation and port poorly. Statistics-based methods either treat named entity recognition as a classification problem, using classifiers such as support vector machines or Bayesian models, or treat it as a sequence labeling problem, using machine learning models such as hidden Markov models, maximum entropy Markov models and conditional random fields to obtain a sequence labeling model. However, these methods either struggle to perform named entity recognition on the large volumes of free-form, multi-domain, rapidly changing text now being produced, or have low precision and recall.
For example, Chinese patent CN201610943210.3 discloses an artificial-intelligence-based named entity recognition method and device. That method performs named entity recognition on the text to be recognized using both a conditional random field model and a function model generated from the retrieval logs within a preset time period. Its defect lies in the second recognition: the function model for preset entity words first obtains all candidate named entity words in the text to be recognized through dictionaries, rule matching and similar means, and then judges the confidence of each candidate as a named entity word. Rule-based methods tend to depend on the specific language, domain and text format; compiling the rules is time-consuming and error-prone and requires experienced linguists; and dictionary coverage is relatively low. That method therefore struggles to perform named entity recognition on the large volumes of free-form, multi-domain, rapidly changing text now being produced.
As another example, Chinese patent CN201510889318.4 discloses a named entity recognition method suitable for social networks. That method uses an initially constructed first sequence labeling model to obtain a first entity probability distribution for the training documents and a second entity probability distribution for the test documents, then extracts similarity features from social network information, retrains on those similarity features to obtain a second sequence labeling model, and finally performs sequence labeling on the test documents with the second sequence labeling model to obtain the named entity recognition result. The precision and recall of that method are low, and its F value is only 33.19%.
Therefore, there is a need to develop a named entity recognition method and a named entity recognition device that can cope with the new situation of massive data, diverse entity types and continually emerging new words, that have higher recall and precision, and that can also assign roles to the recognized named entities.
Summary of the invention
This application provides a multi-strategy fusion named entity recognition method and device, to solve the problem that, when data is massive, entity types are diverse and new words emerge continually, the precision and recall of named entity recognition are low and roles cannot be assigned to named entities.
In a first aspect, this application provides a multi-strategy fusion named entity recognition method, the recognition method comprising:
Obtaining a corpus;
Recognizing the named entities in the corpus using a first recognition model to obtain a first recognition result;
Recognizing the named entities in the corpus using a second recognition model to obtain a second recognition result;
Fusing the first recognition result and the second recognition result to obtain a third recognition result.
Optionally, the first recognition model is a conditional random field model.
Optionally, before the step of recognizing the named entities in the corpus using the first recognition model to obtain the first recognition result, the method further comprises:
Building a corpus database;
Performing part-of-speech tagging and sequence labeling on the corpus texts in the corpus database;
Using the labeled corpus texts as training data and training with a CRF toolkit to obtain the first recognition model.
Optionally, the step of recognizing the named entities in the corpus using the second recognition model to obtain the second recognition result comprises:
Recognizing the corpus using at least two recognition models, each recognition model producing one sub-recognition result, and generating a list of sub-recognition results;
Judging whether the recognition results in the list of sub-recognition results satisfy an output condition and, if so, outputting the second recognition result;
The output condition is that, in the list of sub-recognition results, the number of identical named entities reaches a preset value, wherein the preset value is the mode of the at least two recognition models.
Optionally, the step of recognizing the named entities in the corpus using the first recognition model to obtain the first recognition result comprises:
Recognizing the corpus using at least two recognition models, each recognition model producing one sub-recognition result, and generating a list of sub-recognition results;
Judging whether the recognition results in the list of sub-recognition results satisfy an output condition and, if so, outputting the first recognition result;
The output condition is that, in the list of sub-recognition results, the number of identical named entities reaches a preset value, wherein the preset value is the mode of the at least two recognition models.
The second recognition model is a conditional random field model;
Before the step of recognizing the corpus using the second recognition model to obtain the second recognition result, the method further comprises:
Building a corpus database;
Performing part-of-speech tagging and sequence labeling on the corpus texts in the corpus database;
Using the labeled corpus texts as training data and training with a CRF toolkit to obtain the second recognition model.
The step of fusing the first recognition result and the second recognition result to obtain the third recognition result comprises:
Judging whether the first recognition result and the second recognition result satisfy a fusion condition and, if so, fusing them and outputting the fused result, that is, the third recognition result;
Optionally, the fusion condition is that the first recognition result and the second recognition result contain identical named entities.
Optionally, after the third recognition result is obtained, the method further comprises: assigning roles to the third recognition result using a semantic mining system to generate named entities with roles.
Optionally, the role assignment consists of using the semantic mining system to label each named entity in the third recognition result with a role and to output each named entity with its role.
Optionally, the semantic mining system includes regular expressions and text.
In a second aspect, this application also provides a multi-strategy fusion named entity recognition device, the named entity recognition device comprising:
A corpus acquisition unit, configured to obtain a corpus;
A first recognition unit, configured to recognize the named entities in the corpus using a first recognition model to obtain a first recognition result;
A second recognition unit, configured to recognize the named entities in the corpus using a second recognition model to obtain a second recognition result;
A recognition result fusion unit, configured to fuse the first recognition result and the second recognition result to obtain a third recognition result.
Optionally, the first recognition model is a conditional random field model.
Optionally, the first recognition unit further includes a model training unit, the model training unit being configured to:
Build a corpus database;
Perform part-of-speech tagging and sequence labeling on the corpus texts in the corpus database;
Use the labeled corpus texts as training data and train with a CRF toolkit to obtain the first recognition model.
Optionally, the second recognition unit includes the following subunits:
A multi-strategy recognition unit, configured to recognize the named entities in the corpus using at least two recognition models, each recognition model producing one sub-recognition result, and to generate a list of sub-recognition results;
A recognition result output unit, configured to judge whether the recognition results in the list of sub-recognition results satisfy an output condition and, if so, to output the second recognition result.
Optionally, the output condition is that, in the list of sub-recognition results, the number of identical named entities reaches a preset value, wherein the preset value is the mode of the at least two recognition models.
Optionally, the first recognition unit includes the following subunits:
A multi-strategy recognition unit, configured to recognize the named entities in the corpus using at least two recognition models, each recognition model producing one sub-recognition result, and to generate a list of sub-recognition results;
A recognition result output unit, configured to judge whether the recognition results in the list of sub-recognition results satisfy an output condition and, if so, to output the first recognition result;
The output condition is that, in the list of sub-recognition results, the number of identical named entities reaches a preset value, wherein the preset value is the mode of the at least two recognition models.
Optionally, the second recognition model is a conditional random field model;
The second recognition unit further includes a model training unit, the model training unit being configured to:
Build a corpus database;
Perform part-of-speech tagging and sequence labeling on the corpus texts in the corpus database;
Use the labeled corpus texts as training data and train with a CRF toolkit to obtain the second recognition model.
Optionally, the recognition result fusion unit is configured to judge whether the first recognition result and the second recognition result satisfy a fusion condition and, if so, to fuse them and output the fused result, that is, the third recognition result.
Optionally, the fusion refers to adding, on the basis of the first recognition result, the named entities newly added in the second recognition result;
Optionally, the fusion condition is that the second recognition result contains named entities that are new relative to the first recognition result.
Optionally, the device further includes a role assignment unit, configured to assign roles to the third recognition result using a semantic mining system to generate named entities with roles.
Optionally, the role assignment unit is configured to use the semantic mining system to label each named entity in the third recognition result with a role and to output each named entity with its role.
Optionally, the semantic mining system includes regular expressions and text.
Brief description of the drawings
To explain the technical solution of this application more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, a person of ordinary skill in the art can obtain other drawings from these drawings without any creative effort.
Fig. 1 shows a flow chart of a multi-strategy fusion named entity recognition method provided by an embodiment of this application;
Fig. 2 shows a flow chart of the conditional random field model method provided by an embodiment of this application;
Fig. 3 shows a structural schematic diagram of the named entity recognition device provided by an embodiment of this application;
Fig. 4 shows a structural schematic diagram of the computer system 400 provided by an embodiment of this application;
Fig. 5 shows a line chart of the precision, recall and F value results of Experimental Example 1;
Fig. 6 shows a line chart of the precision, recall and F value results of Experimental Example 2.
Detailed description
The application is described in detail below; its features and advantages will become clearer from this description.
The word "exemplary" here means "serving as an example, embodiment or illustration". Any embodiment described here as "exemplary" should not necessarily be construed as preferred or advantageous over other embodiments. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless otherwise indicated.
The application is described below.
According to a first aspect of this application, a multi-strategy fusion named entity recognition method is provided. A first recognition model is used to recognize the named entities in the corpus, yielding a first recognition result. In the method provided by this application, the first recognition model can update and expand the corpus database, so that newly emerging named entities in the corpus can be recognized and the first recognition result has higher precision. A method that fuses multiple recognition models is then used to recognize the named entities in the corpus, yielding a second recognition result, and the first recognition result and the second recognition result are fused to obtain a third recognition result. Named entities are thereby recognized reliably even when data is massive, entity types are diverse and new words emerge continually. Optionally, a semantic mining system is then used to assign roles to the third recognition result and to output the named entities with roles, so that roles are assigned to the recognized named entities.
Specifically, as shown in Fig. 1, the named entity recognition method includes:
S101: obtaining a corpus;
S102: recognizing the named entities in the corpus using the first recognition model to obtain a first recognition result;
S103: recognizing the named entities in the corpus using the second recognition model to obtain a second recognition result;
S104: fusing the first recognition result and the second recognition result to obtain a third recognition result;
Optionally, the method further includes S105: assigning roles to the third recognition result using a semantic mining system to generate named entities with roles.
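For illustration only, steps S101 to S105 can be sketched as a small Python driver. The function name run_pipeline, the (start, end, text) entity encoding and the callable parameters are assumptions made for this sketch; the application does not prescribe any particular interface.

```python
from typing import Callable, List, Optional, Tuple

# Assumed encoding of a named entity: (start offset, end offset, surface text).
Entity = Tuple[int, int, str]
Recognizer = Callable[[str], List[Entity]]
Fuser = Callable[[List[Entity], List[Entity]], List[Entity]]
RoleAssigner = Callable[[str, Entity], Optional[str]]


def run_pipeline(
    corpus: str,                       # S101: the corpus has already been obtained
    first_model: Recognizer,
    second_model: Recognizer,
    fuse: Fuser,
    assign_role: Optional[RoleAssigner] = None,
) -> List[Tuple[Entity, Optional[str]]]:
    """Run both recognizers, fuse their results, and optionally assign roles."""
    first_result = first_model(corpus)                 # S102
    second_result = second_model(corpus)               # S103
    third_result = fuse(first_result, second_result)   # S104
    if assign_role is None:
        # S105 is optional: without it, every entity is returned with no role.
        return [(entity, None) for entity in third_result]
    return [(entity, assign_role(corpus, entity)) for entity in third_result]  # S105
```

Passing the recognizers, the fusion rule and the role assigner as callables keeps the driver independent of which concrete models are chosen for the first and second recognition steps.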
In this application, the corpus refers to the text used for training or recognition.
In a preferred embodiment of this application, the first recognition model is a conditional random field model, that is, a CRF (Conditional Random Fields) model. A CRF model computes a global probability: when normalizing, it considers the distribution of the data globally rather than normalizing only locally, and thus avoids the label bias problem.
In this application, as shown in Fig. 2, when a CRF model is selected as the first recognition model, the following steps are performed before the corpus is recognized using the first recognition model to obtain the first recognition result:
S301: building a corpus database;
S302: performing part-of-speech tagging and sequence labeling on the corpus texts in the corpus database;
S303: using the labeled corpus texts as training data and training with a CRF toolkit to obtain the first recognition model.
In this application, the corpus database refers to the collection of corpus texts in which named entities are to be recognized. For example, in a person-name recognition method for a public security system, the corpus database is a collection of written records (notes); in a named entity recognition method for a medical system, the corpus database is a collection of medical cases; where there is no specific domain, the corpus database can also be a collection of corpus texts crawled from the web.
In this application, building the corpus database includes corpus import, that is, importing corpus texts into the corpus database.
In this application, the corpus texts in the corpus database are first processed into a format that the CRF can recognize; that is, part-of-speech tagging and sequence labeling are performed on the corpus texts to obtain training text strings and test text strings. The labeled training text strings serve as training data and the labeled test text strings serve as test data.
In this application, when the CRF model is trained, specific features of the training data are obtained according to feature templates, and training is then performed on the specific features together with the part-of-speech tagging and sequence labeling results to obtain the CRF model. The specific features include context features, part-of-speech features and the like.
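As a rough illustration of what context and part-of-speech features might look like, the sketch below builds a feature dictionary for one token. The function name token_features, the feature keys and the one-token window are assumptions for this sketch; the application does not disclose its feature templates or the CRF toolkit's input format.

```python
from typing import Dict, List, Tuple


def token_features(sentence: List[Tuple[str, str]], i: int) -> Dict[str, str]:
    """Build context and part-of-speech features for the token at index i.

    `sentence` is a list of (word, pos_tag) pairs. The window of one token on
    each side is an illustrative choice, not taken from the application.
    """
    word, pos = sentence[i]
    feats = {"w0": word, "p0": pos}          # the current word and its POS tag
    if i > 0:
        feats["w-1"], feats["p-1"] = sentence[i - 1]   # previous word and tag
    else:
        feats["BOS"] = "1"                   # marks the beginning of the sentence
    if i < len(sentence) - 1:
        feats["w+1"], feats["p+1"] = sentence[i + 1]   # next word and tag
    else:
        feats["EOS"] = "1"                   # marks the end of the sentence
    return feats
```

Feature dictionaries of this kind, one per token and paired with the sequence labels, are a common input shape for CRF training libraries; the exact format depends on the toolkit used.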
In this application, after the CRF model is trained, the training result is tested with the test data. When the F value of the recognition result is below 0.8, training data and test data are reacquired and training continues; after training, the newly acquired test data is used for testing. While the F value of the recognition result remains below 0.8, these steps are repeated; once the F value of the training result reaches 0.8 or higher, training stops and the first recognition model is obtained.
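The stopping test described above can be illustrated with a short sketch. It assumes that the F value is the balanced F1 score computed over exact entity matches, which the application does not state explicitly; the names precision_recall_f and needs_more_training are hypothetical.

```python
from typing import Set, Tuple

Entity = Tuple[int, int, str]  # assumed encoding: (start, end, text)

F_THRESHOLD = 0.8  # the threshold named in the description


def precision_recall_f(predicted: Set[Entity], gold: Set[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for exact-match entity recognition."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value


def needs_more_training(f_value: float, threshold: float = F_THRESHOLD) -> bool:
    """True while the F value is below the threshold, so training should continue."""
    return f_value < threshold
```

With this helper, the loop in the description amounts to retraining on newly acquired data for as long as needs_more_training returns True on the fresh test data.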
In this embodiment, the named entities recognized by the first recognition model are marked with first position information.
In this embodiment, the step of recognizing the named entities in the corpus using the second recognition model to obtain the second recognition result comprises:
Recognizing the corpus using at least two recognition models, each recognition model producing one sub-recognition result, and generating a list of sub-recognition results;
Judging whether the recognition results in the list of sub-recognition results satisfy an output condition and, if so, outputting the second recognition result;
The output condition is that, in the list of sub-recognition results, the number of identical named entities reaches a preset value, wherein the preset value is the mode of the at least two recognition models.
In this application, the at least two recognition models include word segmentation models and named entity recognition models.
In this application, the word segmentation models include an n-gram segmentation model (first-order Markov chain), an HMM segmentation model (hidden Markov model) and a segmentation model with new-word discovery.
In this application, the named entity recognition models include a maximum-entropy-based named entity recognition model and a structured-perceptron-based named entity recognition model.
In this application, the n-gram segmentation model first collects n-gram statistics and then, based on those statistics, segments the corpus in which named entities are to be recognized. This approach covers all possibilities but also increases the number of index entries. For example, 2-gram segmentation splits the phrase "coming into search engine" into: "coming into", "into searching", "search for", "index", "engine".
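The overlapping split in the 2-gram example can be sketched as follows. This shows only the enumeration of candidate n-grams, not the statistics-based segmentation itself. The helper name char_ngrams is hypothetical, and the Chinese phrase in the usage note is a presumed back-translation of the example, not text taken from the application.

```python
from typing import List


def char_ngrams(text: str, n: int = 2) -> List[str]:
    """Return all overlapping character n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

Assuming the original example is the six-character phrase "走进搜索引擎", char_ngrams("走进搜索引擎") yields five two-character units, which matches the five pieces listed in the example and shows why the number of index entries grows.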
In this application, the HMM segmentation model obtains the HMM parameters from a labeled segmentation training set and then decodes the corpus in which named entities are to be recognized with the Viterbi algorithm to obtain the segmentation result. This model relies on the output independence assumption and does not consider contextual features.
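A minimal sketch of Viterbi decoding for an HMM is given below. The state names and all probabilities are made-up toy values for illustration; the application does not disclose its state set or parameters, and a real segmenter would estimate them from the labeled training set.

```python
import math
from typing import Dict, List


def viterbi(
    obs: List[str],
    states: List[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the most probable state sequence for a non-empty `obs`."""

    def log(p: float) -> float:
        return math.log(p) if p > 0 else float("-inf")

    # score[t][s]: best log-probability of any path that ends in state s at time t
    score = [{s: log(start_p.get(s, 0.0)) + log(emit_p[s].get(obs[0], 0.0)) for s in states}]
    back: List[Dict[str, str]] = []  # back[t-1][s]: best predecessor of s at time t
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: score[t - 1][r] + log(trans_p[r].get(s, 0.0)))
            score[t][s] = (
                score[t - 1][prev]
                + log(trans_p[prev].get(s, 0.0))
                + log(emit_p[s].get(obs[t], 0.0))
            )
            back[t - 1][s] = prev
    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy parameters (assumed): B = word begin, E = word end, S = single-character word.
STATES = ["B", "E", "S"]
START_P = {"B": 0.5, "E": 0.0, "S": 0.5}
TRANS_P = {"B": {"E": 1.0}, "E": {"B": 0.5, "S": 0.5}, "S": {"B": 0.5, "S": 0.5}}
EMIT_P = {
    "B": {"a": 0.9, "b": 0.05, "c": 0.05},
    "E": {"a": 0.05, "b": 0.9, "c": 0.05},
    "S": {"a": 0.05, "b": 0.05, "c": 0.9},
}
```

Because each emission depends only on the current state, the decoder cannot use neighbouring observations as features, which is the output independence limitation noted above.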
In this application, the segmentation model with new-word discovery finds named entities in the corpus through a rule-based or statistical recognition model, but it depends relatively heavily on the training corpus.
In this application, the maximum-entropy-based named entity recognition model obtains, among all models that satisfy the constraints, the one with maximum information entropy. By setting constraints, the model's fit to unknown data and to known data can be adjusted, and the model also naturally solves the parameter smoothing problem in statistical models. However, its time and space overheads are large, and data sparsity is a relatively serious problem.
In this application, feature extraction in the structured-perceptron-based named entity recognition model takes the global structured output into account, so that the model can perform global structured learning.
In this embodiment, the named entities recognized by the at least two recognition models are each marked with second position information.
In this application, the output condition is that, in the list of sub-recognition results, the number of identical named entities reaches a preset value. Whether the named entities recognized by the various recognition models are identical is judged by whether their second position information is identical, and the preset value is the mode of the at least two recognition models.
Fusing the recognition results of the above models therefore compensates for the inherent weaknesses of each individual model and optimizes the recognition result.
In this application, the mode is determined by the F value of the experimental results. As shown in Experimental Example 1 of this application, when a precise segmentation algorithm (combining a language model, sequence labeling and a hidden Markov model), a segmentation algorithm with new-word discovery and a structured-perceptron named entity recognition algorithm are used, the result is optimal when the mode is 3.
The applicant has found that using the output condition to decide whether to output the recognition results in the list of sub-recognition results removes misrecognized results, such as incorrect recognitions, to the greatest extent, thereby improving the recall of the final recognition result.
The applicant has found that recognizing the corpus with at least two recognition models identifies named entities more accurately: multiple weak recognition models are combined into one strong recognition model that supplements the basic result and thereby improves the recognition result.
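The output condition can be sketched as a simple vote over the sub-recognition results. The (start, end, text) tuple is an assumed encoding of the position information, and the function name vote and the offsets in the usage note are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

# Assumed encoding: an entity is identified by its position and surface text.
Entity = Tuple[int, int, str]


def vote(sub_results: Iterable[Set[Entity]], preset_value: int) -> List[Entity]:
    """Keep entities that at least `preset_value` models report at the same position.

    Each model contributes a set, so a model can vote for an entity only once.
    """
    counts = Counter(entity for result in sub_results for entity in result)
    return sorted(entity for entity, n in counts.items() if n >= preset_value)
```

For instance, if one model reports (7, 14, "Ni Chen") and two others report (7, 18, "Ni Chengang"), a preset value of 2 keeps only the second entity, because the differing end positions make the two candidates distinct.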
In another preferred embodiment of this application, the step of recognizing the named entities in the corpus using the first recognition model to obtain the first recognition result may also be:
Recognizing the corpus using at least two recognition models, each recognition model producing one sub-recognition result, and generating a list of sub-recognition results;
Judging whether the recognition results in the list of sub-recognition results satisfy an output condition and, if so, outputting the first recognition result;
In this embodiment, the named entities recognized by the at least two recognition models are each marked with first position information.
The output condition is that, in the list of sub-recognition results, the number of identical named entities reaches a preset value. Whether the named entities recognized by the various recognition models are identical is judged by whether their first position information is identical, and the preset value is the mode of the at least two recognition models.
In this embodiment, the second recognition model is a conditional random field model.
In this embodiment, the second recognition result is marked with second position information.
In this application, the step of fusing the first recognition result and the second recognition result to obtain the third recognition result comprises:
Judging whether the first recognition result and the second recognition result satisfy a fusion condition and, if so, fusing them and outputting the fused result, that is, the third recognition result.
The applicant has found that fusing the first recognition result with the second recognition result amounts to removing the named entities duplicated between them, which avoids data redundancy and in turn improves the precision and recall of recognition.
In this application, the fusion refers to adding, on the basis of the first recognition result, the named entities newly added in the second recognition result.
In this application, the fusion condition is that the second recognition result contains named entities that are new relative to the first recognition result.
In a preferred embodiment of this application, it is judged whether the second position information and the first position information are identical; if they differ, the named entity is judged to be a newly added named entity in the second recognition result.
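The position-based fusion rule can be sketched as follows. The (start, end, text) encoding of position information and the function name fuse are assumptions for this sketch.

```python
from typing import List, Tuple

Entity = Tuple[int, int, str]  # assumed encoding: (start, end, text)


def fuse(first_result: List[Entity], second_result: List[Entity]) -> List[Entity]:
    """Add to the first result the second-result entities whose position is new."""
    seen_positions = {(start, end) for start, end, _ in first_result}
    fused = list(first_result)
    for start, end, text in second_result:
        if (start, end) not in seen_positions:   # differing position: newly added entity
            fused.append((start, end, text))
            seen_positions.add((start, end))
    return fused
```

Entities found at the same position by both models appear only once in the fused result, which corresponds to the removal of duplicated named entities described above.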
Optionally, the semantic digging system, names entity to carry out role's label respectively in the third recognition result, And output has the name entity of role respectively.
In this application, the semantic digging system can not only carry out role's distribution, additionally it is possible to name Entity recognition As a result judged, determine whether it is name entity.
The semanteme digging system includes regular expression and text.
To make the multi-strategy fusion named entity recognition method described herein easier to understand, a specific embodiment is set out below.
Establish a corpus.
For the texts in the corpus, i.e. each clause in the corpus, part-of-speech tagging and sequence labelling are performed. During sequence labelling, the words belonging to a named entity are labelled B, M and E and all remaining words are labelled S, which yields a training text string. Suppose a training text string is "upon inspection by the police, Xu Sanguan's identity card was found in the satchel"; the labelling result is shown in Table 1.
Table 1. Text string labelling example
The labelling results of a large number of training text strings are used as training data, and a CRF model is trained on them.
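The B/M/E/S sequence labelling described above can be sketched as follows. The tokenization of the example sentence is a hypothetical English stand-in, the function name `bmes_tags` is introduced here, and tagging a single-word entity as B is an assumption, since the text does not cover that case.

```python
from typing import List, Tuple


def bmes_tags(tokens: List[str], entity_spans: List[Tuple[int, int]]) -> List[str]:
    """Tag words inside an entity span with B/M/E and all other words with S.

    Each span is (start, end) in token indices, with `end` exclusive.
    """
    tags = ["S"] * len(tokens)
    for start, end in entity_spans:
        tags[start] = "B"                      # first word of the entity
        for i in range(start + 1, end - 1):
            tags[i] = "M"                      # middle words of the entity
        if end - start > 1:
            tags[end - 1] = "E"                # last word of the entity
        # A one-word entity keeps the single tag B (assumption).
    return tags


# Hypothetical segmentation of the example sentence; "Xu San guan" is the entity.
tokens = ["police", "found", "Xu", "San", "guan", "identity-card"]
tags = bmes_tags(tokens, [(2, 5)])
```

Pairs of word and tag produced this way, together with part-of-speech tags, are the kind of column data a CRF toolkit is typically trained on.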
Suppose the user input currently received is "victim Ni Chengang reported to the police that he found his mobile phone missing at Qinghe Oak Tree Bay". Applying the CRF model obtained in the preceding step to this input performs named entity recognition and yields the named entity "Ni Chengang".
The CRF result is then supplemented and corrected by ensemble learning over several methods. For instance, precise word segmentation recognizes the named entity in the example above as "Ni Chen", the structured perceptron recognizes it as "Ni Chengang", and the segmenter with new word discovery recognizes it as "Ni Chengang". Taking the mode of the recognition results of these methods determines that the named entity is "Ni Chengang" rather than "Ni Chen".
On the one hand, a regular expression or text present in the semantic mining system, such as "victim ... reported to the police", can confirm that "Ni Chengang" is a correct named entity recognition result; on the other hand, it can determine the role to be "victim".
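A rule of this kind can be sketched with a regular expression. The pattern table, the English wording of the rule, and the function name `assign_roles` are hypothetical; the patent does not disclose the actual rules of the semantic mining system.

```python
import re
from typing import Dict, List, Optional

# Hypothetical rule: "victim <entity> reported" marks the entity as the victim.
ROLE_TEMPLATES = {
    "victim": "victim {entity} reported",
}


def assign_roles(sentence: str, entities: List[str]) -> Dict[str, Optional[str]]:
    """Map each candidate entity to a role, or to None if no rule confirms it."""
    roles: Dict[str, Optional[str]] = {}
    for entity in entities:
        roles[entity] = None
        for role, template in ROLE_TEMPLATES.items():
            pattern = template.format(entity=re.escape(entity))
            if re.search(pattern, sentence):
                roles[entity] = role
                break
    return roles


sentence = ("victim Ni Chengang reported to the police that he found "
            "his mobile phone missing at Qinghe Oak Tree Bay")
roles = assign_roles(sentence, ["Ni Chengang", "Ni Chen"])
```

Here the rule confirms "Ni Chengang" and assigns it the role "victim", while the truncated candidate "Ni Chen" matches no rule and is left without a role, which illustrates how the same rule can both validate an entity and label it.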
According to a second aspect of the application, as shown in Fig. 3, a multi-strategy fusion named entity recognition device is further provided. The multi-strategy fusion named entity recognition device includes:
Corpus acquiring unit 201, for obtaining a corpus;
First recognition unit 202, for identifying the named entities in the corpus using the first recognition model to obtain the first recognition result;
Second recognition unit 203, for identifying the named entities in the corpus using the second recognition model to obtain the second recognition result;
Recognition result fusion unit 204, for fusing the first recognition result and the second recognition result to obtain the third recognition result;
Optionally, the device further includes role assignment unit 205, for assigning roles to the third recognition result using the semantic mining system to generate named entities with roles.
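The way these units fit together can be sketched as a small pipeline. The class name `NerPipeline`, its field names, and the stub recognizers are assumptions made for illustration; only the unit numbers in the comments come from the description.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Union

Recognizer = Callable[[str], List[str]]


@dataclass
class NerPipeline:
    first_recognizer: Recognizer                          # first recognition unit 202
    second_recognizer: Recognizer                         # second recognition unit 203
    fuse: Callable[[List[str], List[str]], List[str]]     # recognition result fusion unit 204
    assign_roles: Optional[Callable[[str, List[str]], Dict[str, Optional[str]]]] = None  # optional unit 205

    def run(self, corpus: str) -> Union[List[str], Dict[str, Optional[str]]]:
        first = self.first_recognizer(corpus)
        second = self.second_recognizer(corpus)
        third = self.fuse(first, second)
        if self.assign_roles is None:
            return third                                  # third recognition result only
        return self.assign_roles(corpus, third)           # entities with roles


# Stub components standing in for real models.
pipeline = NerPipeline(
    first_recognizer=lambda text: ["Ni Chengang"],
    second_recognizer=lambda text: ["Ni Chengang", "Qinghe"],
    fuse=lambda a, b: a + [e for e in b if e not in a],
)
result = pipeline.run("some input text")
```

Keeping each unit behind a plain callable makes it straightforward to swap either recognition model or to omit role assignment, which mirrors the optional nature of unit 205.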
In an optional embodiment of the application, the first recognition model is a conditional random field model.
Optionally, the first recognition unit further includes a model training unit, the model training unit being used for:
Establishing a corpus;
Performing part-of-speech tagging and sequence labelling on the texts in the corpus;
Using the labelled texts as training data and training with a CRF toolkit to obtain the first recognition model.
Optionally, the second recognition unit includes the following subunits:
Multi-strategy recognition unit, for identifying the named entities in the corpus using at least two recognition models, each recognition model yielding one sub-recognition result, to generate a sub-recognition result list;
Recognition result output unit, for judging whether the recognition results in the sub-recognition result list meet the output condition, and outputting the second recognition result if so;
Optionally, the output condition is that, in the sub-recognition result list, the number of identical named entities reaches a preset value, wherein the preset value is the mode of the at least two recognition models.
In another optional embodiment of the application, the first recognition unit includes the following subunits:
Multi-strategy recognition unit, for identifying the named entities in the corpus using at least two recognition models, each recognition model yielding one sub-recognition result, to generate a sub-recognition result list;
Recognition result output unit, for judging whether the recognition results in the sub-recognition result list meet the output condition, and outputting the first recognition result if so;
The output condition is that, in the sub-recognition result list, the number of identical named entities reaches a preset value, wherein the preset value is the mode of the at least two recognition models.
Optionally, the second recognition model is a conditional random field model;
The second recognition unit further includes a model training unit, the model training unit being used for:
Establishing a corpus;
Performing part-of-speech tagging and sequence labelling on the texts in the corpus;
Using the labelled texts as training data and training with a CRF toolkit to obtain the second recognition model.
Optionally, the recognition result fusion unit is used for judging whether the first recognition result and the second recognition result meet the fusion condition, fusing them if so, and outputting the fused result, i.e. the third recognition result.
Optionally, the fusion condition is that the second recognition result and the first recognition result contain identical named entities.
Optionally, the role assignment unit is used for labelling each named entity in the third recognition result with a role using the semantic mining system and outputting each named entity together with its role.
Optionally, the semantic mining system includes regular expressions and text.
Fig. 4 shows a block diagram of a computer system 400 on which the embodiments can be implemented. Computer system 400 includes processor 410, storage medium 420, system memory 430, monitor 440, keyboard 450, mouse 460, network interface 470 and video adapter 480. These components are coupled by system bus 490.
Storage medium 420 (such as a hard disk) stores multiple programs, including an operating system, application programs and other program modules. A user can enter commands and information into computer system 400 through input devices such as keyboard 450, a touch pad (not shown) and mouse 460. Monitor 440 is used to display text and graphical information.
The operating system runs on processor 410 and is used to coordinate and control the various components of the computer system 400 in Fig. 4. In addition, computer programs can be used to implement the above embodiments on computer system 400.
It will be appreciated that the hardware components shown in Fig. 4 are for illustrative purposes only, and the actual components may vary with the computing device deployed to implement the application.
In addition, computer system 400 can be, for example, a desktop computer, a server computer, a laptop computer, or a wireless device such as a mobile phone, a personal digital assistant (PDA) or a handheld computer.
The embodiments provide an effective method for extracting named entities from a given document collection. The embodiments solve the problem of extracting entities of any type from ordinarily organized web pages at minimal cost. The proposed weighted named entity graph can encode the complex relationships between each named entity and the types of other entities, so propagating seed confidence over the graph can compensate for the lack of web-scale redundancy and can support effective extraction at organization scale. Furthermore, confidence propagation over the named entity graph can be converted into efficient matrix computation, which supports efficient extraction over large collections.
It will be appreciated that embodiments within the scope of the application can be implemented as a computer program product. The computer program product includes computer-executable instructions, such as program code, that can run in any suitable computing environment in combination with a suitable operating system, for example Microsoft Windows, Linux or UNIX. Embodiments within the scope of the application can also include a program product comprising a computer-readable medium for carrying or storing computer-executable instructions or data structures. Such a computer-readable medium can be any available medium that can be accessed by a general-purpose or special-purpose computer. For example, such a computer-readable medium may include RAM, ROM, EPROM, EEPROM, CD-ROM, magnetic disk storage or other storage devices, or any other medium that can be used to carry or store the desired program code in the form of computer-executable instructions and that can be accessed by a general-purpose or special-purpose computer.
Experimental example
Experimental example 1: Influence of the mode value on the F value in the second recognition
In the second recognition step, different preset values lead to markedly different final named entity recognition results. This experimental example examines the influence of the preset value on the named entity recognition result.
The preset value is the mode of the at least two recognition models;
The named entity recognition result is measured by the F value; the higher the F value, the more reliable the recognition result, wherein
Precision (P) = number of correctly recognized named entities / number of named entities recognized by the machine,
Recall (R) = number of correctly recognized named entities / number of named entities in the test corpus,
F value = 2*P*R / (P+R).
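The three measures defined above can be computed as in the following sketch. The function name `prf` and the toy entity sets are hypothetical and are not the experimental data; returning 0.0 when a denominator is zero is a convention chosen here.

```python
from typing import Set, Tuple


def prf(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F value) for a set of predicted entities."""
    correct = len(predicted & gold)                    # correctly recognized named entities
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Toy data for illustration only.
predicted = {"e1", "e2", "e3", "e4"}   # entities recognized by the machine
gold = {"e1", "e2", "e5"}              # entities annotated in the test corpus
p, r, f = prf(predicted, gold)
```

In this toy case two of four predictions are correct (P = 0.5) and two of three annotated entities are found (R = 2/3), so the F value is 4/7.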
The recognition models used in the second recognition in this experimental example are a precise word segmentation algorithm, a segmentation algorithm with new word discovery, and a structured perceptron named entity recognition algorithm, wherein
Precise word segmentation is a segmentation algorithm that combines a language model, sequence labelling and a hidden Markov model; preferably, coarse segmentation is first performed with an N-gram model and a hidden Markov model, and a CRF is then used for fine-grained segmentation;
The segmentation algorithm with new word discovery finds new words in the text through a rule-based or statistical recognition model;
The structured perceptron is used to solve the sequence labelling problem.
The results of this experimental example are shown in Fig. 5 and Table 1.
Table 1. Influence of the preset value on the named entity recognition result
In Fig. 5, polyline A shows the recall for each preset value; polyline B shows the F value for each preset value; polyline C shows the precision for each preset value.
As Fig. 5 and Table 1 show, in this experimental example the F value reaches its maximum when the mode value is 3.
Experimental example 2: Named entity recognition results when each recognition model is used alone
This experimental example tests the named entity recognition results obtained when a single recognition model is used alone, in order to compare the reliability of the single-model method with that of the multi-model fusion method.
The recognition models used in this experimental example are the CRF recognition model used in the preliminary recognition and, from the second recognition, the precise word segmentation algorithm, the segmentation algorithm with new word discovery, and the structured perceptron named entity recognition algorithm. The results are shown in Fig. 6 and Table 2.
Table 2. Reliability of single-model named entity recognition methods
In Fig. 6, polyline A shows the recall for each recognition method; polyline B shows the F value for each recognition method; polyline C shows the precision for each recognition method.
As Fig. 6 and Table 2 show, compared with the multi-model fusion named entity recognition method (i.e. the multi-strategy fusion named entity recognition method; see the result of experimental example 1 with a mode value of 3), the single-model named entity recognition methods have lower F values. That is, the named entity recognition results obtained with the multi-model fusion method provided by this application are more reliable and stable.
Experimental example 3: Named entity recognition results of each recognition model in the method of this application
This experimental example uses the method provided by this application to calculate the precision, recall and F value of the first recognition result, the second recognition result and the third recognition result respectively. The results are shown in Table 3 below.
Table 3. Named entity recognition results of each recognition model in the method of this application
As Table 3 shows, the third recognition result, obtained according to the method provided by this application on the basis of the first recognition result and the second recognition result, improves considerably in precision, recall and F value. That is, the method provided by this application can cope with new conditions such as massive data scale, diversified entity types and a constant stream of new words, and achieves higher recall and precision.
The multi-strategy fusion named entity recognition method and recognition device provided by this application have the following beneficial effects:
(1) Through the preliminary recognition step, the scheme provided by this application can perform named entity recognition on new data or new domains, and thus meets the demand for named entity recognition when data scale is massive, entity types are diverse and new words keep emerging;
(2) The second recognition step fuses multiple named entity recognition models, combining several weak recognition models into one strong recognition model that supplements the first recognition result, thereby improving the precision and recall of the recognition result;
(3) The semantic mining system is used to label with roles the named entities obtained by the second recognition, yielding named entities with assigned roles;
(4) The method provided by this application can easily be migrated to new data and new domains;
(5) The method provided by this application has high precision and recall, and its F value can reach 0.8 or above.
The application has been described in detail above with reference to specific embodiments and illustrative examples, but these descriptions should not be construed as limiting the application. Those skilled in the art will understand that, without departing from the spirit and scope of the application, various equivalent substitutions, modifications or improvements can be made to the technical scheme and its embodiments, and all of these fall within the scope of the application. The scope of protection of the application is determined by the appended claims.

Claims (4)

1. A multi-strategy fusion named entity recognition method, characterized by comprising:
Obtaining a corpus;
Identifying the named entities in the corpus using a first recognition model to obtain a first recognition result;
Identifying the named entities in the corpus using a second recognition model to obtain a second recognition result;
Fusing the first recognition result and the second recognition result to obtain a third recognition result;
The step of identifying the named entities in the corpus using the second recognition model to obtain the second recognition result comprises:
Identifying the named entities in the corpus using at least two recognition models, each recognition model yielding one sub-recognition result, to generate a sub-recognition result list;
Judging whether the recognition results in the sub-recognition result list meet an output condition, and outputting the second recognition result if so;
The output condition is that, in the sub-recognition result list, the number of identical named entities reaches a preset value, wherein the preset value is the mode of the at least two recognition models;
The at least two recognition models comprise word segmentation models and named entity recognition models, wherein the word segmentation models comprise an N-gram segmentation model, an HMM segmentation model and a segmentation model with new word discovery, and the named entity recognition models comprise a maximum-entropy-based named entity recognition model and a structured-perceptron-based named entity recognition model;
The step of fusing the first recognition result and the second recognition result to obtain the third recognition result comprises:
Judging whether the first recognition result and the second recognition result meet a fusion condition, fusing them if so, and outputting the fused result, i.e. the third recognition result;
The fusion means adding, on the basis of the first recognition result, the named entities newly added in the second recognition result;
The fusion condition is that the second recognition result contains named entities that are new relative to the first recognition result;
After the third recognition result is obtained, the method further comprises:
Assigning roles to the third recognition result using a semantic mining system to generate named entities with roles, wherein
The role assignment consists of labelling each named entity in the third recognition result with a role using the semantic mining system and outputting each named entity together with its role;
The semantic mining system comprises regular expressions and text.
2. The recognition method according to claim 1, characterized in that
The first recognition model is a conditional random field model;
Before the step of identifying the named entities in the corpus using the first recognition model to obtain the first recognition result, the method further comprises:
Establishing a corpus;
Performing part-of-speech tagging and sequence labelling on the texts in the corpus;
Using the labelled texts as training data and training with a CRF toolkit to obtain the first recognition model.
3. A multi-strategy fusion named entity recognition device, characterized in that the named entity recognition device comprises:
A corpus acquiring unit, for obtaining a corpus;
A first recognition unit, for identifying the named entities in the corpus using a first recognition model to obtain a first recognition result;
A second recognition unit, for identifying the named entities in the corpus using a second recognition model to obtain a second recognition result;
A recognition result fusion unit, for fusing the first recognition result and the second recognition result to obtain a third recognition result;
The second recognition unit comprises the following subunits:
A multi-strategy recognition unit, for identifying the named entities in the corpus using at least two recognition models, each recognition model yielding one sub-recognition result, to generate a sub-recognition result list;
A recognition result output unit, for judging whether the recognition results in the sub-recognition result list meet an output condition, and outputting the second recognition result if so;
The output condition is that, in the sub-recognition result list, the number of identical named entities reaches a preset value, wherein the preset value is the mode of the at least two recognition models;
The at least two recognition models comprise word segmentation models and named entity recognition models, wherein the word segmentation models comprise an N-gram segmentation model, an HMM segmentation model and a segmentation model with new word discovery, and the named entity recognition models comprise a maximum-entropy-based named entity recognition model and a structured-perceptron-based named entity recognition model;
The recognition result fusion unit is used for judging whether the first recognition result and the second recognition result meet a fusion condition, fusing them if so, and outputting the fused result, i.e. the third recognition result;
The fusion means adding, on the basis of the first recognition result, the named entities newly added in the second recognition result;
The fusion condition is that the second recognition result contains named entities that are new relative to the first recognition result;
The named entity recognition device further comprises a role assignment unit, for assigning roles to the third recognition result using a semantic mining system to generate named entities with roles, wherein
The role assignment unit is used for labelling each named entity in the third recognition result with a role using the semantic mining system and outputting each named entity together with its role;
The semantic mining system comprises regular expressions and text.
4. The recognition device according to claim 3, characterized in that
The first recognition model is a conditional random field model;
The first recognition unit further comprises a model training unit, the model training unit being used for:
Establishing a corpus;
Performing part-of-speech tagging and sequence labelling on the texts in the corpus;
Using the labelled texts as training data and training with a CRF toolkit to obtain the first recognition model.
CN201710447439.2A 2017-06-14 2017-06-14 The recognition methods of the name entity of more strategy fusions and device Active CN107330011B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710447439.2A CN107330011B (en) 2017-06-14 2017-06-14 The recognition methods of the name entity of more strategy fusions and device

Publications (2)

Publication Number Publication Date
CN107330011A CN107330011A (en) 2017-11-07
CN107330011B true CN107330011B (en) 2019-03-26

Family

ID=60195026

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710447439.2A Active CN107330011B (en) 2017-06-14 2017-06-14 The recognition methods of the name entity of more strategy fusions and device

Country Status (1)

Country Link
CN (1) CN107330011B (en)

Also Published As

Publication number Publication date
CN107330011A (en) 2017-11-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Zhao Honghong

Inventor after: Wang Mengmeng

Inventor after: Jin Yaohong

Inventor after: Jiang Hongfei

Inventor after: Yang Kaicheng

Inventor after: Dong Mingtao

Inventor before: Zhao Honghong

Inventor before: Wang Mengmeng

Inventor before: Jin Yaohong

Inventor before: Jiang Hongfei

Inventor before: Yang Kaicheng

GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20190904

Address after: Room 630, 6th floor, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing

Patentee after: China Science and Technology (Beijing) Co., Ltd.

Address before: Room 601, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing

Co-patentee before: China Science and Technology (Beijing) Co., Ltd.

Patentee before: Beijing Shenzhou Taiyue Software Co., Ltd.

CP03 Change of name, title or address

Address after: 230000 zone B, 19th floor, building A1, 3333 Xiyou Road, hi tech Zone, Hefei City, Anhui Province

Patentee after: Dingfu Intelligent Technology Co., Ltd

Address before: Room 630, 6th floor, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing

Patentee before: DINFO (BEIJING) SCIENCE DEVELOPMENT Co.,Ltd.