CN105138515B

CN105138515B - Named entity recognition method and device

Info

Publication number: CN105138515B
Application number: CN201510556751.6A
Authority: CN
Inventors: 张涛
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2015-09-02
Filing date: 2015-09-02
Publication date: 2018-10-19
Anticipated expiration: 2035-09-02
Also published as: CN105138515A

Abstract

The present invention proposes a named entity recognition method and device. The method includes: pre-recognizing a text to be recognized according to preset rules to obtain the recognized initial named entities, the preset rules including a rule-dictionary-based approach and a statistical-model-based approach; determining the category to which the text to be recognized belongs; and obtaining combined texts according to the category and the initial named entities, and determining the final named entities according to the combined texts. The method achieves good recognition results even for ambiguous named entities and for named entities without distinctive features.

Description

Named entity recognition method and device

Technical field

The present invention relates to the technical field of natural language processing, and in particular to a named entity recognition method and device.

Background art

The main task of named entity recognition is to identify proper names in text, such as person names and place names. Traditional named entity recognition methods fall mainly into two groups: methods based on a rule dictionary and methods based on a statistical model. Rule-dictionary-based methods mainly build a large-scale entity dictionary offline and recognize entities by string matching. Statistical-model-based methods mainly build a statistical model and train it on manually annotated training corpora in order to recognize entities. However, a rule-dictionary-based method cannot recognize named entities outside the dictionary, and even for entities inside the dictionary it cannot resolve named entity ambiguity. Statistical-model-based methods perform poorly on named entities without distinctive features, such as song titles and film and television titles.

Summary of the invention

The present invention aims to solve, at least to some extent, one of the technical problems in the related art.

To this end, one object of the present invention is to propose a named entity recognition method that achieves good recognition results even for ambiguous named entities and for named entities without distinctive features.

Another object of the present invention is to propose a named entity recognition device.

To achieve the above objects, the named entity recognition method proposed by the embodiment of the first aspect of the present invention includes: pre-recognizing a text to be recognized according to preset rules to obtain the recognized initial named entities, the preset rules including a rule-dictionary-based approach and a statistical-model-based approach; determining the category to which the text to be recognized belongs; and obtaining combined texts according to the category and the initial named entities, and determining the final named entities according to the combined texts.

In the named entity recognition method proposed by the embodiment of the first aspect of the present invention, using both the rule-dictionary-based approach and the statistical-model-based approach during pre-recognition expands the range of initial named entities and solves the problem that a statistical model alone cannot recognize named entities without distinctive features. Classifying the text to be recognized resolves the named entity ambiguity that arises when a rule dictionary is used alone. The method therefore achieves good recognition results even for ambiguous named entities and for named entities without distinctive features.

To achieve the above objects, the named entity recognition device proposed by the embodiment of the second aspect of the present invention includes: a preprocessing module, configured to pre-recognize a text to be recognized according to preset rules to obtain the recognized initial named entities, the preset rules including a rule-dictionary-based approach and a statistical-model-based approach; a classification module, configured to determine the category to which the text to be recognized belongs; and a post-processing module, configured to obtain combined texts according to the category and the initial named entities, and to determine the final named entities according to the combined texts.

In the named entity recognition device proposed by the embodiment of the second aspect of the present invention, using both the rule-dictionary-based approach and the statistical-model-based approach during pre-recognition expands the range of initial named entities and solves the problem that a statistical model alone cannot recognize named entities without distinctive features. Classifying the text to be recognized resolves the named entity ambiguity that arises when a rule dictionary is used alone. The device therefore achieves good recognition results even for ambiguous named entities and for named entities without distinctive features.

Additional aspects and advantages of the present invention will be set forth in part in the following description, will in part become apparent from that description, or will be learned through practice of the invention.

Description of the drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:

Fig. 1 is a schematic flowchart of a named entity recognition method according to an embodiment of the present invention;

Fig. 2 is a schematic flowchart of a named entity recognition method according to another embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a named entity recognition device according to another embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a named entity recognition device according to another embodiment of the present invention.

Detailed description

Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote, throughout, the same or similar modules or modules with the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present invention and are not to be construed as limiting it. On the contrary, the embodiments of the present invention include all changes, modifications, and equivalents that fall within the spirit and scope of the appended claims.

Fig. 1 is a schematic flowchart of a named entity recognition method according to an embodiment of the present invention. The method includes:

S11: Pre-recognize a text to be recognized according to preset rules to obtain the recognized initial named entities, the preset rules including a rule-dictionary-based approach and a statistical-model-based approach.

The named entity recognition of this embodiment can be applied in any scenario that requires it, for example in speech synthesis. In speech synthesis, text processing is first performed on the input text; prosody prediction, acoustic parameter generation, and so on are then performed on the processed text to obtain the synthesized speech. Named entity recognition can serve as a basic step of the text processing.

In this embodiment, using both the rule-dictionary-based approach and the statistical-model-based approach yields as many named entities as possible, compared with using only one of them.

For example, the rule-dictionary-based approach relies on string matching and can recognize entities without distinctive features, such as song titles and film and television titles. This addresses the problem that a statistical model cannot obtain such named entities.

The statistical-model-based approach may use a conditional random field (CRF) model. It can recognize entities with distinctive features, such as person names and place names.

For example, suppose the text to be recognized is "好想听刘德华的忘情水" (roughly, "I really want to listen to Liu Dehua's 'Wang Qing Shui'"). The rule-dictionary-based approach can recognize the named entities "好想 (song title)", "刘德华 (singer name)", and "忘情水 (song title)", while the statistical-model-based approach can recognize the named entity "刘德华 (person name)".

The initial named entities obtained after pre-recognition are therefore: "好想 (song title)", "刘德华 (singer name)", "忘情水 (song title)", and "刘德华 (person name)".
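The pre-recognition step S11 can be sketched roughly as follows. This is only an illustration under stated assumptions: the example sentence is taken to be "好想听刘德华的忘情水", the dictionary contents are hypothetical, and `statistical_match` is a hard-coded stand-in for a trained CRF tagger rather than a real model.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    text: str      # surface string of the entity
    category: str  # e.g. "song", "singer", "per"
    start: int     # character offset in the input text
    end: int       # exclusive end offset


# Hypothetical rule dictionary: surface string -> entity categories.
RULE_DICTIONARY = {
    "好想": ["song"],
    "刘德华": ["singer"],
    "忘情水": ["song"],
}


def dictionary_match(text: str) -> List[Entity]:
    """Find every dictionary entry occurring in the text by string matching."""
    found = []
    for surface, categories in RULE_DICTIONARY.items():
        pos = text.find(surface)
        while pos != -1:
            for category in categories:
                found.append(Entity(surface, category, pos, pos + len(surface)))
            pos = text.find(surface, pos + 1)
    return found


def statistical_match(text: str) -> List[Entity]:
    """Stand-in for a trained CRF tagger; it only knows one person name."""
    name = "刘德华"
    pos = text.find(name)
    return [Entity(name, "per", pos, pos + len(name))] if pos != -1 else []


def pre_recognize(text: str) -> List[Entity]:
    """Union of both recognizers, ordered by position, without duplicates."""
    merged = set(dictionary_match(text)) | set(statistical_match(text))
    return sorted(merged, key=lambda e: (e.start, e.end, e.category))


initial_entities = pre_recognize("好想听刘德华的忘情水")
for entity in initial_entities:
    print(entity.text, entity.category)
```

Keeping both the "singer" and the "per" reading of the same span is deliberate in this sketch: the later classification and combination steps decide which reading survives.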

S12: Determine the category to which the text to be recognized belongs.

Text categories are predefined, for example: music, film and television, games, and so on.

The category can be determined from the recognized named entities and the textual information in the text to be recognized. Specifically, feature information is extracted from the recognized named entities and the textual information, and a maximum entropy text classification algorithm is applied to the feature information to determine the category to which the text belongs.

In this embodiment, the feature information includes: the words in the text to be recognized; the combination of an initial named entity's category with the word preceding the entity; and the combination of an initial named entity's category with the word following the entity.

In this embodiment, selecting a named entity together with the word before it and the word after it as feature information brings in the entity's context. This context is used to disambiguate the entity and addresses the ambiguity that a named entity taken in isolation may carry.

For example, for the text to be recognized above, the selected feature information includes: 好想, 听, 刘德华, 的, 忘情水, song_听, s_song, 听_singer, singer_的, 的_song, song_e, 听_per, per_的. Here song denotes a song title, singer a singer name, and per a person name; s stands for the word position before the start of the sentence, and e for the word position after the end of the sentence.
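The feature extraction described above can be sketched as follows. The token list assumes the example sentence "好想听刘德华的忘情水" has already been segmented into words, and the representation of entities as (token index, category) pairs is an assumption made for illustration only.

```python
from typing import List, Tuple

# Tokens of the example sentence; word segmentation is assumed to be done.
TOKENS = ["好想", "听", "刘德华", "的", "忘情水"]

# Initial named entities as (token index, category) pairs (hypothetical format).
INITIAL_ENTITIES: List[Tuple[int, str]] = [
    (0, "song"), (2, "singer"), (4, "song"), (2, "per"),
]


def extract_features(tokens: List[str], entities: List[Tuple[int, str]]) -> List[str]:
    """Words, plus previous-word_category and category_next-word combinations."""
    features = list(tokens)
    for index, category in entities:
        # "s" marks the position before the sentence, "e" the position after it.
        previous_word = tokens[index - 1] if index > 0 else "s"
        next_word = tokens[index + 1] if index + 1 < len(tokens) else "e"
        features.append(f"{previous_word}_{category}")
        features.append(f"{category}_{next_word}")
    return features


features = extract_features(TOKENS, INITIAL_ENTITIES)
print(features)
```

The resulting feature set matches the one listed in the text, although the order in which the combinations are emitted is a choice of this sketch.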

After the feature information has been obtained, the text category of the text to be recognized can be determined from the feature information and a preset text classification algorithm. Assuming the preset algorithm is the maximum entropy text classification algorithm, the category can be determined from the feature information above and that algorithm; for example, the text to be recognized above belongs to the music category.
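As a rough illustration of maximum entropy text classification over such features, the sketch below trains a tiny model in pure Python. A maximum entropy classifier with indicator features is equivalent to multinomial logistic regression; the training procedure (stochastic gradient ascent), the two labels, and the four training samples are all assumptions made here, since the text names only the algorithm.

```python
import math
from collections import defaultdict


def predict_proba(weights, features, labels):
    """Softmax over the summed feature weights of each label."""
    scores = {
        label: sum(weights.get((label, f), 0.0) for f in features)
        for label in labels
    }
    highest = max(scores.values())  # subtract the maximum for numerical stability
    exps = {label: math.exp(score - highest) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


def train_maxent(samples, labels, epochs=200, learning_rate=0.5):
    """Fit weights[(label, feature)] by gradient ascent on the log-likelihood."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, gold in samples:
            probabilities = predict_proba(weights, features, labels)
            for label in labels:
                # Gradient: observed indicator minus predicted probability.
                error = (1.0 if label == gold else 0.0) - probabilities[label]
                for feature in features:
                    weights[(label, feature)] += learning_rate * error
    return weights


def classify(weights, features, labels):
    probabilities = predict_proba(weights, features, labels)
    return max(probabilities, key=probabilities.get)


# Hypothetical labelled training samples (feature lists with their category).
LABELS = ["music", "video"]
TRAINING = [
    (["听", "的", "听_singer", "singer_的", "的_song", "song_e"], "music"),
    (["播放", "s_singer", "song_e"], "music"),
    (["看", "的", "看_actor", "actor_的", "的_movie", "movie_e"], "video"),
    (["播放", "s_actor", "movie_e"], "video"),
]

weights = train_maxent(TRAINING, LABELS)
query_features = [
    "好想", "听", "刘德华", "的", "忘情水", "song_听", "s_song",
    "听_singer", "singer_的", "的_song", "song_e", "听_per", "per_的",
]
category = classify(weights, query_features, LABELS)
print(category)
```

Features never seen in training, such as "好想" here, simply contribute a weight of zero, so the decision rests on the context features shared with the training samples.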

S13: Obtain combined texts according to the category and the initial named entities, and determine the final named entities according to the combined texts.

The combining may specifically include: obtaining the initial named entities that belong to the category, and combining them with the remaining words of the text to be recognized to obtain combined texts.

For example, when the category is determined to be music, the initial named entities belonging to the music category are obtained, for example: 好想 (song title), 刘德华 (singer name), and 忘情水 (song title). These initial named entities are then combined with the remaining words of the text to be recognized, which are "听" and "的". The resulting combined texts include "song听singer的忘情水", "好想听singer的song", "song听刘德华的song", and so on.
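One way to read the combination step is that each candidate entity is either replaced by its category label or left as ordinary words, and every such choice yields one combined text. The sketch below enumerates these choices; the enumeration strategy, the space-separated output, and the data layout are assumptions for illustration, and the example sentence is taken to be "好想听刘德华的忘情水".

```python
from itertools import product
from typing import Dict, List, Tuple

TOKENS = ["好想", "听", "刘德华", "的", "忘情水"]

# Token index -> category, for initial entities of the chosen category only.
MUSIC_ENTITIES: Dict[int, str] = {0: "song", 2: "singer", 4: "song"}


def combined_texts(tokens: List[str], entities: Dict[int, str]) -> List[Tuple[str, List[int]]]:
    """Return (combined text, indices treated as entities) for every choice."""
    positions = sorted(entities)
    results = []
    for choice in product([False, True], repeat=len(positions)):
        replaced = {p for p, use_label in zip(positions, choice) if use_label}
        parts = [
            entities[i] if i in replaced else token
            for i, token in enumerate(tokens)
        ]
        results.append((" ".join(parts), sorted(replaced)))
    return results


candidates = combined_texts(TOKENS, MUSIC_ENTITIES)
for text, kept in candidates:
    print(text, kept)
```

With three candidate entities this produces eight combined texts, including the three quoted in the example; whether the all-words and all-labels variants should be kept is left open here.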

After multiple combined texts such as those above have been obtained, each combined text can be analyzed to determine the final named entities. For example, a language model is used to judge which combined text most resembles a natural sentence, and the initial named entities in that combined text are taken as the final named entities. Specifically, a music-category training corpus can be mined offline; assuming the training corpus shows that "好想听singer的song" has the highest probability of occurrence, the final named entities are determined to be 刘德华 (singer name) and 忘情水 (song title).

In this embodiment, using both the rule-dictionary-based approach and the statistical-model-based approach during pre-recognition expands the range of initial named entities and solves the problem that a statistical model alone cannot recognize named entities without distinctive features. Classifying the text to be recognized resolves the named entity ambiguity that arises when a rule dictionary is used alone. The method therefore achieves good recognition results even for ambiguous named entities and for named entities without distinctive features.

Fig. 2 is a schematic flowchart of a named entity recognition method according to another embodiment of the present invention. The method includes:

S21: Pre-recognize a text to be recognized according to preset rules to obtain the recognized initial named entities, the preset rules including a rule-dictionary-based approach and a statistical-model-based approach.

The statistical-model-based approach may use a conditional random field (CRF) model. It can recognize entity types with relatively distinctive features, such as person names and place names.

As in conventional methods, the statistical-model-based approach may classify with the character as the basic unit. For example, a text (query) in the training corpus takes the following form:

从 O
马 LOC_S
鞍 LOC_M
山 LOC_E
到 O
宁 O
波 O
怎 O
么 O
走 O

LOC denotes a place name: LOC_S marks the first character of a place name, LOC_E marks its last character, and LOC_M marks a character in the middle of a place name.

Person names can be recognized by the statistical-model-based approach using labels analogous to those for place names.
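The start/middle/end labelling scheme can be turned back into entities with a small decoder. The sketch below assumes the example query is "从马鞍山到宁波怎么走" and uses the tag sequence exactly as listed above; the function name and the handling of malformed tag sequences (they are simply dropped) are choices of this sketch.

```python
from typing import List, Tuple


def decode_entities(characters: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect spans tagged X_S, X_M ..., X_E into (entity text, X) pairs."""
    entities = []
    current: List[str] = []
    kind = None
    for character, tag in zip(characters, tags):
        if tag.endswith("_S"):
            # A start tag opens a new span of the given kind.
            current, kind = [character], tag[:-2]
        elif tag.endswith("_M") and kind == tag[:-2]:
            current.append(character)
        elif tag.endswith("_E") and kind == tag[:-2]:
            current.append(character)
            entities.append(("".join(current), kind))
            current, kind = [], None
        else:
            # "O" or an inconsistent tag closes any open span without output.
            current, kind = [], None
    return entities


chars = list("从马鞍山到宁波怎么走")
tags = ["O", "LOC_S", "LOC_M", "LOC_E", "O", "O", "O", "O", "O", "O"]
places = decode_entities(chars, tags)
print(places)
```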

S22: Obtain feature information according to the initial named entities and the context information of the text to be recognized.

S23: Determine the category to which the text to be recognized belongs according to the feature information and a preset text classification algorithm.

After the feature information has been obtained, the category of the text to be recognized can be determined from the feature information and the preset text classification algorithm. Assuming the preset algorithm is the maximum entropy text classification algorithm, the category can be determined from the feature information above and that algorithm; for example, the text to be recognized above belongs to the music category.

S24: Obtain the initial named entities that belong to the category, and combine them with the remaining words of the text to be recognized to obtain combined texts.

S25: Obtain a pre-collected training corpus that belongs to the category.

For example, a large music-category training corpus is collected.

S26: Determine the probability of occurrence of each combined text according to the training corpus.

For example, count the number of occurrences in the training corpus of training texts having the form of each combined text: count the training texts of the form "song听singer的...", count those of the form "...听singer的song", and so on. Then divide the count for each form by the total number of occurrences of training texts of all forms to obtain the probability of occurrence of the corresponding combined text. For instance, if the total number of occurrences is M and the number of training texts of the form "song听singer的..." is N, then the probability of occurrence of "song听singer的忘情水" is N/M.

S27: Determine the initial named entities in the combined text with the highest probability of occurrence as the final named entities.

For example, if the combined text "好想听singer的song" has the highest probability of occurrence, the initial named entities in that combined text are the final named entities, that is, the final named entities are 刘德华 (singer name) and 忘情水 (song title).
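Steps S26 and S27 amount to a relative-frequency estimate followed by an argmax. The sketch below illustrates this with a hypothetical five-line music corpus that has already been reduced to text forms; it matches forms by exact equality, which simplifies the wildcard-style forms ("...") mentioned in the text, and the candidate list format is likewise an assumption.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical music-category training corpus, already reduced to text forms.
TRAINING_FORMS = [
    "好想 听 singer 的 song",
    "好想 听 singer 的 song",
    "好想 听 singer 的 song",
    "song 听 singer 的 忘情水",
    "播放 singer 的 song",
]

Entity = Tuple[str, str]


def occurrence_probability(form: str, training_forms: List[str]) -> float:
    """N / M: occurrences of this form divided by occurrences of all forms."""
    counts = Counter(training_forms)
    return counts[form] / len(training_forms)


def select_final(candidates: List[Tuple[str, List[Entity]]],
                 training_forms: List[str]) -> Tuple[str, List[Entity]]:
    """Pick the combined text with the highest probability of occurrence."""
    return max(
        candidates,
        key=lambda candidate: occurrence_probability(candidate[0], training_forms),
    )


# Combined texts paired with the entities that were replaced by labels.
CANDIDATES = [
    ("song 听 singer 的 忘情水", [("好想", "song"), ("刘德华", "singer")]),
    ("好想 听 singer 的 song", [("刘德华", "singer"), ("忘情水", "song")]),
    ("song 听 刘德华 的 song", [("好想", "song"), ("忘情水", "song")]),
]

best_text, final_entities = select_final(CANDIDATES, TRAINING_FORMS)
print(best_text, final_entities)
```

In this toy corpus the form "好想 听 singer 的 song" occurs three times out of five, so the entities it contains are selected, in line with the example in the text.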

In this embodiment, using both the rule-dictionary-based approach and the statistical-model-based approach during pre-recognition expands the range of initial named entities and solves the problem that a statistical model alone cannot recognize named entities without distinctive features. Selecting an initial named entity together with its preceding and following words as feature information brings in the context of the text to be recognized and resolves the named entity ambiguity that arises when a rule dictionary is used alone. The method therefore achieves good recognition results even for ambiguous named entities and for named entities without distinctive features.

Fig. 3 is a schematic structural diagram of a named entity recognition device according to another embodiment of the present invention. The device includes a preprocessing module 31, a classification module 32, and a post-processing module 33.

The preprocessing module 31 is configured to pre-recognize a text to be recognized according to preset rules to obtain the recognized initial named entities, the preset rules including a rule-dictionary-based approach and a statistical-model-based approach.

The statistical-model-based approach may use a conditional random field (CRF) model. It can recognize entity categories with distinctive features, such as person names and place names.

The classification module 32 is configured to determine the category to which the text to be recognized belongs.

In some embodiments, the classification module 32 is specifically configured to:

obtain feature information according to the initial named entities and the context information of the text to be recognized; and

determine the category to which the text to be recognized belongs according to the feature information and a preset text classification algorithm.

Optionally, the feature information includes:

the words in the text to be recognized; the combination of an initial named entity's category with the word preceding the entity; and the combination of an initial named entity's category with the word following the entity.

The category can be determined from the recognized named entities and the textual information in the text to be recognized. Specifically, feature information is extracted from the recognized named entities and the textual information, and a maximum entropy text classification algorithm is applied to the feature information to determine the category to which the text belongs.

After the feature information has been obtained, the text category of the text to be recognized can be determined from the feature information and the preset text classification algorithm. Assuming the preset algorithm is the maximum entropy text classification algorithm, the category can be determined from the feature information above and that algorithm; for example, the text to be recognized above belongs to the music category.

The post-processing module 33 is configured to obtain combined texts according to the category and the initial named entities, and to determine the final named entities according to the combined texts.

In some embodiments, referring to Fig. 4, the post-processing module 33 includes:

a first unit 331, configured to obtain the initial named entities that belong to the category, and to combine them with the remaining words of the text to be recognized to obtain combined texts; and

a second unit 332, configured to obtain a pre-collected training corpus that belongs to the category, determine the probability of occurrence of each combined text according to the training corpus, and determine the initial named entities in the combined text with the highest probability of occurrence as the final named entities.

The combining may specifically include: obtaining the initial named entities that belong to the category, and combining the initial named entities of that text category with the remaining words of the text to be recognized to obtain combined texts.
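The three-module structure of the device can be pictured as a simple pipeline object. The sketch below is only a structural illustration: the class name, the callable signatures, and the toy stand-in functions are all hypothetical and are not taken from the text.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Entity = Tuple[str, str]  # (surface text, category)


@dataclass
class NamedEntityRecognitionDevice:
    # Module 31: text -> initial named entities.
    preprocess: Callable[[str], List[Entity]]
    # Module 32: text and initial entities -> category of the text.
    classify: Callable[[str, List[Entity]], str]
    # Module 33: text, category, initial entities -> final named entities.
    postprocess: Callable[[str, str, List[Entity]], List[Entity]]

    def recognize(self, text: str) -> List[Entity]:
        initial = self.preprocess(text)
        category = self.classify(text, initial)
        return self.postprocess(text, category, initial)


# Toy stand-ins so the pipeline can be exercised end to end.
device = NamedEntityRecognitionDevice(
    preprocess=lambda text: [("刘德华", "singer"), ("刘德华", "per")],
    classify=lambda text, entities: "music",
    postprocess=lambda text, category, entities: [
        entity for entity in entities if entity[1] == "singer"
    ],
)
result = device.recognize("听刘德华")
print(result)
```

Passing the modules in as callables keeps the pipeline order fixed while leaving each module free to be replaced, for example by a real CRF tagger or a trained classifier.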

It should be noted that, in the description of the present invention, the terms "first", "second", and so on are used for descriptive purposes only and are not to be understood as indicating or implying relative importance. In addition, in the description of the present invention, unless otherwise stated, "multiple" means at least two.

Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present invention includes other implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention pertain.

It should be understood that the parts of the present invention can be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they can be implemented by any one or a combination of the following techniques well known in the art: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.

Those of ordinary skill in the art will understand that all or some of the steps of the above method embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination thereof.

In addition, the functional units in the embodiments of the present invention may be integrated in one processing module, each unit may exist physically on its own, or two or more units may be integrated in one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.

In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, such expressions do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described above, it is to be understood that the above embodiments are exemplary and are not to be construed as limiting the present invention. Those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.

Claims

1. A named entity recognition method, characterized by comprising:

pre-recognizing a text to be recognized according to preset rules to obtain the recognized initial named entities, the preset rules including a rule-dictionary-based approach and a statistical-model-based approach, wherein the rule dictionary approach is string matching and the statistical model is a conditional random field model;

determining the category to which the text to be recognized belongs;

wherein determining the category to which the text to be recognized belongs comprises:

determining the words of the text to be recognized and the initial named entity categories, and obtaining feature information according to the words of the text to be recognized and the initial named entity categories, wherein the feature information includes: the words in the text to be recognized; the combination of an initial named entity's category with the word preceding the entity; and the combination of an initial named entity's category with the word following the entity;

determining the category to which the text to be recognized belongs according to the feature information and a preset text classification algorithm;

obtaining combined texts according to the category and the initial named entities, and determining the final named entities according to the combined texts;

wherein obtaining combined texts according to the category and the initial named entities comprises:

obtaining the initial named entities that belong to the category, and combining them with the remaining words of the text to be recognized to obtain combined texts.

2. The method according to claim 1, characterized in that determining the final named entities according to the combined texts comprises:

obtaining a pre-collected training corpus that belongs to the category;

determining the probability of occurrence of each combined text according to the training corpus;

determining the initial named entities in the combined text with the highest probability of occurrence as the final named entities.

3. A named entity recognition device, characterized by comprising:

a preprocessing module, configured to pre-recognize a text to be recognized according to preset rules to obtain the recognized initial named entities, the preset rules including a rule-dictionary-based approach and a statistical-model-based approach, wherein the rule dictionary approach is string matching and the statistical model is a conditional random field model;

a classification module, configured to determine the category to which the text to be recognized belongs;

wherein the classification module is specifically configured to:

a post-processing module, configured to obtain combined texts according to the category and the initial named entities, and to determine the final named entities according to the combined texts;

the post-processing module includes:

a first unit, configured to obtain the initial named entities that belong to the category, and to combine them with the remaining words of the text to be recognized to obtain combined texts.

4. The device according to claim 3, characterized in that the post-processing module includes:

a second unit, configured to obtain a pre-collected training corpus that belongs to the category, determine the probability of occurrence of each combined text according to the training corpus, and determine the initial named entities in the combined text with the highest probability of occurrence as the final named entities.