CN105138515B - Name entity recognition method and device - Google Patents

Info

Publication number
CN105138515B
Authority
CN
China
Prior art keywords
text
name entity
identified
classification
name
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510556751.6A
Other languages
Chinese (zh)
Other versions
CN105138515A (en)
Inventor
张涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201510556751.6A
Publication of CN105138515A
Application granted
Publication of CN105138515B
Active
Anticipated expiration

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention proposes a named entity recognition method and device. The method includes: performing pre-identification on a text to be recognized according to preset rules to obtain initial named entities, where the preset rules include a rule-dictionary-based approach and a statistical-model-based approach; determining the category to which the text to be recognized belongs; and obtaining combined texts according to the category and the initial named entities, and determining the final named entities according to the combined texts. The method achieves good recognition results even for ambiguous named entities and for named entities without distinctive features.

Description

Named entity recognition method and device
Technical field
The present invention relates to the technical field of natural language processing, and in particular to a named entity recognition method and device.
Background
The main task of named entity recognition is to identify proper names such as person names and place names in text. Traditional named entity recognition methods fall into two groups: methods based on rule dictionaries and methods based on statistical models. A rule-dictionary-based method builds a large-scale entity dictionary offline and then recognizes entities by string matching. A statistical-model-based method builds a statistical model and trains it on manually annotated corpora before using it for recognition. However, a rule-dictionary-based method cannot recognize named entities that are absent from the dictionary, and even for entities in the dictionary it cannot resolve named entity ambiguity. A statistical-model-based method performs poorly on named entities without distinctive features, such as song titles and film and television titles.
Summary of the invention
The present invention aims to solve, at least to some extent, one of the technical problems in the related art.
To this end, one object of the present invention is to provide a named entity recognition method that achieves good recognition results even for ambiguous named entities and for named entities without distinctive features.
Another object of the present invention is to provide a named entity recognition device.
To achieve the above objects, the named entity recognition method proposed by embodiments of the first aspect of the present invention includes: performing pre-identification on a text to be recognized according to preset rules to obtain initial named entities, where the preset rules include a rule-dictionary-based approach and a statistical-model-based approach; determining the category to which the text to be recognized belongs; and obtaining combined texts according to the category and the initial named entities, and determining the final named entities according to the combined texts.
In the named entity recognition method proposed by embodiments of the first aspect of the present invention, using both the rule-dictionary-based approach and the statistical-model-based approach during pre-identification widens the range of initial named entities and solves the problem that a purely statistical-model-based approach cannot recognize named entities without distinctive features. Classifying the text to be recognized resolves the named entity ambiguity caused by a purely rule-dictionary-based approach. Good recognition results are therefore obtained even for ambiguous named entities and for named entities without distinctive features.
To achieve the above objects, the named entity recognition device proposed by embodiments of the second aspect of the present invention includes: a preprocessing module, configured to perform pre-identification on a text to be recognized according to preset rules to obtain initial named entities, where the preset rules include a rule-dictionary-based approach and a statistical-model-based approach; a classification module, configured to determine the category to which the text to be recognized belongs; and a post-processing module, configured to obtain combined texts according to the category and the initial named entities, and to determine the final named entities according to the combined texts.
In the named entity recognition device proposed by embodiments of the second aspect of the present invention, using both the rule-dictionary-based approach and the statistical-model-based approach during pre-identification widens the range of initial named entities and solves the problem that a purely statistical-model-based approach cannot recognize named entities without distinctive features. Classifying the text to be recognized resolves the named entity ambiguity caused by a purely rule-dictionary-based approach. Good recognition results are therefore obtained even for ambiguous named entities and for named entities without distinctive features.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the invention.
Description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flow diagram of the named entity recognition method proposed by one embodiment of the present invention;
Fig. 2 is a flow diagram of the named entity recognition method proposed by another embodiment of the present invention;
Fig. 3 is a structural diagram of the named entity recognition device proposed by another embodiment of the present invention;
Fig. 4 is a structural diagram of the named entity recognition device proposed by another embodiment of the present invention.
Detailed description
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar modules, or modules having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present invention and should not be construed as limiting it. On the contrary, the embodiments of the present invention cover all changes, modifications and equivalents falling within the spirit and scope of the appended claims.
Fig. 1 is a flow diagram of the named entity recognition method proposed by one embodiment of the present invention. The method includes:
S11: Perform pre-identification on the text to be recognized according to preset rules to obtain the initial named entities. The preset rules include a rule-dictionary-based approach and a statistical-model-based approach.
The named entity recognition of this embodiment can be applied in a variety of scenarios, for example in speech synthesis. In speech synthesis, the input text first undergoes text processing; prosody prediction, acoustic parameter generation and so on are then performed on the processed text to obtain the synthesized speech. Named entity recognition can serve as a basic step of the text processing.
In this embodiment, using both the rule-dictionary-based approach and the statistical-model-based approach yields as many named entities as possible, compared with using only one of the two.
For example, the rule-dictionary-based approach relies on string matching and can recognize entities without distinctive features, such as song titles and film and television titles. This addresses the problem that the statistical-model-based approach cannot obtain such entities.
The statistical-model-based approach may use a conditional random field (CRF) model. It can recognize entities with distinctive features, such as person names and place names.
For example, suppose the text to be recognized is "好想听刘德华的忘情水" ("really want to listen to Liu Dehua's 忘情水"). The rule-dictionary-based approach may recognize the named entities "好想 (song title)", "刘德华 (singer name)" and "忘情水 (song title)", while the statistical-model-based approach may recognize "刘德华 (person name)".
Therefore, the initial named entities obtained after pre-identification are: "好想 (song title)", "刘德华 (singer name)", "忘情水 (song title)" and "刘德华 (person name)".
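The pre-identification step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the lexicon contents, the function names `dictionary_match` and `pre_identify`, and the hard-coded stand-in for the statistical model's output are all hypothetical. The example sentence is written as 好想听刘德华的忘情水, which is assumed to be the original Chinese of the sentence discussed above.

```python
from typing import Dict, List, Tuple

# An entity is (start, end, surface, label); end is exclusive.
Entity = Tuple[int, int, str, str]


def dictionary_match(text: str, lexicon: Dict[str, List[str]]) -> List[Entity]:
    """Return every lexicon entry found in text by plain substring matching."""
    found = []
    for surface, labels in lexicon.items():
        start = text.find(surface)
        while start != -1:
            for label in labels:
                found.append((start, start + len(surface), surface, label))
            start = text.find(surface, start + 1)
    return sorted(found)


def pre_identify(text: str, lexicon: Dict[str, List[str]],
                 model_entities: List[Entity]) -> List[Entity]:
    """Union of dictionary hits and statistical-model hits, without duplicates."""
    merged = set(dictionary_match(text, lexicon)) | set(model_entities)
    return sorted(merged)


# Hypothetical lexicon; a real system would load a large offline dictionary.
LEXICON = {"好想": ["song"], "刘德华": ["singer"], "忘情水": ["song"]}
text = "好想听刘德华的忘情水"
# Hard-coded stand-in for the output of a trained CRF tagger (assumption).
model_entities = [(3, 6, "刘德华", "per")]

initial_entities = pre_identify(text, LEXICON, model_entities)
for entity in initial_entities:
    print(entity)
```

Keeping overlapping and conflicting candidates at this stage is deliberate: the later classification and combination steps are what decide between readings such as singer name and person name.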
S12: Determine the category to which the text to be recognized belongs.
Text categories are predefined, for example: music, film and television, games, and so on.
The text category can be determined from the recognized named entities and the textual information in the text to be recognized. Specifically, feature information is extracted from the recognized named entities and the textual information, and a maximum entropy text classification algorithm then uses the feature information to determine the category to which the text belongs.
In this embodiment, the feature information includes: the words in the text to be recognized, the combination of each initial named entity's category with the word preceding it, and the combination of each initial named entity's category with the word following it.
In this embodiment, taking a named entity together with the words immediately before and after it as feature information allows the entity's context to be used for disambiguation, which addresses the problem that a named entity on its own may be ambiguous.
For example, for the text above, "好想听刘德华的忘情水" ("really want to listen to Liu Dehua's 忘情水"), the selected feature information includes: 好想, 听, 刘德华, 的, 忘情水, song_听, s_song, 听_singer, singer_的, 的_song, song_e, 听_per, per_的. Here song denotes a song title, singer a singer name, and per a person name; s stands for the word before the start of the sentence and e for the word after the end of the sentence.
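The feature template described above can be sketched as follows. This is an illustrative sketch only: the tokenization of the example sentence (written in its assumed original Chinese, 好想听刘德华的忘情水) and the function name `extract_features` are assumptions, while the labels song, singer and per and the boundary markers s and e follow the text.

```python
from typing import Dict, List


def extract_features(tokens: List[str], labels: Dict[int, List[str]]) -> List[str]:
    """Build the features: every word, plus prev-word_label and label_next-word
    for each candidate entity label."""
    features = list(tokens)  # the words of the text themselves
    for i in range(len(tokens)):
        prev_word = tokens[i - 1] if i > 0 else "s"               # sentence start
        next_word = tokens[i + 1] if i + 1 < len(tokens) else "e"  # sentence end
        for label in labels.get(i, []):
            features.append(f"{prev_word}_{label}")
            features.append(f"{label}_{next_word}")
    return features


# Assumed segmentation of the example sentence.
tokens = ["好想", "听", "刘德华", "的", "忘情水"]
# Candidate labels per token index, taken from the pre-identification result.
labels = {0: ["song"], 2: ["singer", "per"], 4: ["song"]}

features = extract_features(tokens, labels)
print(features)
```

The resulting feature list would then be fed to a text classifier; the text names a maximum entropy classifier, which is not implemented in this sketch.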
After the feature information is obtained, the category to which the text to be recognized belongs can be determined from the feature information and a preset text classification algorithm. Assuming the preset algorithm is a maximum entropy text classification algorithm, the category is determined from the above feature information with that algorithm; for example, the text above belongs to the music category.
S13: Obtain combined texts according to the category and the initial named entities, and determine the final named entities according to the combined texts.
The combination may specifically include: obtaining the initial named entities that belong to the category, and combining them with the remaining words in the text to be recognized to obtain the combined texts.
For example, when the category is determined to be music, the initial named entities belonging to the music category are obtained from the example text "好想听刘德华的忘情水": 好想 (song title), 刘德华 (singer name) and 忘情水 (song title). These initial named entities are then combined with the remaining words of the text to be recognized, which are "听" ("listen to") and "的" (a possessive particle). The resulting combined texts include "song 听 singer 的 忘情水", "好想 听 singer 的 song", "song 听 刘德华 的 song", and so on.
After multiple combined texts such as those above are obtained, each can be analyzed to determine the final named entities. For example, a language model can be used to judge which combined text reads most like a natural sentence, and the initial named entities in that combined text are taken as the final named entities. Specifically, a training corpus for the music category can be mined offline; if the training corpus shows that "好想 听 singer 的 song" has the highest probability of occurrence, the final named entities are determined to be 刘德华 (singer name) and 忘情水 (song title).
In this embodiment, using both the rule-dictionary-based approach and the statistical-model-based approach during pre-identification widens the range of initial named entities and solves the problem that a purely statistical-model-based approach cannot recognize named entities without distinctive features. Classifying the text to be recognized resolves the named entity ambiguity caused by a purely rule-dictionary-based approach. Good recognition results are therefore obtained even for ambiguous named entities and for named entities without distinctive features.
Fig. 2 is a flow diagram of the named entity recognition method proposed by another embodiment of the present invention. The method includes:
S21: Perform pre-identification on the text to be recognized according to preset rules to obtain the initial named entities. The preset rules include a rule-dictionary-based approach and a statistical-model-based approach.
The named entity recognition of this embodiment can be applied in a variety of scenarios, for example in speech synthesis. In speech synthesis, the input text first undergoes text processing; prosody prediction, acoustic parameter generation and so on are then performed on the processed text to obtain the synthesized speech. Named entity recognition can serve as a basic step of the text processing.
In this embodiment, using both the rule-dictionary-based approach and the statistical-model-based approach yields as many named entities as possible, compared with using only one of the two.
For example, the rule-dictionary-based approach relies on string matching and can recognize entities without distinctive features, such as song titles and film and television titles. This addresses the problem that the statistical-model-based approach cannot obtain such entities.
The statistical-model-based approach may use a conditional random field (CRF) model. It can recognize entity types with relatively distinctive features, such as person names and place names.
The statistical-model-based approach may, as in conventional methods, classify with the character as the basic unit. For example, one text (query) in the training corpus, "从马鞍山到宁波怎么走" ("How do I get from Ma'anshan to Ningbo?"), has the following form:
从 O
马 LOC_S
鞍 LOC_M
山 LOC_E
到 O
宁 O
波 O
怎 O
么 O
走 O
LOC denotes a place name: LOC_S marks the first character of a place name, LOC_M a middle character, and LOC_E the last character.
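To make the tagging scheme concrete, the following sketch turns a per-character tag sequence into entity spans. It is an illustration only: the function name `decode_entities` is hypothetical, the query is written in its assumed original Chinese, and the handling of malformed sequences (dropping them) is an assumption, since the text does not say how a tagger's output is decoded.

```python
from typing import List, Optional, Tuple


def decode_entities(chars: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect spans tagged X_S, X_M ..., X_E into (surface, X) pairs.

    Tags follow the scheme in the text: O for non-entity characters,
    and _S / _M / _E for the start, middle and end of an entity.
    """
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type: Optional[str] = None
    for ch, tag in zip(chars, tags):
        if tag == "O":
            current, current_type = [], None
            continue
        etype, _, position = tag.rpartition("_")
        if position == "S":
            current, current_type = [ch], etype      # open a new span
        elif current_type == etype and position in ("M", "E"):
            current.append(ch)
            if position == "E":                      # close the span
                entities.append(("".join(current), etype))
                current, current_type = [], None
        else:
            current, current_type = [], None         # malformed: drop (assumption)
    return entities


chars = list("从马鞍山到宁波怎么走")
tags = ["O", "LOC_S", "LOC_M", "LOC_E", "O", "O", "O", "O", "O", "O"]
entities = decode_entities(chars, tags)
print(entities)
```

The same decoder works for other entity types tagged in the same way, for instance a hypothetical PER_S / PER_M / PER_E scheme for person names.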
Person names can likewise be recognized with the statistical-model-based approach, using annotation similar to that for place names.
For example, suppose the text to be recognized is "好想听刘德华的忘情水" ("really want to listen to Liu Dehua's 忘情水"). The rule-dictionary-based approach may recognize the named entities "好想 (song title)", "刘德华 (singer name)" and "忘情水 (song title)", while the statistical-model-based approach may recognize "刘德华 (person name)".
Therefore, the initial named entities obtained after pre-identification are: "好想 (song title)", "刘德华 (singer name)", "忘情水 (song title)" and "刘德华 (person name)".
S22: Obtain feature information according to the initial named entities and the context information of the text to be recognized.
Text categories are predefined, for example: music, film and television, games, and so on.
The text category can be determined from the recognized named entities and the textual information in the text to be recognized. Specifically, feature information is extracted from the recognized named entities and the textual information, and a maximum entropy text classification algorithm then uses the feature information to determine the category to which the text belongs.
In this embodiment, the feature information includes: the words in the text to be recognized, the combination of each initial named entity's category with the word preceding it, and the combination of each initial named entity's category with the word following it.
In this embodiment, taking a named entity together with the words immediately before and after it as feature information allows the entity's context to be used for disambiguation, which addresses the problem that a named entity on its own may be ambiguous.
For example, for the text above, "好想听刘德华的忘情水" ("really want to listen to Liu Dehua's 忘情水"), the selected feature information includes: 好想, 听, 刘德华, 的, 忘情水, song_听, s_song, 听_singer, singer_的, 的_song, song_e, 听_per, per_的. Here song denotes a song title, singer a singer name, and per a person name; s stands for the word before the start of the sentence and e for the word after the end of the sentence.
S23: Determine the category to which the text to be recognized belongs according to the feature information and a preset text classification algorithm.
After the feature information is obtained, the category to which the text to be recognized belongs can be determined from the feature information and the preset text classification algorithm. Assuming the preset algorithm is a maximum entropy text classification algorithm, the category is determined from the above feature information with that algorithm; for example, the text above belongs to the music category.
S24: Obtain the initial named entities that belong to the category, and combine them with the remaining words in the text to be recognized to obtain the combined texts.
For example, when the category is determined to be music, the initial named entities belonging to the music category are obtained from the example text "好想听刘德华的忘情水": 好想 (song title), 刘德华 (singer name) and 忘情水 (song title). These initial named entities are then combined with the remaining words of the text to be recognized, which are "听" ("listen to") and "的" (a possessive particle). The resulting combined texts include "song 听 singer 的 忘情水", "好想 听 singer 的 song", "song 听 刘德华 的 song", and so on.
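One way to enumerate such combined texts is sketched below. It is a simplified illustration under assumptions: the function name `combined_texts` is hypothetical, the input is assumed to be already segmented and filtered to the chosen category (so the person-name reading of 刘德华 is absent), and every non-empty subset of candidate entities is replaced by its label, which may generate more candidates than the patent intends.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

# (surface, label); label is None for words that are not candidate entities.
Token = Tuple[str, Optional[str]]


def combined_texts(tokens: List[Token]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each combined text to the entities that were replaced by labels."""
    slots = [i for i, (_, label) in enumerate(tokens) if label is not None]
    results: Dict[str, List[Tuple[str, str]]] = {}
    for mask in product([False, True], repeat=len(slots)):
        if not any(mask):
            continue  # skip the reading in which nothing is an entity
        chosen = {i for i, use in zip(slots, mask) if use}
        parts = [label if i in chosen else surface
                 for i, (surface, label) in enumerate(tokens)]
        results[" ".join(parts)] = [tokens[i] for i in sorted(chosen)]
    return results


# Example sentence with music-category candidates only.
tokens = [("好想", "song"), ("听", None), ("刘德华", "singer"),
          ("的", None), ("忘情水", "song")]

texts = combined_texts(tokens)
for text, entities in texts.items():
    print(text, entities)
```

Each key is one candidate reading of the sentence; the subsequent steps score these readings against a category-specific training corpus.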
S25: Obtain a pre-collected training corpus belonging to the category.
For example, collect a large training corpus for the music category.
S26: Determine the probability of occurrence of each combined text according to the training corpus.
For example, count the number of training texts in the corpus that match the form of each combined text: count the training texts of the form "song 听 singer 的 ...", count those of the form "... 听 singer 的 song", and so on. Then divide the count for each form by the total count of training texts over all forms to obtain the probability of occurrence of the corresponding combined text. If the total count is M and the count for the form "song 听 singer 的 ..." is N, the probability of occurrence of "song 听 singer 的 忘情水" is N/M.
S27: Determine the initial named entities in the combined text with the highest probability of occurrence as the final named entities.
For example, if the combined text "好想 听 singer 的 song" has the highest probability of occurrence, the initial named entities in that combined text are the final named entities, namely 刘德华 (singer name) and 忘情水 (song title).
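Steps S26 and S27 can be sketched as a frequency count followed by an arg-max. This is a minimal illustration: the function names are hypothetical, the corpus is assumed to be already normalized into forms (with "..." standing for an unlabeled span), and the counts are invented purely for the example rather than taken from any real corpus.

```python
from collections import Counter
from typing import Dict, List, Tuple


def form_probabilities(corpus_forms: List[str]) -> Dict[str, float]:
    """Probability of each form: its count N divided by the total count M."""
    counts = Counter(corpus_forms)
    total = sum(counts.values())  # M
    return {form: n / total for form, n in counts.items()}  # N / M


def select_final_entities(candidates: Dict[str, List[Tuple[str, str]]],
                          probabilities: Dict[str, float]) -> List[Tuple[str, str]]:
    """Return the entities of the candidate form with the highest probability."""
    best_form = max(candidates, key=lambda form: probabilities.get(form, 0.0))
    return candidates[best_form]


# Invented music-category corpus, already reduced to forms (assumption).
corpus_forms = (["... 听 singer 的 song"] * 6
                + ["song 听 ... 的 song"] * 3
                + ["song 听 singer 的 ..."] * 1)

# Candidate forms for the example sentence and the entities each one implies.
candidates = {
    "song 听 singer 的 ...": [("好想", "song"), ("刘德华", "singer")],
    "... 听 singer 的 song": [("刘德华", "singer"), ("忘情水", "song")],
    "song 听 ... 的 song": [("好想", "song"), ("忘情水", "song")],
}

probabilities = form_probabilities(corpus_forms)
final_entities = select_final_entities(candidates, probabilities)
print(final_entities)
```

Forms that never occur in the corpus are given probability 0 here; a production system would more likely use a smoothed language model, which the text also mentions as an option.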
In this embodiment, using both the rule-dictionary-based approach and the statistical-model-based approach during pre-identification widens the range of initial named entities and solves the problem that a purely statistical-model-based approach cannot recognize named entities without distinctive features. By selecting each initial named entity together with the words immediately before and after it as feature information, the context of the text to be recognized can be taken into account, which resolves the named entity ambiguity caused by a purely rule-dictionary-based approach. Good recognition results are therefore obtained even for ambiguous named entities and for named entities without distinctive features.
Fig. 3 is a structural diagram of the named entity recognition device proposed by another embodiment of the present invention. The device includes a preprocessing module 31, a classification module 32 and a post-processing module 33.
The preprocessing module 31 is configured to perform pre-identification on the text to be recognized according to preset rules to obtain the initial named entities. The preset rules include a rule-dictionary-based approach and a statistical-model-based approach.
The named entity recognition of this embodiment can be applied in a variety of scenarios, for example in speech synthesis. In speech synthesis, the input text first undergoes text processing; prosody prediction, acoustic parameter generation and so on are then performed on the processed text to obtain the synthesized speech. Named entity recognition can serve as a basic step of the text processing.
In this embodiment, using both the rule-dictionary-based approach and the statistical-model-based approach yields as many named entities as possible, compared with using only one of the two.
For example, the rule-dictionary-based approach relies on string matching and can recognize entities without distinctive features, such as song titles and film and television titles. This addresses the problem that the statistical-model-based approach cannot obtain such entities.
The statistical-model-based approach may use a conditional random field (CRF) model. It can recognize entity types with distinctive features, such as person names and place names.
For example, suppose the text to be recognized is "好想听刘德华的忘情水" ("really want to listen to Liu Dehua's 忘情水"). The rule-dictionary-based approach may recognize the named entities "好想 (song title)", "刘德华 (singer name)" and "忘情水 (song title)", while the statistical-model-based approach may recognize "刘德华 (person name)".
Therefore, the initial named entities obtained after pre-identification are: "好想 (song title)", "刘德华 (singer name)", "忘情水 (song title)" and "刘德华 (person name)".
The classification module 32 is configured to determine the category to which the text to be recognized belongs.
In some embodiments, the classification module 32 is specifically configured to:
obtain feature information according to the initial named entities and the context information of the text to be recognized; and
determine the category to which the text to be recognized belongs according to the feature information and a preset text classification algorithm.
Optionally, the feature information includes:
the words in the text to be recognized, the combination of each initial named entity's category with the word preceding it, and the combination of each initial named entity's category with the word following it.
Text categories are predefined, for example: music, film and television, games, and so on.
The category can be determined from the recognized named entities and the textual information in the text to be recognized. Specifically, feature information is extracted from the recognized named entities and the textual information, and a maximum entropy text classification algorithm then uses the feature information to determine the category to which the text belongs.
In this embodiment, the feature information includes: the words in the text to be recognized, the combination of each initial named entity's category with the word preceding it, and the combination of each initial named entity's category with the word following it.
In this embodiment, taking a named entity together with the words immediately before and after it as feature information allows the entity's context to be used for disambiguation, which addresses the problem that a named entity on its own may be ambiguous.
For example, for the text above, "好想听刘德华的忘情水" ("really want to listen to Liu Dehua's 忘情水"), the selected feature information includes: 好想, 听, 刘德华, 的, 忘情水, song_听, s_song, 听_singer, singer_的, 的_song, song_e, 听_per, per_的. Here song denotes a song title, singer a singer name, and per a person name; s stands for the word before the start of the sentence and e for the word after the end of the sentence.
After the feature information is obtained, the category to which the text to be recognized belongs can be determined from the feature information and a preset text classification algorithm. Assuming the preset algorithm is a maximum entropy text classification algorithm, the category is determined from the above feature information with that algorithm; for example, the text above belongs to the music category.
The post-processing module 33 is configured to obtain combined texts according to the category and the initial named entities, and to determine the final named entities according to the combined texts.
In some embodiments, referring to Fig. 4, the post-processing module 33 includes:
a first unit 331, configured to obtain the initial named entities that belong to the category and combine them with the remaining words in the text to be recognized to obtain the combined texts.
In some embodiments, referring to Fig. 4, the post-processing module 33 includes:
a second unit 332, configured to obtain a pre-collected training corpus belonging to the category, determine the probability of occurrence of each combined text according to the training corpus, and determine the initial named entities in the combined text with the highest probability of occurrence as the final named entities.
The combination may specifically include: obtaining the initial named entities that belong to the category, and combining them with the remaining words in the text to be recognized to obtain the combined texts.
For example, when the category is determined to be music, the initial named entities belonging to the music category are obtained from the example text "好想听刘德华的忘情水": 好想 (song title), 刘德华 (singer name) and 忘情水 (song title). These initial named entities are then combined with the remaining words of the text to be recognized, which are "听" ("listen to") and "的" (a possessive particle). The resulting combined texts include "song 听 singer 的 忘情水", "好想 听 singer 的 song", "song 听 刘德华 的 song", and so on.
After multiple combined texts such as those above are obtained, each can be analyzed to determine the final named entities. For example, a language model can be used to judge which combined text reads most like a natural sentence, and the initial named entities in that combined text are taken as the final named entities. Specifically, a training corpus for the music category can be mined offline; if the training corpus shows that "好想 听 singer 的 song" has the highest probability of occurrence, the final named entities are determined to be 刘德华 (singer name) and 忘情水 (song title).
In this embodiment, using both the rule-dictionary-based approach and the statistical-model-based approach during pre-identification widens the range of initial named entities and solves the problem that a purely statistical-model-based approach cannot recognize named entities without distinctive features. Classifying the text to be recognized resolves the named entity ambiguity caused by a purely rule-dictionary-based approach. Good recognition results are therefore obtained even for ambiguous named entities and for named entities without distinctive features.
It should be noted that in the description of the present invention, term " first ", " second " etc. are used for description purposes only, without It can be interpreted as indicating or implying relative importance.In addition, in the description of the present invention, unless otherwise indicated, the meaning of " multiple " Refer at least two.
Any process described otherwise above or method description are construed as in flow chart or herein, and expression includes It is one or more for realizing specific logical function or process the step of executable instruction code module, segment or portion Point, and the range of the preferred embodiment of the present invention includes other realization, wherein can not press shown or discuss suitable Sequence, include according to involved function by it is basic simultaneously in the way of or in the opposite order, to execute function, this should be of the invention Embodiment person of ordinary skill in the field understood.
It should be understood that each part of the present invention can be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented by software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they can be implemented by any one or a combination of the following techniques well known in the art: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Those of ordinary skill in the art can understand that all or part of the steps carried by the methods of the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, includes one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module can be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and shall not be construed as limiting the present invention. Those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the present invention.

Claims (4)

1. A named entity recognition method, characterized by comprising:
performing pre-identification on a text to be identified according to preset rules to obtain identified initial named entities, wherein the preset rules include a rule-dictionary-based approach and a statistical-model-based approach, the rule dictionary approach being string matching and the statistical model being a conditional random field model;
determining the category to which the text to be identified belongs;
wherein determining the category to which the text to be identified belongs comprises:
determining the words of the text to be identified and the initial named entity categories, and obtaining feature information according to the words of the text to be identified and the initial named entity categories, wherein the feature information includes: the words in the text to be identified, the combination of an initial named entity category with the word preceding it, and the combination of an initial named entity category with the word following it;
determining the category to which the text to be identified belongs according to the feature information and a preset text classification algorithm;
obtaining combined texts according to the category and the initial named entities, and determining the final named entities according to the combined texts;
wherein obtaining combined texts according to the category and the initial named entities comprises:
obtaining the initial named entities belonging to the category, and combining the initial named entities belonging to the category with the remaining words in the text to be identified to obtain the combined texts.
2. The method according to claim 1, characterized in that determining the final named entities according to the combined texts comprises:
obtaining a pre-collected training corpus belonging to the category;
determining the probability of occurrence of each combined text according to the training corpus;
determining the initial named entities in the combined text with the highest probability of occurrence as the final named entities.
3. A named entity recognition device, characterized by comprising:
a preprocessing module, configured to perform pre-identification on a text to be identified according to preset rules to obtain identified initial named entities, wherein the preset rules include a rule-dictionary-based approach and a statistical-model-based approach, the rule dictionary approach being string matching and the statistical model being a conditional random field model;
a classification module, configured to determine the category to which the text to be identified belongs;
wherein the classification module is specifically configured to:
determine the words of the text to be identified and the initial named entity categories, and obtain feature information according to the words of the text to be identified and the initial named entity categories, wherein the feature information includes: the words in the text to be identified, the combination of an initial named entity category with the word preceding it, and the combination of an initial named entity category with the word following it;
determine the category to which the text to be identified belongs according to the feature information and a preset text classification algorithm;
a post-processing module, configured to obtain combined texts according to the category and the initial named entities, and to determine the final named entities according to the combined texts;
wherein the post-processing module includes:
a first unit, configured to obtain the initial named entities belonging to the category, and to combine the initial named entities belonging to the category with the remaining words in the text to be identified to obtain the combined texts.
4. The device according to claim 3, characterized in that the post-processing module includes:
a second unit, configured to obtain a pre-collected training corpus belonging to the category, to determine the probability of occurrence of each combined text according to the training corpus, and to determine the initial named entities in the combined text with the highest probability of occurrence as the final named entities.
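The feature information and the combined texts recited in the claims above can be pictured with a short sketch. It assumes the text has already been split into tokens, that each initial named entity occupies one token position, and that a combined text is formed by optionally replacing an entity of the determined category with a placeholder such as `[song]`. The `word+[category]` feature notation, the placeholder format, and the function names are illustrative assumptions, and the text classification algorithm itself is not shown.

```python
from itertools import product

def extract_features(tokens, entity_categories):
    """Build the three kinds of feature: words, category+previous word,
    and category+next word.

    entity_categories maps a token index to the candidate categories of
    the initial named entity found at that position.
    """
    features = list(tokens)  # the words of the text themselves
    for index, categories in sorted(entity_categories.items()):
        for category in categories:
            if index > 0:
                features.append(f"{tokens[index - 1]}+[{category}]")
            if index + 1 < len(tokens):
                features.append(f"[{category}]+{tokens[index + 1]}")
    return features

def build_combined_texts(tokens, entity_categories, kept_categories):
    """Combine entities of the kept categories with the remaining words.

    Each entity position is either left as its surface form or replaced
    by a placeholder for one of its categories that belongs to the
    determined text category.
    """
    options = []
    for index, token in enumerate(tokens):
        labels = [
            f"[{c}]"
            for c in entity_categories.get(index, ())
            if c in kept_categories
        ]
        options.append([token, *labels])
    return [tuple(choice) for choice in product(*options)]

# Hypothetical pre-identification result for one sentence.
tokens = ["want", "to", "listen", "to", "lustily water", "by", "Liu Dehua"]
entity_categories = {4: ["song"], 6: ["singer", "actor"]}

features = extract_features(tokens, entity_categories)
# Suppose the classifier decided the text belongs to the music category,
# so only the music-related entity categories are kept.
combined = build_combined_texts(
    tokens, entity_categories, kept_categories={"singer", "song"}
)
```

With two entity positions and one kept category each, the sketch yields four combined texts, among them the fully labeled one. Their probabilities of occurrence would then be compared against a training corpus for the category, as recited in claims 2 and 4.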
CN201510556751.6A 2015-09-02 2015-09-02 Name entity recognition method and device Active CN105138515B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510556751.6A CN105138515B (en) 2015-09-02 2015-09-02 Name entity recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510556751.6A CN105138515B (en) 2015-09-02 2015-09-02 Name entity recognition method and device

Publications (2)

Publication Number Publication Date
CN105138515A CN105138515A (en) 2015-12-09
CN105138515B true CN105138515B (en) 2018-10-19

Family

ID=54723866

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510556751.6A Active CN105138515B (en) 2015-09-02 2015-09-02 Name entity recognition method and device

Country Status (1)

Country Link
CN (1) CN105138515B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570179B (en) * 2016-11-10 2019-11-19 中国科学院信息工程研究所 A kind of kernel entity recognition methods and device towards evaluation property text
CN108090039A (en) * 2016-11-21 2018-05-29 中移(苏州)软件技术有限公司 A kind of name recognition methods and device
CN107133259A (en) * 2017-03-22 2017-09-05 北京晓数聚传媒科技有限公司 A kind of searching method and device
CN108304424B (en) * 2017-03-30 2021-09-07 腾讯科技(深圳)有限公司 Text keyword extraction method and text keyword extraction device
CN108062402B (en) * 2017-12-27 2020-10-27 云润大数据服务有限公司 Event timeline mining method and system
CN108363701B (en) * 2018-04-13 2022-06-28 达而观信息科技(上海)有限公司 Named entity identification method and system
CN111178073B (en) * 2018-10-23 2024-06-04 北京嘀嘀无限科技发展有限公司 Text processing method, device, electronic equipment and storage medium
CN111292751B (en) * 2018-11-21 2023-02-28 北京嘀嘀无限科技发展有限公司 Semantic analysis method and device, voice interaction method and device, and electronic equipment
CN109684631A (en) * 2018-12-12 2019-04-26 北京神州泰岳软件股份有限公司 Name entity abstracting method, device and medium
CN110210023A (en) * 2019-05-23 2019-09-06 竹间智能科技(上海)有限公司 A kind of calculation method of practical and effective name Entity recognition
CN110162795A (en) * 2019-05-30 2019-08-23 重庆大学 A kind of adaptive cross-cutting name entity recognition method and system
CN110795941B (en) * 2019-10-26 2024-04-05 创新工场(广州)人工智能研究有限公司 Named entity identification method and system based on external knowledge and electronic equipment
CN111062213B (en) * 2019-11-19 2024-01-12 竹间智能科技(上海)有限公司 Named entity identification method, device, equipment and medium
CN111274368B (en) * 2020-01-07 2024-04-16 北京声智科技有限公司 Groove filling method and device
CN111310481B (en) * 2020-01-19 2021-05-18 百度在线网络技术(北京)有限公司 Speech translation method, device, computer equipment and storage medium
CN111339910B (en) * 2020-02-24 2023-11-28 支付宝实验室(新加坡)有限公司 Text processing and text classification model training method and device
CN111400429B (en) * 2020-03-09 2023-06-30 北京奇艺世纪科技有限公司 Text entry searching method, device, system and storage medium
CN113204643B (en) * 2021-06-23 2021-11-02 北京明略软件系统有限公司 Entity alignment method, device, equipment and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103309926A (en) * 2013-03-12 2013-09-18 中国科学院声学研究所 Chinese and English-named entity identification method and system based on conditional random field (CRF)
CN104572958A (en) * 2014-12-29 2015-04-29 中国科学院计算机网络信息中心 Event extraction based sensitive information monitoring method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130159277A1 (en) * 2011-12-14 2013-06-20 Microsoft Corporation Target based indexing of micro-blog content

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
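A music named entity recognition method based on a classification approach; Fu Ruiji et al.; Journal of Natural Science of Heilongjiang University; 2009-10-31; Vol. 26; pp. 62, 66-67 *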

Also Published As

Publication number Publication date
CN105138515A (en) 2015-12-09

Similar Documents

Publication Publication Date Title
CN105138515B (en) Name entity recognition method and device
CN106503805A (en) A kind of bimodal based on machine learning everybody talk with sentiment analysis system and method
Carvalho et al. Learning to extract signature and reply lines from email
CN102831891B (en) Processing method and system for voice data
CN108984530A (en) A kind of detection method and detection system of network sensitive content
CN107423278B (en) Evaluation element identification method, device and system
CN108711421A (en) A kind of voice recognition acoustic model method for building up and device and electronic equipment
CN110442718A (en) Sentence processing method, device and server and storage medium
Xu et al. Exploiting shared information for multi-intent natural language sentence classification.
WO2012147428A1 (en) Text clustering device, text clustering method, and computer-readable recording medium
CN104331506A (en) Multiclass emotion analyzing method and system facing bilingual microblog text
CN103700370A (en) Broadcast television voice recognition method and system
CN106847279A (en) Man-machine interaction method based on robot operating system ROS
CN104778184A (en) Feedback keyword determining method and device
CN103098124B (en) Method and system for text to speech conversion
Smitha et al. Meme classification using textual and visual features
CN110263345A (en) Keyword extracting method, device and storage medium
CN104699819A (en) Sememe classification method and device
CN111354354A (en) Training method and device based on semantic recognition and terminal equipment
CN108804413B (en) Text cheating identification method and device
CN104778162A (en) Subject classifier training method and system based on maximum entropy
Dhrangadhariya et al. Machine learning assisted citation screening for systematic reviews
WO2023108459A1 (en) Training and using a deep learning model for transcript topic segmentation
Wang et al. Weakly Supervised Chinese short text classification algorithm based on ConWea model
CN110209821A (en) Text categories determine method and apparatus

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant