Entity disambiguation method and system
Technical field
The invention belongs to the technical field of human-computer interaction, and in particular relates to an entity disambiguation method and system.
Background art
In the field of human-computer interaction, an entity can be understood simply as a noun. Entity disambiguation addresses the fact that a noun occurring in a sentence may have multiple meanings (multiple senses), so the specific meaning that the noun carries in that sentence (its language environment) must be determined. For example, "Li Bai" may refer to the poet Li Bai or to the song "Li Bai".
In natural language, the specific meaning of an entity can vary widely, yet within a given context an entity represents a single meaning. Humans can judge intuitively which meaning an entity represents, but for a machine or robot to have a human-like ability to judge the specific meaning an entity represents, researchers must develop dedicated techniques.
At present, entity disambiguation is generally carried out with rules, that is, by defining the meaning an entity represents when it co-occurs with other entities. The drawbacks of this approach are that it requires many specialists, requires a large number of rules to be defined, and is difficult to maintain. Another approach uses machine learning or deep learning: drawing on the computing power of machines, a large number of sentences together with the meanings of the entities in them are fed in, and the machine learns a model that can determine the meaning of an entity in a specific context. The drawbacks of this approach are that the number of parameters may be large and hard to tune, a large amount of data is required, and training demands considerable time and computing power.
Summary of the invention
In view of the defects in the prior art, the present invention provides an entity disambiguation method and system that overcome the prior art's need for manually defined rules or for a large amount of training data.
In a first aspect, an entity disambiguation method comprises the following steps:
Obtain an entity to be disambiguated in the natural language to be analyzed, and multiple senses of that entity;
Calculate, for each sense of the entity to be disambiguated, the total score of that sense appearing in the natural language to be analyzed;
Define the sense with the highest total score as the meaning of the entity to be disambiguated in the natural language to be analyzed.
Preferably, the total score of each sense of the entity to be disambiguated appearing in the natural language to be analyzed is calculated as follows:
Calculate a statistics score;
Calculate a semantic score;
Calculate the total score of each sense appearing in the natural language to be analyzed according to the following formula:
Total score = W1 × statistics score + W2 × semantic score;
where W1 and W2 are weights, and W1 + W2 = 1.
Preferably, the statistics score is calculated as follows:
Preprocess the natural language to be analyzed by removing its stop words;
Perform multi-granularity word segmentation on the preprocessed natural language to be analyzed to obtain the context words of the entity to be disambiguated;
Select the subgraph of the entity to be disambiguated from a preset knowledge graph; the knowledge graph contains a subgraph for each entity, and the subgraph of each entity contains all senses of that entity;
When a sense in the subgraph corresponds to a context word in the natural language to be analyzed, determine that the sense appears in the natural language to be analyzed, and define that sense as a reference sense;
Calculate the statistics score of each sense according to the following formula:
where n is the number of reference senses in the subgraph of the entity to be disambiguated, and i is a variable.
Preferably, the semantic score is calculated as follows:
Obtain a reference field;
Segment the natural language to be analyzed and the reference field respectively to obtain the context words of the entity to be disambiguated and the reference words of the reference field;
Calculate the semantic similarity between each sense in the context words and each reference word;
Calculate the semantic score of each sense according to the following formula:
where max is the maximum function, m is the number of senses in the context words, and j is a variable.
In a second aspect, an entity disambiguation system comprises:
An acquisition unit, configured to obtain an entity to be disambiguated in the natural language to be analyzed, and multiple senses of that entity;
An analysis unit, configured to calculate, for each sense of the entity to be disambiguated, the total score of that sense appearing in the natural language to be analyzed;
An output unit, configured to define the sense with the highest total score as the meaning of the entity to be disambiguated in the natural language to be analyzed.
Preferably, the analysis unit is specifically configured to:
Calculate a statistics score;
Calculate a semantic score;
Calculate the total score of each sense appearing in the natural language to be analyzed according to the following formula:
Total score = W1 × statistics score + W2 × semantic score;
where W1 and W2 are weights, and W1 + W2 = 1.
Preferably, the analysis unit is specifically configured to:
Preprocess the natural language to be analyzed by removing its stop words;
Perform multi-granularity word segmentation on the preprocessed natural language to be analyzed to obtain the context words of the entity to be disambiguated;
Select the subgraph of the entity to be disambiguated from a preset knowledge graph; the knowledge graph contains a subgraph for each entity, and the subgraph of each entity contains all senses of that entity;
When a sense in the subgraph corresponds to a context word in the natural language to be analyzed, determine that the sense appears in the natural language to be analyzed, and define that sense as a reference sense;
Calculate the statistics score of each sense according to the following formula:
where n is the number of reference senses in the subgraph of the entity to be disambiguated, and i is a variable.
Preferably, the analysis unit is specifically configured to:
Obtain a reference field;
Segment the natural language to be analyzed and the reference field respectively to obtain the context words of the entity to be disambiguated and the reference words of the reference field;
Calculate the semantic similarity between each sense in the context words and each reference word;
Calculate the semantic score of each sense according to the following formula:
where max is the maximum function, m is the number of senses in the context words, and j is a variable.
As can be seen from the above technical solutions, the entity disambiguation method and system provided by the invention overcome the prior art's need for manually defined rules or for a large amount of training data.
Brief description of the drawings
In order to illustrate the specific embodiments of the invention or the technical solutions in the prior art more clearly, the drawings needed for the description of the specific embodiments or the prior art are briefly introduced below. In all the drawings, similar elements or parts are generally identified by similar reference numerals. In the drawings, elements or parts are not necessarily drawn to scale.
Fig. 1 is a flowchart of the entity disambiguation method provided by Embodiment one of the invention.
Fig. 2 is a module block diagram of the entity disambiguation system provided by Embodiment three of the invention.
Detailed description of the embodiments
Embodiments of the technical solution of the invention are described in detail below with reference to the drawings. The following embodiments serve only to illustrate the technical solution of the invention more clearly; they are therefore examples only and must not be used to limit the scope of protection of the invention. It should be noted that, unless otherwise indicated, the technical or scientific terms used in this application have the ordinary meaning understood by a person skilled in the art to which the invention belongs.
It should be understood that, when used in this specification and the appended claims, the terms "comprise" and "include" indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or sets thereof.
It should also be understood that the terminology used in this specification is for the purpose of describing specific embodiments only and is not intended to limit the invention. As used in this specification and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
As used in this specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining" or "in response to detecting". Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, as "once it is determined", "in response to determining", "once [the described condition or event] is detected" or "in response to detecting [the described condition or event]".
Embodiment one:
An entity disambiguation method, referring to Fig. 1, comprises the following steps:
S1: Obtain an entity to be disambiguated in the natural language to be analyzed, and multiple senses of that entity;
Specifically, each entity has multiple senses. For example, "Li Bai" may refer to the poet Li Bai or to the song "Li Bai", so the poet Li Bai and the song "Li Bai" are two senses of "Li Bai". The natural language to be analyzed is the conversation content input by the user.
S2: Calculate, for each sense of the entity to be disambiguated, the total score of that sense appearing in the natural language to be analyzed;
Specifically, the higher the total score, the greater the probability that the sense appears in the natural language to be analyzed, and the more likely it is that the sense is the intended meaning there. The lower the total score, the smaller that probability, and the less likely it is that the sense is the intended meaning.
S3: Define the sense with the highest total score as the meaning of the entity to be disambiguated in the natural language to be analyzed.
Specifically, the method combines the information of the entity itself with the context words of the current conversation to disambiguate the meaning of a specific entity appearing in the natural language to be analyzed. In knowledge graph terms, the method can be seen as selecting the single correct ontology mapping. In essence, it selects, from the multiple senses of an entity, the sense that corresponds to the current language environment.
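Steps S1 to S3 reduce to choosing the sense with the highest total score. The following is a minimal sketch of step S3 only, assuming the total scores from step S2 are already available as a mapping from sense to score; the function name `pick_sense` and the example scores are hypothetical and do not come from the invention.

```python
from typing import Dict


def pick_sense(total_scores: Dict[str, float]) -> str:
    """Return the sense with the highest total score (step S3)."""
    if not total_scores:
        raise ValueError("the entity has no candidate senses")
    # max() with key= returns the dictionary key whose value is largest.
    return max(total_scores, key=total_scores.get)


# Hypothetical total scores produced by step S2 for the entity "Li Bai".
scores = {"poet Li Bai": 0.72, "song Li Bai": 0.31}
best = pick_sense(scores)
print(best)
```

If two senses tie, `max` returns the first one encountered; a real system would need an explicit tie-breaking rule, which the text does not specify.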
Preferably, the total score of each sense of the entity to be disambiguated appearing in the natural language to be analyzed is calculated as follows:
Calculate a statistics score;
Calculate a semantic score;
Calculate the total score of each sense appearing in the natural language to be analyzed according to the following formula:
Total score = W1 × statistics score + W2 × semantic score;
where W1 and W2 are weights, and W1 + W2 = 1.
Specifically, W1 and W2 represent the reliability of the corresponding scores, and each can initially be set to 0.5. In a specific implementation, they can be adjusted according to the reliability of the statistics score and the semantic score, provided that their sum remains 1 however they are adjusted. Adjusting the weights changes the relative importance of the two results: if the statistics score is considered more reliable, increase W1; otherwise increase W2. If W1 or W2 is adjusted to zero, the corresponding result no longer contributes to the total score. The weights can be tuned empirically according to the project, or by machine learning. For example, given multiple groups of weights, the corresponding calculated results and the true results, these can be used as the input to machine learning, which outputs a model. When the model receives new input, it outputs the most suitable values of the two weights.
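The weighted combination can be sketched in a few lines. This is an illustrative sketch, not the invention's implementation: the function name `total_score` is hypothetical, and W2 is derived from W1 so that the constraint W1 + W2 = 1 always holds.

```python
def total_score(statistics_score: float, semantic_score: float, w1: float = 0.5) -> float:
    """Total score = W1 * statistics score + W2 * semantic score, with W1 + W2 = 1."""
    if not 0.0 <= w1 <= 1.0:
        raise ValueError("w1 must lie in [0, 1]")
    w2 = 1.0 - w1  # derived, so the two weights always sum to 1
    return w1 * statistics_score + w2 * semantic_score


# With the initial weights of 0.5 each, both scores count equally.
print(total_score(0.8, 0.4))
# With w1 = 1.0 the semantic score no longer contributes.
print(total_score(0.8, 0.4, w1=1.0))
```

Passing only W1 avoids an inconsistent pair of weights; an implementation that tunes the weights by machine learning could still feed its learned W1 into the same function.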
Embodiment two:
Embodiment two adds the following to Embodiment one:
1. The statistics score is calculated as follows:
Preprocess the natural language to be analyzed by removing its stop words;
Specifically, removing the stop words from the natural language to be analyzed improves the efficiency of natural language processing and saves space. A stop word list can be set up for filtering stop words; the list can be determined by product and technical staff according to the specific product and functional requirements.
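Stop word filtering against a configurable list can be sketched as follows. The stop word list, the function name `remove_stop_words` and the pre-split example input are all assumptions made for illustration; in the method itself the list is chosen per product and the text is segmented in a later step.

```python
from typing import Iterable, List, Set

# Hypothetical stop word list; in practice it is defined per product.
STOP_WORDS: Set[str] = {"the", "a", "of", "is", "please"}


def remove_stop_words(tokens: Iterable[str], stop_words: Set[str] = STOP_WORDS) -> List[str]:
    """Drop every token that appears in the stop word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = ["please", "play", "the", "song", "Li Bai"]
filtered = remove_stop_words(tokens)
print(filtered)
```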
Perform multi-granularity word segmentation on the preprocessed natural language to be analyzed to obtain the context words of the entity to be disambiguated;
Specifically, multi-granularity word segmentation means that segmentation is no longer performed at a single granularity (for example, the maximum number of characters in a word is no longer fixed). Instead, the frequency of every word in the dictionary is counted, and high-frequency words are treated as a single word during segmentation. For example, suppose "Valentine's Day" has a word frequency of 1000, "sweetheart" 500 and "festival" 400. The minimum word frequency of the segmentation "Valentine's Day" is then 1000, while that of "sweetheart / festival" is 400, so the result should be "Valentine's Day", with the three characters treated as one indivisible word.
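The rule in this example, preferring the segmentation whose least frequent word is most frequent, can be sketched as below. The function names and the exhaustive enumeration are assumptions for illustration, not the invention's algorithm, and the Chinese strings are an assumed reconstruction of the translated example ("情人节" for "Valentine's Day", "情人" for "sweetheart", "节" for "festival").

```python
from typing import Dict, List


def segmentations(text: str, freq: Dict[str, int]) -> List[List[str]]:
    """Enumerate every way to split text into words found in the dictionary."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in freq:
            for tail in segmentations(text[end:], freq):
                results.append([head] + tail)
    return results


def best_segmentation(text: str, freq: Dict[str, int]) -> List[str]:
    """Pick the segmentation whose minimum word frequency is highest."""
    if not text:
        return []
    candidates = segmentations(text, freq)
    if not candidates:
        return [text]  # no dictionary split exists; keep the text whole
    return max(candidates, key=lambda seg: min(freq[w] for w in seg))


# Word frequencies from the example in the text.
freq = {"情人节": 1000, "情人": 500, "节": 400}
print(best_segmentation("情人节", freq))
```

The enumeration is exponential in the worst case; it is kept only because it mirrors the comparison in the text directly.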
Select the subgraph of the entity to be disambiguated from a preset knowledge graph; the knowledge graph contains a subgraph for each entity, and the subgraph of each entity contains all senses of that entity;
When a sense in the subgraph corresponds to a context word in the natural language to be analyzed, determine that the sense appears in the natural language to be analyzed, and define that sense as a reference sense;
Calculate the statistics score of each sense according to the following formula:
where n is the number of reference senses in the subgraph of the entity to be disambiguated, and i is a variable.
Specifically, the statistics score takes into account the number of context words of the entity to be disambiguated and the distance between each of these context words and the entity in the natural language to be analyzed. For example, let the entity to be disambiguated be A and the natural language to be analyzed be B. The subgraph of A is first selected from the knowledge graph; the subgraph contains several words (that is, senses). The number of A's senses that appear in B is determined first, which gives the numerator. If a word C in the subgraph of A is also a word in B, C is defined as a reference sense, and the distance between C and A is the denominator. The statistics score is then calculated over all reference senses in the subgraph of the entity to be disambiguated in turn.
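The formula for the statistics score does not survive in this text, so the sketch below is only one possible reading of the description: each reference word contributes 1 divided by its token distance from the entity, and the contributions are summed. The function name `statistics_score`, the per-sense word sets and the use of token positions as the distance are all assumptions, not the invention's definition.

```python
from typing import List, Set


def statistics_score(tokens: List[str], entity: str, sense_words: Set[str]) -> float:
    """Assumed reading: sum of 1 / distance over subgraph words found in the context."""
    entity_pos = tokens.index(entity)  # raises ValueError if the entity is absent
    score = 0.0
    for pos, token in enumerate(tokens):
        if pos != entity_pos and token in sense_words:
            # The token is a reference word; closer words contribute more.
            score += 1.0 / abs(pos - entity_pos)
    return score


# Hypothetical subgraph words for two senses of "Li Bai".
poet_words = {"poem", "poet", "dynasty"}
song_words = {"song", "sing", "lyrics"}

tokens = ["recite", "poem", "Li Bai"]
print(statistics_score(tokens, "Li Bai", poet_words))
print(statistics_score(tokens, "Li Bai", song_words))
```

Under this reading, a sense with more reference words, or with reference words nearer to the entity, receives a higher score, which matches the two factors named in the description.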
2. The semantic score is calculated as follows:
Obtain a reference field;
Specifically, the industry generally accepts that the description of an entity in an encyclopedia, such as Baidu Baike, is the most precise semantic description of the corresponding sense of that entity. The reference field used for calculating the semantic score can therefore be selected from an encyclopedia. The reference field can be a single sentence, several sentences or a paragraph.
Segment the natural language to be analyzed and the reference field respectively to obtain the context words of the entity to be disambiguated and the reference words of the reference field;
Calculate the semantic similarity between each sense in the context words and each reference word;
Specifically, suppose the senses in the context words are A1, A2 and A3, and the segmentation result of the reference field is B1, B2 and B3. The semantic similarity must then be calculated for A1 and B1, A1 and B2, and A1 and B3, then for A2 and B1, A2 and B2, and A2 and B3, and so on. This yields the semantic similarity between each sense and each reference word. The semantic similarity can be calculated with an existing semantic similarity algorithm.
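The pairwise computation over A1..A3 and B1..B3 can be sketched as a table keyed by word pairs. The text leaves the similarity algorithm open, so cosine similarity over word vectors is used here purely as a stand-in; the function names and the two-dimensional toy vectors are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pairwise_similarity(
    context_words: Sequence[str],
    reference_words: Sequence[str],
    vectors: Dict[str, Sequence[float]],
) -> Dict[Tuple[str, str], float]:
    """Similarity of every (context word, reference word) pair."""
    return {
        (a, b): cosine(vectors[a], vectors[b])
        for a in context_words
        for b in reference_words
    }


# Toy vectors, invented for illustration only.
vectors = {"A1": (1.0, 0.0), "A2": (0.0, 1.0), "B1": (1.0, 0.0), "B2": (0.0, 1.0)}
table = pairwise_similarity(["A1", "A2"], ["B1", "B2"], vectors)
print(table)
```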
Calculate the semantic score of each sense according to the following formula:
where max is the maximum function, m is the number of senses in the context words, and j is a variable.
Specifically, the maximum of all semantic similarities under the same sense is found, and that maximum is defined as the semantic score of the sense.
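Taking the maximum similarity as the semantic score can be sketched as follows. The formula itself is not reproduced in this text, so the sketch follows the verbal description only; the function name `semantic_score` and the choice of 0.0 for a sense with no similarities are assumptions.

```python
from typing import Iterable


def semantic_score(similarities: Iterable[float]) -> float:
    """Semantic score of one sense: the largest similarity computed for it."""
    # default=0.0 is an assumed fallback for a sense with no similarities.
    return max(similarities, default=0.0)


# Hypothetical similarities computed for one sense.
print(semantic_score([0.2, 0.9, 0.5]))
```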
For brevity, where the description of the method provided by this embodiment omits details, refer to the corresponding content of the preceding method embodiment.
Embodiment three:
An entity disambiguation system, referring to Fig. 2, comprises:
An acquisition unit, configured to obtain an entity to be disambiguated in the natural language to be analyzed, and multiple senses of that entity;
An analysis unit, configured to calculate, for each sense of the entity to be disambiguated, the total score of that sense appearing in the natural language to be analyzed;
An output unit, configured to define the sense with the highest total score as the meaning of the entity to be disambiguated in the natural language to be analyzed.
Preferably, the analysis unit is specifically configured to:
Calculate a statistics score;
Calculate a semantic score;
Calculate the total score of each sense appearing in the natural language to be analyzed according to the following formula:
Total score = W1 × statistics score + W2 × semantic score;
where W1 and W2 are weights, and W1 + W2 = 1.
Preferably, the analysis unit is specifically configured to:
Preprocess the natural language to be analyzed by removing its stop words;
Perform multi-granularity word segmentation on the preprocessed natural language to be analyzed to obtain the context words of the entity to be disambiguated;
Select the subgraph of the entity to be disambiguated from a preset knowledge graph; the knowledge graph contains a subgraph for each entity, and the subgraph of each entity contains all senses of that entity;
When a sense in the subgraph corresponds to a context word in the natural language to be analyzed, determine that the sense appears in the natural language to be analyzed, and define that sense as a reference sense;
Calculate the statistics score of each sense according to the following formula:
where n is the number of reference senses in the subgraph of the entity to be disambiguated, and i is a variable.
Preferably, the analysis unit is specifically configured to:
Obtain a reference field;
Segment the natural language to be analyzed and the reference field respectively to obtain the context words of the entity to be disambiguated and the reference words of the reference field;
Calculate the semantic similarity between each sense in the context words and each reference word;
Calculate the semantic score of each sense according to the following formula:
where max is the maximum function, m is the number of senses in the context words, and j is a variable.
The entity disambiguation system overcomes the prior art's need for manually defined rules or for a large amount of training data.
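The division into acquisition, analysis and output units can be sketched as one small class. This is an illustrative arrangement under assumptions, not the system of Fig. 2: the class name, the callable signatures and the toy units used in the example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EntityDisambiguationSystem:
    # Acquisition unit: text -> (entity to be disambiguated, its senses).
    acquisition_unit: Callable[[str], Tuple[str, List[str]]]
    # Analysis unit: (text, entity, sense) -> total score of that sense.
    analysis_unit: Callable[[str, str, str], float]

    def output_unit(self, text: str) -> str:
        """Return the sense with the highest total score."""
        entity, senses = self.acquisition_unit(text)
        scores = {s: self.analysis_unit(text, entity, s) for s in senses}
        return max(scores, key=scores.get)  # raises ValueError if there are no senses


# Toy units, invented for illustration only.
def toy_acquire(text: str) -> Tuple[str, List[str]]:
    return "Li Bai", ["poet Li Bai", "song Li Bai"]


def toy_analyse(text: str, entity: str, sense: str) -> float:
    # Score 1.0 if the first word of the sense label occurs in the text.
    return 1.0 if sense.split()[0] in text else 0.0


system = EntityDisambiguationSystem(toy_acquire, toy_analyse)
result = system.output_unit("play the song Li Bai")
print(result)
```

Keeping the units as interchangeable callables reflects the point that the division into units is logical rather than physical.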
In the several embodiments provided in this application, it should be understood that the disclosed system can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, devices or units, and may be electrical, mechanical or of another form.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiments of the invention.
In addition, the functional units in the embodiments of the invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. On this understanding, the technical solution of the invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied as a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods of the embodiments of the invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.
For brevity, where the description of the system provided by this embodiment omits details, refer to the corresponding content of the preceding method embodiments.
Finally, it should be noted that the above embodiments serve only to illustrate the technical solution of the invention and not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, a person skilled in the art should understand that the technical solutions described in those embodiments can still be modified, or some or all of their technical features replaced by equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the invention, and they should all be covered by the claims and the description of the invention.