CN108009184A - Method and device for detecting confusion of same-name instances in a knowledge base - Google Patents

Method and device for detecting confusion of same-name instances in a knowledge base

Info

Publication number
CN108009184A
Authority
CN
China
Prior art keywords
text
vector
knowledge base
object vector
ordered set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610974455.2A
Other languages
Chinese (zh)
Other versions
CN108009184B (en
Inventor
谢海华
黄肖俊
吕肖庆
汤帜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Founder Holdings Development Co ltd
Peking University
Founder Apabi Technology Ltd
Original Assignee
Peking University
Peking University Founder Group Co Ltd
Beijing Founder Apabi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University, Peking University Founder Group Co Ltd, Beijing Founder Apabi Technology Co Ltd
Priority to CN201610974455.2A
Publication of CN108009184A
Application granted
Publication of CN108009184B
Legal status: Expired - Fee Related (current)
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367: Ontology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present invention provide a method and device for detecting confusion of same-name instances in a knowledge base. The method includes: obtaining a text library whose content is related to the content of the knowledge base; obtaining a first object and, according to the first object and the text library, constructing the set of object vectors corresponding to the first object, where the dimension of each object vector equals the number of texts in the text library and the first object is any instance in the knowledge base; and performing cluster analysis on the object vectors and determining, according to the result of the cluster analysis, whether same-name instance confusion occurs in the knowledge base. By automatically examining multiple first objects in the knowledge base to determine whether a first object is mixed with ordered sets of other same-name instances, the embodiments detect same-name instance confusion automatically. No manual check of each first object is needed, which saves a great deal of manpower and substantially improves detection efficiency.

Description

Method and device for detecting confusion of same-name instances in a knowledge base
Technical field
Embodiments of the present invention relate to the technical field of knowledge bases and knowledge graphs, and in particular to a method and device for detecting confusion of same-name instances in a knowledge base.
Background art
A knowledge base is a database that stores knowledge in the structured form of triples. It is used to store, in a structured way, massive amounts of knowledge about a certain field or industry. For example, a history knowledge base can store massive knowledge in the field of history, including historical figures, historical events, and so on. A knowledge base takes instances as its main objects of description and represents knowledge in an object-oriented manner; an instance is a reference to one concrete or abstract thing in reality. For example, an instance can represent a person, a city, a physical object, and so on.
A knowledge base generally includes multiple instances, and the attributes of each instance and the relations between instances are stored as triples. The triple is the basic structure for representing knowledge in a knowledge base and can be expressed as <instance ID, predicate, instance ID/attribute value>. The first element of a triple is an instance ID, namely the ID of the instance 1 to which the triple belongs. The second element is a predicate, which describes a relation or an attribute of the instance. The third element can be the ID of another instance 2 or an attribute value of instance 1. When the third element is the ID of instance 2, the triple is a relation triple that describes the relation between instance 1 and instance 2, and the predicate states that relation. When the third element is an attribute value, the triple is an attribute triple that describes one attribute of instance 1, and the predicate states that attribute. For example, instance 1 represents the poet Li Po and includes the triple <id1, name, Li Po>; instance 2 represents the poet Tu Fu and includes the two triples <id2, name, Tu Fu> and <id2, friend, id1>. Then <id2, name, Tu Fu> is an attribute triple of instance 2, indicating that the name of instance 2 is Tu Fu; <id2, friend, id1> is a relation triple of instance 2, indicating the relation between instance 2 and instance 1, which in this example means that Tu Fu and Li Po are friends.
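The triple structure described above can be sketched in Python as follows. This is only an illustration: the `Triple` type, the field names, and the rule "the third element is a relation target if it equals a known instance ID" are assumptions made for the sketch, not something the patent specifies.

```python
from typing import List, NamedTuple, Set


class Triple(NamedTuple):
    """Hypothetical container for <instance ID, predicate, instance ID/attribute value>."""
    subject_id: str  # ID of the instance the triple belongs to
    predicate: str   # name of the relation or attribute
    obj: str         # ID of another instance, or an attribute value


def is_relation_triple(triple: Triple, known_ids: Set[str]) -> bool:
    """Assumed rule: a triple is a relation triple when its third element is a known instance ID."""
    return triple.obj in known_ids


# The Li Po / Tu Fu example from the text.
knowledge_base: List[Triple] = [
    Triple("id1", "name", "Li Po"),
    Triple("id2", "name", "Tu Fu"),
    Triple("id2", "friend", "id1"),
]
known_ids = {t.subject_id for t in knowledge_base}
kinds = [
    "relation" if is_relation_triple(t, known_ids) else "attribute"
    for t in knowledge_base
]
```

Here `<id2, friend, id1>` is classified as a relation triple and the two name triples as attribute triples, matching the description above.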
Each instance in a knowledge base has a unique name attribute, which stores the name of the instance. The predicate of the name attribute is "name"; if the name attributes of two instances have the same value, the two instances are same-name instances. While a knowledge base is being built, the attributes of two or more same-name instances are easily confused. For example, consider two people named Li Po. The first person: name: Li Po; occupation: poet; era: Tang Dynasty; gender: male. The second person: name: Li Po; occupation: student; year of birth: 1996; gender: female; major: artificial intelligence. In a built knowledge base, the attributes of the two instances corresponding to these two people may well be confused. For example, the knowledge base may store an instance that includes the following triples: <id, name, Li Po>, <id, occupation, student>, <id, era, Tang Dynasty>, <id, gender, male>, <id, major, artificial intelligence>. This instance mixes attributes of the two instances corresponding to the two people named "Li Po"; that is, the attributes of two same-name instances are confused. If the attributes of two same-name instances are confused, the two same-name instances are considered confused.
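As a small illustration of the definition of same-name instances, the sketch below groups instance IDs by the value of their "name" attribute. The function name and the IDs `id1`, `id2`, `id3` are hypothetical; the patent does not prescribe how same-name instances are enumerated.

```python
from collections import defaultdict


def group_same_name_instances(triples):
    """Map each name to the instance IDs whose "name" attribute has that value."""
    ids_by_name = defaultdict(set)
    for instance_id, predicate, value in triples:
        if predicate == "name":
            ids_by_name[value].add(instance_id)
    # Keep only names shared by two or more instances.
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


triples = [
    ("id1", "name", "Li Po"), ("id1", "occupation", "poet"),
    ("id3", "name", "Li Po"), ("id3", "occupation", "student"),
    ("id2", "name", "Tu Fu"),
]
same_name = group_same_name_instances(triples)
```

Note that this only finds instances that share a name. It cannot reveal an instance whose own triples already mix two people, which is the case the detection method targets.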
Existing knowledge bases generally contain multiple groups of same-name instances, and confusion of same-name instances does occur. At present, technicians manually check the triples of each instance in the knowledge base against the context of each text in a related text library, to determine whether the instance is mixed with triples of other same-name instances. Because a knowledge base typically stores massive knowledge, the number of instances is very large and the number of triples they contain is larger still. This manual checking consumes a great deal of manpower and time, and is very inefficient.
Summary of the invention
Embodiments of the present invention provide a method and device for detecting confusion of same-name instances in a knowledge base, to solve the problem in the prior art that determining by manual checking whether same-name instance confusion exists in a knowledge base consumes a great deal of manpower and time and is very inefficient.
One aspect of the embodiments of the present invention provides a method for detecting confusion of same-name instances in a knowledge base, including:
obtaining a text library, where the content of the text library is related to the content of a knowledge base, the text library includes at least one text, each text includes at least one sentence, the knowledge base includes multiple instances, each instance includes multiple ordered sets each composed of N statements, and N is a positive integer greater than or equal to 3;
obtaining a first object and, according to the first object and the text library, constructing the set of object vectors corresponding to the first object, where the dimension of each object vector equals the number of texts in the text library, and the first object is any instance in the knowledge base;
performing cluster analysis on the object vectors, and determining according to the result of the cluster analysis whether same-name instance confusion occurs in the knowledge base.
Another aspect of the embodiments of the present invention provides a device for detecting confusion of same-name instances in a knowledge base, including:
an acquisition module, configured to obtain a text library, where the content of the text library is related to the content of a knowledge base, the text library includes at least one text, each text includes at least one sentence, the knowledge base includes multiple instances, each instance includes multiple ordered sets each composed of N statements, and N is a positive integer greater than or equal to 3;
a construction module, configured to obtain a first object and, according to the first object and the text library, construct the set of object vectors corresponding to the first object, where the dimension of each object vector equals the number of texts in the text library, and the first object is any instance in the knowledge base;
a cluster analysis module, configured to perform cluster analysis on the object vectors and determine according to the result of the cluster analysis whether same-name instance confusion occurs in the knowledge base.
In the method and device for detecting confusion of same-name instances in a knowledge base provided by the embodiments of the present invention, a text library whose content is related to the knowledge base is obtained. For a first object in the knowledge base, which can be any instance in the knowledge base, the set of object vectors corresponding to the first object is constructed according to the text library. Cluster analysis is performed on the object vectors, and whether same-name instance confusion occurs in the knowledge base is determined according to the result of the cluster analysis. Same-name instance confusion is thus detected automatically, without manually checking each first object, which saves a great deal of manpower and substantially improves detection efficiency.
Brief description of the drawings
Fig. 1 is a flowchart of the method for detecting confusion of same-name instances in a knowledge base provided by Embodiment one of the present invention;
Fig. 2 is a flowchart of the method for detecting confusion of same-name instances in a knowledge base provided by Embodiment two of the present invention;
Fig. 3 is a schematic structural diagram of the device for detecting confusion of same-name instances in a knowledge base provided by Embodiment three of the present invention;
Fig. 4 is a schematic structural diagram of the device for detecting confusion of same-name instances in a knowledge base provided by Embodiment four of the present invention.
Detailed description of the embodiments
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
In the description of the present application, it should be understood that a knowledge base includes multiple instances, each instance includes multiple ordered sets each composed of N statements, and N is a positive integer greater than or equal to 3. An ordered set can describe an attribute of an instance or a relation between instances, and the statements in an ordered set are arranged in a predefined order. For example, when N=3, an ordered set can be represented as a triple, and each element of the triple can be one statement. A statement consists of a single word or a group of syntactically related words.
Embodiment one
Fig. 1 is a flowchart of the method for detecting confusion of same-name instances in a knowledge base provided by Embodiment one of the present invention. In the prior art, whether same-name instance confusion exists in a knowledge base is determined by manual checking, which consumes a great deal of manpower and time and is very inefficient. To address this problem, this embodiment provides a method for detecting confusion of same-name instances in a knowledge base. The method includes the following steps:
Step S101: obtain a text library, where the content of the text library is related to the content of the knowledge base, the text library includes at least one text, and each text includes at least one sentence.
In this embodiment, the text library is a collection of natural language texts. A text library includes at least one text, and each text includes at least one sentence. The content of the text library is related to the content of the knowledge base and can serve as a reference for judging whether same-name instance confusion exists in the knowledge base. For example, if the knowledge base is a knowledge base in the field of history, the texts in the text library can include the text of an e-book of a history textbook and other history-related texts.
Optionally, texts related to the content of the knowledge base to be detected can be selected from an existing text library, or texts can be obtained directly from computer-readable files such as e-books, web pages, and electronic documents, to form the text library of the present invention.
Step S102: obtain a first object and, according to the first object and the text library, construct the set of object vectors corresponding to the first object.
The dimension of each object vector equals the number of texts in the text library.
In this embodiment, the first object can be any instance in the knowledge base. Each first object includes multiple ordered sets each composed of N statements, where N is a positive integer greater than or equal to 3.
Further, each ordered set in the first object corresponds to one object vector. The value of each dimension of the object vector corresponds to one text in the text library and can be determined according to how the statements in the ordered set occur in that text. That is, the number of vectors in the set of object vectors corresponding to the first object equals the number of ordered sets that the first object includes, and the ordered sets in the first object correspond one-to-one to the object vectors in the set.
It should be noted that one or more first objects in the knowledge base can be obtained in this embodiment. This embodiment takes a single first object as an example to describe how to detect whether same-name instance confusion occurs in the first object. When multiple first objects are obtained, the detection method is the same for each. This embodiment does not specifically limit the number of first objects obtained.
Step S103: perform cluster analysis on the object vectors, and determine according to the result of the cluster analysis whether same-name instance confusion occurs in the knowledge base.
Specifically, cluster analysis is performed on the object vectors in the set of object vectors of the first object according to the similarity between object vectors, and object vectors with high similarity are merged. If the number of vectors in the set obtained as the cluster analysis result is greater than 1, it is determined that same-name instance confusion occurs in the knowledge base.
The cluster analysis can use any prior-art clustering method that does not require the number of clusters to be specified in advance, such as hierarchical clustering; details are not repeated in this embodiment.
In this embodiment of the present invention, a text library whose content is related to the knowledge base is obtained. For a first object in the knowledge base, which can be any instance in the knowledge base, the set of object vectors corresponding to the first object is constructed according to the text library. Cluster analysis is performed on the object vectors, and whether same-name instance confusion occurs in the knowledge base is determined according to the result of the cluster analysis. Same-name instance confusion is thus detected automatically, without manually checking each first object, which saves a great deal of manpower and substantially improves detection efficiency.
Embodiment two
Fig. 2 is a flowchart of the method for detecting confusion of same-name instances in a knowledge base provided by Embodiment two of the present invention. On the basis of Embodiment one above, this embodiment describes the method in detail. The method specifically includes the following steps:
Step S201: obtain a text library, where the content of the text library is related to the content of the knowledge base, the text library includes at least one text, and each text includes at least one sentence.
Step S201 is similar to step S101, and details are not repeated in this embodiment.
Step S202: obtain a first object, and replace each instance ID in the ordered sets of the first object with the name of the instance corresponding to that instance ID, to obtain the second target corresponding to the first object.
In this embodiment, in each instance of the knowledge base, the last statement of at least one ordered set is the name of the instance, and the first statement of each ordered set is the instance ID of the instance to which the ordered set belongs. An instance ID uniquely identifies one instance.
The first object can be any instance in the knowledge base. In practice, a technician can designate one or more instances in the knowledge base as first objects. By detecting whether the ordered sets of a first object are mixed with ordered sets of other same-name instances, it is further determined whether same-name instance confusion occurs in the knowledge base.
For example, let N=3, so that ordered sets are represented as triples and each instance includes multiple triples. Suppose there are two instances, instance A and instance B. Instance A includes the following two ordered sets: <id1, name, Li Po> and <id1, occupation, poet>, where the first statement id1 of each ordered set is the instance ID of instance A. Instance A contains exactly one ordered set <id1, name, Li Po> whose last statement, "Li Po", is the name of instance A. Instance B includes the following ordered sets: <id2, name, Tu Fu>, <id2, occupation, poet>, <id2, works, five-character regulated verse "Spring View">, <id2, hobby, online games>, <id2, friend, id1>, <id2, position, company CEO>, where the first statement id2 of each ordered set is the instance ID of instance B. Instance B contains exactly one ordered set <id2, name, Tu Fu> whose last statement, "Tu Fu", is the name of instance B.
Suppose the first object obtained in this step is instance B. The instance IDs that occur in the ordered sets of instance B are "id1" and "id2". In this step, "id2" in the ordered sets of instance B is replaced with the name of instance B, which corresponds to id2; that is, "id2" is replaced with "Tu Fu". Likewise, "id1" in the ordered sets of instance B is replaced with the name of instance A, which corresponds to id1; that is, "id1" is replaced with "Li Po". The second target corresponding to the first object thus includes the following ordered sets: <Tu Fu, name, Tu Fu>, <Tu Fu, occupation, poet>, <Tu Fu, works, five-character regulated verse "Spring View">, <Tu Fu, hobby, online games>, <Tu Fu, friend, Li Po>, <Tu Fu, position, company CEO>.
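The ID replacement of step S202 can be sketched as below. The function `build_second_target` and the `name_by_id` lookup table are hypothetical names. The sketch also assumes that an instance ID never coincides with a predicate or attribute value, so any element equal to a known ID can safely be replaced.

```python
def build_second_target(first_object, name_by_id):
    """Replace every instance ID in each ordered set with the name of that instance."""
    return [
        tuple(name_by_id.get(statement, statement) for statement in ordered_set)
        for ordered_set in first_object
    ]


# Lookup table built from the unique "name" ordered set of each instance.
name_by_id = {"id1": "Li Po", "id2": "Tu Fu"}

# A subset of the ordered sets of instance B from the example above.
instance_b = [
    ("id2", "name", "Tu Fu"),
    ("id2", "occupation", "poet"),
    ("id2", "friend", "id1"),
]
second_target = build_second_target(instance_b, name_by_id)
```

After the replacement every statement is plain text, so it can be searched for directly in the sentences of the text library.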
Optionally, ordered sets in the first object that are irrelevant to detecting confusion of instance attributes can be removed, and only the instance IDs in the retained ordered sets are replaced with the names of the corresponding instances. The second target obtained in this way contains fewer ordered sets than the original first object, which reduces the number of ordered sets to be examined and therefore improves running efficiency. For example, if the first object represents a person, the ordered sets in the first object that represent the instance name and the person's gender can be removed. In this embodiment, the ordered sets irrelevant to detecting confusion of instance attributes can be specified by a technician according to actual needs; this embodiment does not specifically limit this.
It should be noted that when there are two or more first objects, steps S202-S209 can be performed for each first object separately, to determine whether each first object is mixed with ordered sets of other same-name instances. When any first object is mixed with ordered sets of other same-name instances, it is determined that same-name instance confusion occurs in that first object, and further that same-name instance confusion occurs in the knowledge base. When no first object shows same-name instance confusion, it is further determined that same-name instance confusion does not occur in the knowledge base.
Step S203: according to the second target and the text library, construct the set of object vectors corresponding to the second target.
The dimension of each object vector equals the number of texts in the text library.
Specifically, this step can be implemented as follows:
Obtain the interim vector corresponding to each ordered set of the second target, where the dimension of each interim vector equals the number of texts in the text library. For each ordered set in the second target, determine whether each text includes all target statements of the ordered set. If so, set the value of the dimension corresponding to that text in the interim vector of the ordered set to the first target value; if not, set the value of that dimension to the second target value. This yields the object vector corresponding to the ordered set, and hence the set of object vectors corresponding to the second target.
For example, the first target value can be set to 1 and the second target value can be set to 0. The target statements are the statements at two or more preset positions in the ordered set, for example the statements at the first and last positions of the ordered set.
In this embodiment, the text library can be expressed as WB={W1,W2,…,Wl,…,Wt}, where t is the number of texts in the text library and Wl denotes the l-th text in the text library, l=1,2,…,t. The larger the value of t, the better the effect of the cluster analysis. In practice, a text library containing hundreds to tens of thousands of texts is usually chosen, so that the clustering effect is good and the computation required by the cluster analysis is not excessive.
Let E denote the second target. The ordered sets that E includes can be expressed as {V1,V2,…,Vi,…,Vn}, where n is the number of ordered sets in E and Vi denotes any ordered set in E, i=1,2,…,n. Let Ci denote the object vector corresponding to the ordered set Vi. The dimension of the object vector is the same as the number of texts in the text library: Ci={Ci1,Ci2,…,Cil,…,Cit}, where Cil is the value of the l-th dimension of the object vector and corresponds to the text Wl in the text library. The texts W1,W2,…,Wl,…,Wt correspond one-to-one to the dimension values Ci1,Ci2,…,Cil,…,Cit of the object vector.
In this step, if the text Wl contains a sentence that includes all of the statements at the two or more preset positions of the ordered set Vi, the value Cil of the dimension corresponding to the text Wl in the object vector is set to 1; if no such sentence exists, Cil is set to 0.
Continuing the example from step S202, the second target corresponding to the first object includes the following six ordered sets: <Tu Fu, name, Tu Fu>, <Tu Fu, occupation, poet>, <Tu Fu, works, five-character regulated verse "Spring View">, <Tu Fu, hobby, online games>, <Tu Fu, friend, Li Po>, <Tu Fu, position, company CEO>. In the order of these six ordered sets, their interim vectors are denoted C1, C2, C3, C4, C5, and C6. Suppose the statements at the preset positions are those at the first and last positions of the ordered set, that is, the first and third statements of the triple. Suppose the text library includes four texts: text 1, text 2, text 3, and text 4. Then in step S203 the dimension of the interim vector of each ordered set of the second target is 4. Take the ordered set <Tu Fu, works, five-character regulated verse "Spring View"> as an example. Its interim vector C3 can be expressed as C3={C31, C32, C33, C34}, where C31, C32, C33, and C34 are the values of dimensions 1, 2, 3, and 4 of C3 and correspond to text 1, text 2, text 3, and text 4 in the text library, respectively. If text 1 contains a sentence in which both "Tu Fu" and "five-character regulated verse 'Spring View'" occur, the value of the dimension corresponding to text 1 is set to 1, that is, C31=1. If no sentence in text 1 contains both "Tu Fu" and "five-character regulated verse 'Spring View'", the value of the dimension corresponding to text 1 is set to 0, that is, C31=0. Similarly, the values of C32, C33, and C34 are determined from text 2, text 3, and text 4, respectively.
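The construction of one object vector in step S203 can be sketched as follows. Several things here are assumptions made for illustration: the four sample texts are invented, the sentence splitter is a naive split on `.`, `!` and `?`, statement matching is plain substring search, and the third statement is shortened to "Spring View".

```python
import re


def split_sentences(text):
    """Naive sentence splitter (an assumption; any sentence segmenter could be used)."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def build_object_vector(ordered_set, text_library):
    """One dimension per text: 1 if some sentence of the text contains both the
    first and the last statement of the ordered set, otherwise 0."""
    first, last = ordered_set[0], ordered_set[-1]
    return [
        1 if any(first in s and last in s for s in split_sentences(text)) else 0
        for text in text_library
    ]


# Invented sample texts standing in for text 1 to text 4.
text_library = [
    "Tu Fu wrote Spring View during the war. It is widely read.",
    "Tu Fu was born in 712. Spring View is a regulated verse.",
    "Li Po admired the moon.",
    "Spring View by Tu Fu is a classic!",
]
c3 = build_object_vector(("Tu Fu", "works", "Spring View"), text_library)
```

Text 2 mentions both statements, but never in the same sentence, so its dimension stays 0 under the sentence-level rule described above.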
Steps S202-S203 above constitute the process of constructing, according to the first object and the text library, the set of object vectors corresponding to the first object.
Step S204: determine the similarity of every pair of object vectors in the set of object vectors.
In this embodiment, for any two object vectors Ci={Ci1,Ci2,…,Cit} and Cj={Cj1,Cj2,…,Cjt} in the set of object vectors, the similarity of Ci and Cj is denoted SimilarityLengthRatio(Ci,Cj) and can be calculated as follows:
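First, calculate the similarity numerator of the two object vectors: Similarity(Ci,Cj)=|Ci&Cj|, where Ci&Cj={Ci1&Cj1,Ci2&Cj2,…,Cit&Cjt} is a t-dimensional vector and |Ci&Cj| is the number of 1s among the values of the dimensions of Ci&Cj.
Here, for each dimension l,
$$C_{il} \,\&\, C_{jl} = \begin{cases} 1, & C_{il} = 1 \text{ and } C_{jl} = 1 \\ 0, & \text{otherwise} \end{cases}$$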
Then, calculate the similarity of the object vectors Ci and Cj:
$$\mathrm{SimilarityLengthRatio}(C_i, C_j) = \frac{\mathrm{Similarity}(C_i, C_j)}{\min(\mathrm{length}(C_i), \mathrm{length}(C_j))}$$
where Similarity(Ci,Cj) is the similarity numerator of the two object vectors, length(Ci) is the number of 1s among the values of the dimensions of Ci, length(Cj) is the number of 1s among the values of the dimensions of Cj, and min(length(Ci),length(Cj)) is the smaller of length(Ci) and length(Cj).
For example, suppose there are two object vectors C1={1,0,1,1,0,1} and C2={1,1,0,0,1,0}. According to the formulas above, C1&C2={1&1,0&1,1&0,1&0,0&1,1&0}, that is, C1&C2={1,0,0,0,0,0}. The number of 1s among the dimension values of C1&C2 is 1, the number of 1s in C1 is 4, and the number of 1s in C2 is 3. That is, |C1&C2|=1, length(C1)=4, and length(C2)=3, so Similarity(C1,C2)=|C1&C2|=1 and min(length(C1),length(C2))=3. The similarity of the two object vectors C1 and C2 is therefore SimilarityLengthRatio(C1,C2)=1/3.
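The similarity calculation of step S204 can be checked with a short Python sketch. The function name is taken from the text; returning 0 when either vector contains no 1s is an assumption, since the source does not say how an all-zero vector should be handled.

```python
from fractions import Fraction


def similarity_length_ratio(ci, cj):
    """|Ci & Cj| divided by min(length(Ci), length(Cj)), for 0/1 vectors."""
    overlap = sum(a & b for a, b in zip(ci, cj))  # number of 1s in Ci & Cj
    shorter = min(sum(ci), sum(cj))               # smaller count of 1s
    if shorter == 0:
        # Assumption: an all-zero vector is treated as similar to nothing.
        return Fraction(0)
    return Fraction(overlap, shorter)


c1 = [1, 0, 1, 1, 0, 1]
c2 = [1, 1, 0, 0, 1, 0]
ratio = similarity_length_ratio(c1, c2)
```

Using `Fraction` keeps the result exact, so it can be compared directly with thresholds such as 1/4 without floating-point rounding.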
Step S205: judge whether all of the similarities are less than a first threshold.
If not, perform steps S206-S207; if so, perform step S208.
For example, the first threshold can be 1/4. Alternatively, the first threshold can be 1/2, 1/6, or 1/8; it can be set by a technician according to actual conditions, and the embodiments of the present invention do not specifically limit the value of the first threshold.
In this step, the similarity of each pair of object vectors determined in step S204 above is compared with the first threshold, to judge whether the similarity of every pair of object vectors is less than the first threshold. If not, object vectors with high similarity still exist, so the cluster analysis must continue and steps S206-S207 are performed. If so, the cluster analysis ends and the current set of object vectors is the result of the cluster analysis. Because the number of object vectors in the cluster analysis result is greater than 1, it can be determined that the first object corresponding to the set of object vectors is mixed with ordered sets of other same-name instances and that same-name instance confusion occurs in the knowledge base; step S208 is performed.
Step S206, merge two object vectors of similarity maximum, and the object vector after merging is updated to target Object vector in the set of vector.
In this step, it is first determined two object vectors of similarity maximum, by two targets of similarity maximum to Amount merges operation, the object vector being then updated to the object vector after merging in the set of object vector.After renewal Object vector set in, the fresh target vector that merges will replace two maximum former object vectors of original similarity. That is, after union operation, two former object vectors that original similarity is maximum in object vector set will not exist.
In this embodiment, for any two vectors Ci and Cj in the set of object vectors, Ci = {Ci1, Ci2, …, Cit} and Cj = {Cj1, Cj2, …, Cjt}, let Cij denote the new object vector obtained after merging. The merge operation of Ci and Cj can then be implemented as follows:
Cij = {Ci1|Cj1, Ci2|Cj2, …, Cit|Cjt},
where, for each dimension k = 1, …, t, Cik|Cjk = 1 if Cik = 1 or Cjk = 1, and Cik|Cjk = 0 otherwise.
For example, suppose there are two object vectors C1 = {1,0,1,0} and C2 = {1,1,0,0}. In the merged object vector, the value of the first dimension is 1|1 = 1, the value of the second dimension is 0|1 = 1, the value of the third dimension is 1|0 = 1, and the value of the fourth dimension is 0|0 = 0, so the object vector obtained by merging C1 and C2 is {1,1,1,0}.
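The merge operation in this example amounts to an element-wise logical OR of two 0/1 vectors. A minimal Python sketch follows; the function name `merge_vectors` is hypothetical and is used only for illustration.

```python
def merge_vectors(ci, cj):
    """Merge two 0/1 object vectors dimension by dimension with logical OR."""
    if len(ci) != len(cj):
        raise ValueError("object vectors must have the same dimension")
    return [a | b for a, b in zip(ci, cj)]


print(merge_vectors([1, 0, 1, 0], [1, 1, 0, 0]))  # prints [1, 1, 1, 0]
```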
Step S207: judge whether the number of object vectors in the set of object vectors is 1.
If the judgment result is yes, that is, the number of object vectors in the set of object vectors is 1, the cluster analysis ends and the current set of object vectors is determined to be the result of the cluster analysis. Because the number of object vectors in the set is 1, it can be determined that no ordered sets of other examples of the same name are mixed into the first object corresponding to the set of object vectors, so it can be determined that no confusion of examples of the same name occurs in the knowledge base, and step S209 is performed.
If the judgment result is no, that is, the number of object vectors in the set of object vectors is not 1, the cluster analysis has not yet ended and needs to continue on the object vectors in the set. The process returns to step S204, the operation of determining the similarity of any two object vectors in the set of object vectors, until the number of object vectors in the set of object vectors is 1.
Step S208: determine that confusion of examples of the same name occurs in the knowledge base.
In this embodiment, when the number of object vectors in the cluster analysis result is greater than 1, it can be determined that ordered sets of other examples of the same name are mixed into the first object corresponding to the set of object vectors, so it can be determined that confusion of examples of the same name occurs in the knowledge base.
Preferably, the number of examples of the same name whose ordered sets are mixed into the first object can also be determined from the number of object vectors in the cluster analysis result, and thus the number of confused examples of the same name can be determined.
Step S209: determine that no confusion of examples of the same name occurs in the knowledge base.
Steps S204-S209 above are the process of performing cluster analysis on the object vectors and determining, according to the result of the cluster analysis, whether confusion of examples of the same name occurs in the knowledge base.
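Steps S204-S209 can be sketched as a single loop. The Python below is an illustrative sketch under stated assumptions, not a definitive implementation: the names `similarity` and `detect_confusion`, the default threshold of 1/4, and the treatment of a set containing only one vector as "no confusion" are choices made here for the example.

```python
from fractions import Fraction
from itertools import combinations


def similarity(c1, c2):
    """|C1 & C2| / min(number of 1s in C1, number of 1s in C2)."""
    shorter = min(sum(c1), sum(c2))
    if shorter == 0:
        return Fraction(0)  # assumption: all-zero vectors have similarity 0
    return Fraction(sum(a & b for a, b in zip(c1, c2)), shorter)


def detect_confusion(vectors, threshold=Fraction(1, 4)):
    """Return (confused, clusters) for the object vectors of one first object."""
    vectors = [list(v) for v in vectors]
    # S207 is the loop condition: stop when only one object vector remains.
    while len(vectors) > 1:
        # S204: similarity of every pair of object vectors.
        scored = [
            (similarity(vectors[i], vectors[j]), i, j)
            for i, j in combinations(range(len(vectors)), 2)
        ]
        # S205: are all similarities below the first threshold?
        if all(score < threshold for score, _, _ in scored):
            return True, vectors  # S208: more than one cluster remains
        # S206: merge the most similar pair and update the set.
        _, i, j = max(scored, key=lambda item: item[0])
        merged = [a | b for a, b in zip(vectors[i], vectors[j])]
        vectors = [v for k, v in enumerate(vectors) if k not in (i, j)]
        vectors.append(merged)
    return False, vectors  # S209: a single cluster, no confusion


confused, clusters = detect_confusion([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]])
print(confused, len(clusters))  # prints True 2
```

In this sketch the length of `clusters` corresponds to the number of examples of the same name that are mixed into the first object, as described for step S208.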
This embodiment of the present invention describes in detail the method for detecting confusion of examples of the same name in a knowledge base. It specifically provides the detailed process of constructing the set of object vectors corresponding to the first object, performing cluster analysis on the object vectors, and determining, according to the result of the cluster analysis, whether confusion of examples of the same name occurs in the knowledge base. This realizes automatic detection of confusion of examples of the same name in the knowledge base without manually checking each first object, which saves a large amount of manpower and greatly improves detection efficiency.
Embodiment three
Fig. 3 is a structural diagram of the device for detecting confusion of examples of the same name in a knowledge base provided by embodiment three of the present invention. The device provided by this embodiment can specifically be used to perform the process flow provided by method embodiment one above. As shown in Fig. 3, the device includes an acquisition module 301, a constructing module 302 and a cluster analysis module 303.
The acquisition module 301 is used to obtain a text library whose content is related to the content of the knowledge base. The text library includes at least one text, and each text includes at least one sentence; the knowledge base includes multiple examples, each example includes multiple ordered sets composed of N sentences, and N is a positive integer greater than or equal to 3. The constructing module 302 is used to obtain a first object and to construct, according to the first object and the text library, the set of object vectors corresponding to the first object, where the dimension of each object vector is equal to the number of texts in the text library and the first object is any one example in the knowledge base. The cluster analysis module 303 is used to perform cluster analysis on the object vectors and to determine, according to the result of the cluster analysis, whether confusion of examples of the same name occurs in the knowledge base.
The device provided by this embodiment of the present invention can specifically be used to perform the process flow of the method embodiment provided in embodiment one above; its specific functions are not described again here.
In this embodiment of the present invention, a text library whose content is related to the knowledge base is obtained; for a first object in the knowledge base, where the first object can be any one example in the knowledge base, the set of object vectors corresponding to each first object is constructed according to the text library; cluster analysis is performed on the object vectors, and whether confusion of examples of the same name occurs in the knowledge base is determined according to the result of the cluster analysis. This realizes automatic detection of confusion of examples of the same name in the knowledge base without manually checking each first object, which saves a large amount of manpower and greatly improves detection efficiency.
Embodiment four
Fig. 4 is a structural diagram of the device for detecting confusion of examples of the same name in a knowledge base provided by embodiment four of the present invention. On the basis of embodiment three above, in this embodiment the last sentence of at least one ordered set in each example of the knowledge base is the name of the example, and the first sentence of each ordered set is the example ID of the example corresponding to the ordered set. An example ID uniquely identifies one example.
As shown in Fig. 4, the constructing module 302 includes an acquisition submodule 3021 and a construction submodule 3022. The acquisition submodule 3021 is used to replace the example ID in the ordered sets of the first object with the name of the example corresponding to that example ID, to obtain the second target corresponding to the first object. The construction submodule 3022 is used to construct, according to the second target and the text library, the set of object vectors corresponding to the second target.
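The replacement performed by the acquisition submodule can be illustrated with a small Python sketch. Everything concrete here is hypothetical: the function name `to_second_target`, the `names_by_id` lookup table, and the sample IDs, names and sentences are invented for illustration, and ordered sets are assumed to be tuples of strings.

```python
def to_second_target(ordered_sets, names_by_id):
    """Replace every example ID in the ordered sets with the example's name."""
    return [
        tuple(names_by_id.get(sentence, sentence) for sentence in ordered_set)
        for ordered_set in ordered_sets
    ]


# Hypothetical first object: each ordered set starts with the example ID.
first_object = [
    ("E001", "name", "Li Na"),
    ("E001", "occupation", "tennis player"),
    ("E001", "occupation", "singer"),
]
names_by_id = {"E001": "Li Na"}  # hypothetical ID-to-name lookup

second_target = to_second_target(first_object, names_by_id)
print(second_target[1])  # prints ('Li Na', 'occupation', 'tennis player')
```

Replacing the ID with the name matters because the texts in the text library mention examples by name, not by their internal ID.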
The construction submodule 3022 is specifically used to: obtain the interim vector corresponding to each ordered set of the second target, where the dimension of each interim vector is equal to the number of texts in the text library; for each ordered set in the second target, determine whether each text includes all of the object statements in the ordered set; if the determination result is yes, set the value of the dimension corresponding to that text in the interim vector of the ordered set to a first target value; if the determination result is no, set the value of the dimension corresponding to that text in the interim vector of the ordered set to a second target value, thereby obtaining the object vector corresponding to the ordered set; and obtain the set of object vectors corresponding to the second target.
Here, the object statements are the sentences at at least two preset positions in the ordered set.
Optionally, the sentences at the at least two preset positions are the sentences at the first position and the last position in the ordered set.
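The vector construction described above can be sketched as follows. This is a minimal sketch under several assumptions: the function name `build_object_vectors` is hypothetical, "a text includes an object statement" is approximated by a plain substring test, the first and second target values are taken to be 1 and 0, and the sample texts are invented.

```python
def build_object_vectors(second_target, texts, hit=1, miss=0):
    """Build one object vector per ordered set, with one dimension per text."""
    vectors = []
    for ordered_set in second_target:
        # Object statements: the sentences at the first and last positions.
        object_statements = (ordered_set[0], ordered_set[-1])
        vector = [
            hit if all(s in text for s in object_statements) else miss
            for text in texts
        ]
        vectors.append(vector)
    return vectors


texts = [
    "Li Na is a tennis player who won the French Open.",
    "Li Na is a singer.",
    "The weather is fine.",
]
second_target = [
    ("Li Na", "occupation", "tennis player"),
    ("Li Na", "occupation", "singer"),
]
print(build_object_vectors(second_target, texts))  # prints [[1, 0, 0], [0, 1, 0]]
```

In this invented data the two ordered sets never co-occur in the same text, so their object vectors do not overlap, which is the pattern the later cluster analysis interprets as two different examples sharing one name.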
The cluster analysis module 303 is specifically used to: determine the similarity of any two object vectors in the set of object vectors; judge whether all of the similarities are less than the first threshold; if the judgment result is no, merge the two object vectors with the greatest similarity and update the set of object vectors with the merged object vector; and return to the operation of determining the similarity of any two object vectors in the set of object vectors, until the number of object vectors in the set is 1, and determine that no confusion of examples of the same name occurs in the knowledge base.
The cluster analysis module 303 is further used to determine, if the judgment result is yes, that confusion of examples of the same name occurs in the knowledge base.
The device provided by this embodiment of the present invention can specifically be used to perform the process flow of the method embodiment provided in embodiment two above; its specific functions are not described again here.
In this embodiment of the present invention, a text library whose content is related to the knowledge base is obtained; for a first object in the knowledge base, where the first object can be any one example in the knowledge base, the set of object vectors corresponding to each first object is constructed according to the text library; cluster analysis is performed on the object vectors, and whether confusion of examples of the same name occurs in the knowledge base is determined according to the result of the cluster analysis. This realizes automatic detection of confusion of examples of the same name in the knowledge base without manually checking each first object, which saves a large amount of manpower and greatly improves detection efficiency.
In the several embodiments provided by the present invention, it should be understood that the disclosed device and method can be implemented in other ways. For example, the device embodiments described above are only schematic. The division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed can be indirect coupling or communication connection through some interfaces, devices or units, and can be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they can be located in one place or distributed over multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in each embodiment of the present invention can be integrated in one processing unit, each unit can exist physically on its own, or two or more units can be integrated in one unit. The integrated unit can be implemented in the form of hardware, or in the form of hardware plus software functional units.
An integrated unit implemented in the form of software functional units can be stored in a computer-readable storage medium. The software functional units are stored in a storage medium and include several instructions that cause a computer device (which can be a personal computer, a server, a network device or the like) or a processor to perform some of the steps of the methods of the embodiments of the present invention. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), a magnetic disk or an optical disc.
Those skilled in the art can clearly understand that, for convenience and brevity of description, only the division into the functional modules above is used as an example. In practical applications, the functions can be assigned to different functional modules as needed; that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. For the specific working process of the device described above, reference can be made to the corresponding process in the foregoing method embodiments, which is not repeated here.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (12)

1. A method for detecting confusion of examples of the same name in a knowledge base, characterized by comprising:
obtaining a text library, wherein the content of the text library is related to the content of the knowledge base, the text library includes at least one text, each text includes at least one sentence, the knowledge base includes multiple examples, each example includes multiple ordered sets composed of N sentences, and N is a positive integer greater than or equal to 3;
obtaining a first object, and constructing, according to the first object and the text library, the set of object vectors corresponding to the first object, wherein the dimension of each object vector is equal to the number of texts in the text library, and the first object is any one example in the knowledge base;
performing cluster analysis on the object vectors, and determining, according to the result of the cluster analysis, whether confusion of examples of the same name occurs in the knowledge base.
2. The method according to claim 1, characterized in that the first sentence of each ordered set is the example ID of the example corresponding to the ordered set, the last sentence of at least one ordered set in each example of the knowledge base is the name of the example, and the example ID uniquely identifies one example,
and the constructing, according to the first object and the text library, the set of object vectors corresponding to the first object comprises:
replacing the example ID in the ordered sets of the first object with the name of the example corresponding to the example ID, to obtain the second target corresponding to the first object;
constructing, according to the second target and the text library, the set of object vectors corresponding to the second target.
3. The method according to claim 2, characterized in that the constructing, according to the second target and the text library, the set of object vectors corresponding to the second target comprises:
obtaining the interim vector corresponding to each ordered set of the second target, wherein the dimension of each interim vector is equal to the number of texts in the text library;
for each ordered set in the second target, determining whether each text includes all of the object statements in the ordered set;
if the determination result is yes, setting the value of the dimension corresponding to the text in the interim vector of the ordered set to a first target value; if the determination result is no, setting the value of the dimension corresponding to the text in the interim vector of the ordered set to a second target value, to obtain the object vector corresponding to the ordered set;
obtaining the set of object vectors corresponding to the second target;
wherein the object statements are the sentences at at least two preset positions in the ordered set.
4. The method according to claim 3, characterized in that the sentences at the at least two preset positions are the sentences at the first position and the last position in the ordered set.
5. The method according to claim 4, characterized in that the performing cluster analysis on the object vectors and determining, according to the result of the cluster analysis, whether confusion of examples of the same name occurs in the knowledge base comprises:
determining the similarity of any two object vectors in the set of object vectors;
judging whether all of the similarities are less than a first threshold;
if the judgment result is no, merging the two object vectors with the greatest similarity, and updating the set of object vectors with the merged object vector;
returning to the operation of determining the similarity of any two object vectors in the set of object vectors, until the number of object vectors in the set is 1, and determining that no confusion of examples of the same name occurs in the knowledge base.
6. The method according to claim 5, characterized in that the performing cluster analysis on the object vectors and determining, according to the result of the cluster analysis, whether confusion of examples of the same name occurs in the knowledge base further comprises:
if the judgment result is yes, determining that confusion of examples of the same name occurs in the knowledge base.
7. A device for detecting confusion of examples of the same name in a knowledge base, characterized by comprising:
an acquisition module, used to obtain a text library, wherein the content of the text library is related to the content of the knowledge base, the text library includes at least one text, each text includes at least one sentence, the knowledge base includes multiple examples, each example includes multiple ordered sets composed of N sentences, and N is a positive integer greater than or equal to 3;
a constructing module, used to obtain a first object and to construct, according to the first object and the text library, the set of object vectors corresponding to the first object, wherein the dimension of each object vector is equal to the number of texts in the text library, and the first object is any one example in the knowledge base;
a cluster analysis module, used to perform cluster analysis on the object vectors and to determine, according to the result of the cluster analysis, whether confusion of examples of the same name occurs in the knowledge base.
8. The device according to claim 7, characterized in that the last sentence of at least one ordered set in each example of the knowledge base is the name of the example, the first sentence of each ordered set is the example ID of the example corresponding to the ordered set, and the example ID uniquely identifies one example,
and the constructing module comprises:
an acquisition submodule, used to replace the example ID in the ordered sets of the first object with the name of the example corresponding to the example ID, to obtain the second target corresponding to the first object;
a construction submodule, used to construct, according to the second target and the text library, the set of object vectors corresponding to the second target.
9. The device according to claim 8, characterized in that the construction submodule is specifically used to:
obtain the interim vector corresponding to each ordered set of the second target, wherein the dimension of each interim vector is equal to the number of texts in the text library;
for each ordered set in the second target, determine whether each text includes all of the object statements in the ordered set;
if the determination result is yes, set the value of the dimension corresponding to the text in the interim vector of the ordered set to a first target value; if the determination result is no, set the value of the dimension corresponding to the text in the interim vector of the ordered set to a second target value, to obtain the object vector corresponding to the ordered set;
obtain the set of object vectors corresponding to the second target;
wherein the object statements are the sentences at at least two preset positions in the ordered set.
10. The device according to claim 9, characterized in that the sentences at the at least two preset positions are the sentences at the first position and the last position in the ordered set.
11. The device according to claim 10, characterized in that the cluster analysis module is specifically used to:
determine the similarity of any two object vectors in the set of object vectors;
judge whether all of the similarities are less than a first threshold;
if the judgment result is no, merge the two object vectors with the greatest similarity, and update the set of object vectors with the merged object vector;
return to the operation of determining the similarity of any two object vectors in the set of object vectors, until the number of object vectors in the set is 1, and determine that no confusion of examples of the same name occurs in the knowledge base.
12. The device according to claim 11, characterized in that the cluster analysis module is further used to:
if the judgment result is yes, determine that confusion of examples of the same name occurs in the knowledge base.
CN201610974455.2A 2016-10-27 2016-10-27 Method and device for confusion detection of synonym instances of knowledge base Expired - Fee Related CN108009184B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610974455.2A CN108009184B (en) 2016-10-27 2016-10-27 Method and device for confusion detection of synonym instances of knowledge base

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610974455.2A CN108009184B (en) 2016-10-27 2016-10-27 Method and device for confusion detection of synonym instances of knowledge base

Publications (2)

Publication Number Publication Date
CN108009184A true CN108009184A (en) 2018-05-08
CN108009184B CN108009184B (en) 2021-08-27

Family

ID=62048538

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610974455.2A Expired - Fee Related CN108009184B (en) 2016-10-27 2016-10-27 Method and device for confusion detection of synonym instances of knowledge base

Country Status (1)

Country Link
CN (1) CN108009184B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070233656A1 (en) * 2006-03-31 2007-10-04 Bunescu Razvan C Disambiguation of Named Entities
CN103500208A (en) * 2013-09-30 2014-01-08 中国科学院自动化研究所 Deep layer data processing method and system combined with knowledge base
CN103699689A (en) * 2014-01-09 2014-04-02 百度在线网络技术(北京)有限公司 Method and device for establishing event repository
CN104615687A (en) * 2015-01-22 2015-05-13 中国科学院计算技术研究所 Entity fine granularity classifying method and system for knowledge base updating
CN105550336A (en) * 2015-12-22 2016-05-04 北京搜狗科技发展有限公司 Mining method and device of single entity instance

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李超: "面向新闻领域的人名消歧方法研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Also Published As

Publication number Publication date
CN108009184B (en) 2021-08-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230626

Address after: Room 3007, Hengqin International Financial Center Building, No. 58, Huajin Street, Hengqin New District, Haidian District, Beijing

Patentee after: New founder holdings development Co.,Ltd.

Patentee after: Peking University

Patentee after: FOUNDER APABI TECHNOLOGY Ltd.

Address before: 100871, Beijing, Haidian District, Cheng Fu Road, No. 298, Zhongguancun Fangzheng building, 9 floor

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: Peking University

Patentee before: FOUNDER APABI TECHNOLOGY Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210827
