CN108415950A - Hypernym aggregation method and apparatus - Google Patents

Hypernym aggregation method and apparatus

Info

Publication number
CN108415950A
CN108415950A (application CN201810100677.0A)
Authority
CN
China
Prior art keywords
hypernym
pending
entity type
word
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810100677.0A
Other languages
Chinese (zh)
Other versions
CN108415950B (en)
Inventor
郑孙聪
李潇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201810100677.0A priority Critical patent/CN108415950B/en
Publication of CN108415950A publication Critical patent/CN108415950A/en
Application granted granted Critical
Publication of CN108415950B publication Critical patent/CN108415950B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology

Abstract

The present invention relates to information processing technology, and in particular to a hypernym aggregation method and apparatus, intended to improve the accuracy of hypernym aggregation. The method is as follows: a terminal device computes the word-vector similarity between hypernyms from the character vectors contained in each hypernym, computes the entity-type similarity between hypernyms from the entity types associated with the entities corresponding to each hypernym, and aggregates the hypernyms whose word-vector similarity reaches a first preset threshold and whose entity-type similarity reaches a second preset threshold. In this way, short texts such as hypernyms can be handled effectively: the key textual information contained in a hypernym is mined, and the type features of the hypernym are characterized accurately. At the same time, the heavy workload of manual feature engineering is avoided, the generalization ability of the model is enhanced, invalid hypernyms are identified efficiently, redundant data among hypernyms is removed, and the accuracy of hypernym aggregation is significantly improved.

Description

Hypernym aggregation method and apparatus
Technical field
The present invention relates to information processing technology, and in particular to a hypernym aggregation method and apparatus.
Background technology
In a hypernym network generated from a knowledge graph, hypernyms with the same semantics usually need to be aggregated in order to avoid hypernym redundancy; that is, for one meaning, the hypernyms that express it in different ways are extracted and merged. For example, the hypernyms for an SLR camera include "SLR camera", "what is commonly called an SLR camera", "single-lens reflex camera", "SLR", and so on. Hypernyms that share the same semantics but are described differently are called identically-semantic hypernyms. The process of merging these identically-semantic hypernyms and representing them with one common name is called hypernym aggregation. Merging hypernyms with the same semantics reduces redundancy in the hypernym network and improves its quality.
In the prior art, hypernym aggregation is usually realized in two ways.
The first way is to cluster texts with similar semantics. A common approach represents each text with features such as word vectors, bag-of-words, or topic models, and then applies a standard clustering algorithm such as k-means, hierarchical clustering, or spectral clustering to obtain sets of similar texts.
The first way can group together semantically similar texts that contain a relatively large number of characters, i.e. it only handles the aggregation of semantically similar long texts, and its aggregation accuracy is relatively low. Aggregating hypernyms with identical semantics is a high-precision semantic clustering task, so the first way has difficulty solving it effectively.
The second way works mainly from the angle of string similarity and uses methods such as edit distance to merge short texts whose wording is very similar. The second way can handle the aggregation of identically-semantic hypernyms, but it only captures the string information of a hypernym and judges whether two hypernyms describe the same thing by computing the similarity between their strings. In practice, the same thing often has different descriptions; for example, the Chinese words for "children" and "child" have the same meaning but entirely different character strings. Therefore, merging semantically similar hypernyms on the basis of edit distance also has limitations.
In view of this, a new hypernym aggregation method needs to be designed to overcome the above drawbacks.
Summary of the invention
Embodiments of the present invention provide a hypernym aggregation method and apparatus, so as to improve the accuracy of hypernym aggregation.
The specific technical solutions provided by the embodiments of the present invention are as follows:
A hypernym aggregation method, including:
obtaining multiple hypernyms to be processed, and determining the character vector of each character contained in each hypernym to be processed;
computing the word vector of each hypernym to be processed from the obtained character vectors according to a specified algorithm;
determining the entity types associated with each hypernym to be processed in a knowledge graph;
computing, based on the word vector and the associated entity types of each hypernym to be processed, the word-vector similarity and the entity-type similarity between every two hypernyms to be processed; and
aggregating the corresponding hypernyms to be processed when the word-vector similarity reaches a first preset threshold and the entity-type similarity reaches a second preset threshold.
A hypernym aggregation apparatus, including:
a first determination unit, configured to obtain multiple hypernyms to be processed, determine the character vector of each character contained in each hypernym to be processed, and compute the word vector of each hypernym to be processed from the obtained character vectors according to a specified algorithm;
a second determination unit, configured to determine the entity types associated with each hypernym to be processed in a knowledge graph;
a computing unit, configured to compute, based on the word vector and the associated entity types of each hypernym to be processed, the word-vector similarity and the entity-type similarity between every two hypernyms to be processed; and
an aggregation unit, configured to aggregate the corresponding hypernyms to be processed when the word-vector similarity reaches a first preset threshold and the entity-type similarity reaches a second preset threshold.
A storage medium, storing a program for realizing the hypernym aggregation method, wherein when the program is run by a processor the following steps are executed:
obtaining multiple hypernyms to be processed, and determining the character vector of each character contained in each hypernym to be processed;
computing the word vector of each hypernym to be processed from the obtained character vectors according to a specified algorithm;
determining the entity types associated with each hypernym to be processed in a knowledge graph;
computing, based on the word vector and the associated entity types of each hypernym to be processed, the word-vector similarity and the entity-type similarity between every two hypernyms to be processed; and
aggregating the corresponding hypernyms to be processed when the word-vector similarity reaches a first preset threshold and the entity-type similarity reaches a second preset threshold.
A computer device, including one or more processors and one or more computer-readable media storing instructions which, when executed by the one or more processors, cause the device to execute any one of the above methods.
In the embodiments of the present invention, the terminal device computes the word-vector similarity between hypernyms from the character vectors contained in each hypernym, computes the entity-type similarity between hypernyms from the entity types associated with the entities corresponding to each hypernym, and aggregates the hypernyms whose word-vector similarity reaches the first preset threshold and whose entity-type similarity reaches the second preset threshold. Since a hypernym is typically composed of only a few characters, conventional word segmentation would introduce considerable error and information loss. Therefore, in the embodiments of the present invention, the similarity between hypernyms is judged from the word vector characterized by the character vectors the hypernym contains and from the entity types associated with the hypernym. This handles short texts such as hypernyms effectively: the key textual information contained in a hypernym is mined, and the type features of the hypernym are depicted accurately. At the same time, the heavy workload of manual feature engineering is avoided, the generalization ability of the model is enhanced, invalid hypernyms are identified efficiently, redundant data among hypernyms is removed, and the accuracy of hypernym aggregation is significantly improved.
Description of the drawings
Fig. 1 is a schematic diagram of an example knowledge graph in the prior art;
Fig. 2 is a schematic diagram of an example entity type in the prior art;
Fig. 3 is a schematic flowchart of hypernym aggregation based on a knowledge graph in an embodiment of the present invention;
Fig. 4A is a schematic diagram of the association between hypernyms to be processed and entities in an embodiment of the present invention;
Fig. 4B is a schematic diagram of the association between entities and entity types in an embodiment of the present invention;
Fig. 5 is a schematic diagram of the functional structure of a terminal device in an embodiment of the present invention;
Fig. 6 is a schematic diagram of the functional structure of a computer device in an embodiment of the present invention.
Detailed description of the embodiments
In order to improve the accuracy of hypernym aggregation, in the embodiments of the present invention the word vector of each hypernym to be processed is determined from the character vectors the hypernym contains, and the entity types corresponding to each hypernym to be processed in the knowledge graph are combined with it to judge the semantic similarity between hypernyms to be processed, so that hypernyms with identical semantics are selected and aggregated. This considers both the semantic information of the hypernym itself and the entity-type information associated with the hypernym, and can therefore meet the demand for high-precision semantic fusion.
The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.
To facilitate the description, some terms are defined first.
Knowledge graph (Knowledge Graph/Vault), also known as a knowledge domain map, is known in library and information science as a knowledge-domain visualization or knowledge-domain mapping map. It is a series of diagrams that show the development of knowledge and its structural relationships, using visualization techniques to describe knowledge resources and their carriers, and to mine, analyze, construct, draw, and display knowledge and the interconnections among its parts.
As shown in Fig. 1, a node in a knowledge graph is called an entity; an entity is the object the knowledge graph describes. For example, suppose a node is "Liu"; it represents an entity whose attribute set includes occupation, date of birth, hobbies, and so on.
Hypernym: a hypernym is a descriptive term with a broader conceptual extension.
For example, "carnivore" is a hypernym of "tiger", and "felid" can also be a hypernym of "tiger"; a hypernym can therefore be understood as a cluster category obtained by grouping entities according to their attribute features.
For instance, clustering "tiger" by the attribute "carnivorous" yields the hypernym "carnivorous animal", and clustering "tiger" by the attribute "family Felidae" yields the hypernym "felid".
Entity type: every entity in the knowledge graph corresponds to an entity type, which can be regarded as a general categorization of the entity. One entity type may contain multiple entities. For example, the entity type of the entity "rose" is "plant", and the entity type of the film "Wolf Warrior 2" is "film".
For example, as shown in Fig. 2, the entities "tiger", "tortoise", and "butterfly" share the same entity type, "animal".
Word vector: a distributed representation of a word. The basic idea is to map each word to a vector of fixed dimension (much smaller than the vocabulary size); these vectors form a word-vector semantic space in which semantically similar words are usually close to each other.
Character vector: a distributed representation at the character level. Each character is mapped into the semantic space to obtain a semantic vector for that character; the character vectors of semantically similar characters are usually close to each other in the space.
Dense Interpolated Embedding (DIE): a method for synthesizing a word vector from character vectors; experience has shown that it can represent similarly written strings effectively.
In the embodiments of the present invention, at the preprocessing stage, the terminal device may train character vectors with the word2vec tool on an encyclopedia corpus. The plain-text corpus used for training the character vectors comes from the same source as the corpus the hypernyms are drawn from. This ensures that the character vectors of the characters contained in a hypernym accurately characterize the hypernym at the text level, which lays a good foundation for generating word vectors later.
Specifically, the plain text is first segmented into characters: a run of consecutive Latin letters is treated as one token, a run of consecutive digits is treated as one token, and each Chinese character is treated as one token. Character vectors are then trained on the character-segmented plain-text corpus with the word2vec model, for use by the DIE algorithm. Because DIE concatenates character vectors, the dimension of a character vector is generally set small; optionally, in the embodiments of the present invention, the dimension of a character vector is set to 25, i.e. each character vector has features in 25 dimensions.
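For illustration, the character-level segmentation and character-vector training described above could be sketched as follows. This is a minimal sketch, assuming Python with gensim 4.x; the corpus file name is a hypothetical placeholder, and only the 25-dimension setting and the segmentation rules come from the text above.

```python
import re
from gensim.models import Word2Vec

def segment_into_chars(text):
    """Character-level segmentation as described above: a run of Latin
    letters is one token, a run of digits is one token, and every other
    non-space character (e.g. a Chinese character) is its own token."""
    return re.findall(r"[A-Za-z]+|[0-9]+|\S", text)

# Hypothetical encyclopedia corpus, one document per line.
with open("baike_corpus.txt", encoding="utf-8") as f:
    sentences = [segment_into_chars(line.strip()) for line in f if line.strip()]

# Train 25-dimensional character vectors with word2vec (gensim 4.x API assumed).
char_model = Word2Vec(sentences=sentences, vector_size=25, window=5,
                      min_count=1, sg=1, workers=4)
char_model.save("char_vectors.w2v")
```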
As shown in Fig. 3, in an embodiment of the present invention, the detailed procedure by which the terminal device aggregates hypernyms is as follows:
Step 300: the terminal device obtains multiple hypernyms to be processed, determines the character vector of each character contained in each hypernym to be processed, and computes the word vector of each hypernym to be processed from the obtained character vectors according to a specified algorithm.
Optionally, the specified algorithm used by the terminal device may be the DIE algorithm.
Taking any one hypernym to be processed (hereinafter referred to as hypernym x) as an example, step 300 is executed as follows:
In the embodiment of the present invention, in order to capture both the string information and the textual semantic information of hypernym x, the word vector of hypernym x is optionally synthesized with the DIE algorithm.
The basic idea of the DIE algorithm is that the word vector of hypernym x is composed from the character vectors of hypernym x: character vectors at different positions make up different parts of the word vector, which preserves the order information of the string. In addition, since character vectors are trained on large-scale unstructured text and carry a certain amount of semantic similarity, the word vector of hypernym x synthesized from character vectors has certain semantic features. The specific procedure is as follows:
First, at least two preset sub-regions corresponding to hypernym x are determined, where each sub-region corresponds to part of the dimensions of the word vector of hypernym x.
Next, the region feature of each sub-region is computed based on the character vectors of the hypernym to be processed.
Specifically, the following operations may be performed for each sub-region:
based on the preset number of sub-regions and the number of character vectors contained in hypernym x, determining the weight of each character vector of hypernym x in the sub-region;
computing the region feature of hypernym x in the sub-region from the character vectors and their weights in that sub-region.
Finally, the word vector of hypernym x is computed from the obtained region features.
For example, the DIE algorithm can be described by the following formulas:
v[m] = char_0 · f(0, m) + char_1 · f(1, m) + ... + char_{I-1} · f(I-1, m), m ∈ [0, M-1]
V = [v[0], ..., v[m], ..., v[M-1]]
where i is the index of a character vector and I is the number of character vectors; m is the index of a sub-region and M is the number of sub-regions, so the dimension of the synthesized word vector is M times that of a character vector; char_i is the character vector corresponding to the i-th character of the hypernym; v[m] is the region feature of sub-region m; and V is the word vector of the hypernym. In the embodiment of the present invention, a region feature refers to the text-level feature carried by the part of the dimensions of the word vector that corresponds to the sub-region.
For example, suppose hypernym x is "哺乳动物" (mammal), the dimension of the word vector is 100, and four sub-regions [1,25], [26,50], [51,75], [76,100] have been divided. Then:
v[0] = char(哺) × f(0,0) + char(乳) × f(1,0) + char(动) × f(2,0) + char(物) × f(3,0)
v[1] = char(哺) × f(0,1) + char(乳) × f(1,1) + char(动) × f(2,1) + char(物) × f(3,1)
v[2] = char(哺) × f(0,2) + char(乳) × f(1,2) + char(动) × f(2,2) + char(物) × f(3,2)
v[3] = char(哺) × f(0,3) + char(乳) × f(1,3) + char(动) × f(2,3) + char(物) × f(3,3)
V = [v[0], v[1], v[2], v[3]]
where char(·) denotes the character vector of the character in parentheses.
Here f(i, m) is the weight of the i-th character vector in the m-th sub-region, where i is the index of the character vector and m is the index of the sub-region; for example, f(0,0) is the weight of the 0th character vector, char(哺), in the 0th sub-region [1,25].
The terminal device divides the word vector of the hypernym to be processed into regions; each sub-region corresponds to part of the dimensions of the word vector, i.e. each sub-region has its own region feature. A hypernym to be processed contains multiple character vectors, and different character vectors contribute differently to the region feature of each sub-region. Therefore, assigning each character vector its own weight in each sub-region lets the region feature of a sub-region be embodied mainly by the dimensions of the character vectors with larger weights. In this way, each region feature focuses on the text features of part of the character vectors, which improves the textual specificity and accuracy of each region feature, and in turn improves the accuracy of the finally computed word vector.
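As an illustration of step 300, a minimal sketch of the DIE-style synthesis is given below. The patent does not fix the weighting function f(i, m); the sketch assumes a triangular interpolation weight based on the character's relative position, so the weighting itself is an assumption, and `char_vectors` stands for the 25-dimensional character vectors looked up for the hypernym.

```python
import numpy as np

def die_word_vector(char_vectors, num_regions=4):
    """Synthesize the word vector of a hypernym from its character vectors.

    char_vectors: list of I arrays, one character vector per character.
    num_regions:  M, the number of sub-regions; the synthesized word vector
                  has M times the dimension of a character vector.
    """
    char_vectors = [np.asarray(c, dtype=float) for c in char_vectors]
    I = len(char_vectors)
    dim = char_vectors[0].shape[0]
    M = num_regions
    V = np.zeros(M * dim)

    for m in range(M):
        v_m = np.zeros(dim)                        # region feature v[m]
        for i, char_vec in enumerate(char_vectors):
            # Assumed weight f(i, m): a character contributes most to the
            # sub-region closest to its relative position in the hypernym.
            pos = M * i / I
            weight = max(0.0, 1.0 - abs(pos - m))
            v_m += weight * char_vec
        V[m * dim:(m + 1) * dim] = v_m
    return V
```

For the "哺乳动物" example above (four characters, four sub-regions, 25-dimensional character vectors), this returns a 100-dimensional vector V = [v[0], v[1], v[2], v[3]].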
Step 310: the terminal device determines the entity types associated with each hypernym to be processed in the knowledge graph.
In the embodiment of the present invention, one hypernym to be processed may correspond to several entities in the knowledge graph, and these entities often correspond to at least one entity type. An entity type is a general categorization of entities and reflects the features of an entity in a certain respect.
For example, referring to Fig. 4A and Fig. 4B, suppose the hypernym to be processed is "stars of the nineties" and it corresponds to several entities in the knowledge graph, e.g. "Liu", "Zhang", "Li", and "Guo".
Here "Liu" and "Li" both correspond to the entity type "film and television star", and "Liu" and "Guo" both correspond to "singer". Clearly, "Liu" corresponds to two different entity types, while "Li" and "Guo" correspond to different entity types.
For such cases, when step 310 is executed, taking any one hypernym to be processed (hereinafter hypernym x) as an example, the terminal device may determine all entities corresponding to hypernym x in the knowledge graph, determine the entity types associated with each of these entities, and select the N entity types associated with the largest numbers of entities as the entity types associated with hypernym x, where N is a preset natural number and N ≥ 1.
For example, suppose hypernym x is "most popular male of XX" and its associated entities in the knowledge graph include "Sun Yang", "Wu X", "Liu X", "Yuan X", "Wang X", "Xiao X", and so on.
Among them, "Sun Yang" and "Liu X" correspond to the entity type "athlete", "Wu X" corresponds to "celebrity", "Yuan X" corresponds to "scientist", and "Wang X" and "Xiao X" correspond to "internet celebrity".
Suppose that among all entities corresponding to hypernym x, 20 correspond to "athlete", 50 correspond to "celebrity", 5 correspond to "scientist", and 40 correspond to "internet celebrity".
Then, after screening with N = 3, hypernym x is finally determined to correspond to three entity types: "celebrity", "internet celebrity", and "athlete".
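The selection of entity types in step 310 amounts to counting, over all entities linked to the hypernym, how many entities fall under each type and keeping the N most frequent types. A sketch under that reading follows; the two knowledge-graph lookups are hypothetical placeholders, not an existing API.

```python
from collections import Counter

def associated_entity_types(hypernym, kg, n=3):
    """Return the N entity types associated with the most entities of the hypernym.

    kg is assumed to expose two lookups (placeholders, not a real API):
      kg.entities_of(hypernym) -> iterable of entities linked to the hypernym
      kg.types_of(entity)      -> iterable of entity types of an entity
    """
    type_counts = Counter()
    for entity in kg.entities_of(hypernym):
        for entity_type in kg.types_of(entity):
            type_counts[entity_type] += 1
    return [t for t, _ in type_counts.most_common(n)]
```

With the counts in the example above (athlete 20, celebrity 50, scientist 5, internet celebrity 40) and N = 3, it returns "celebrity", "internet celebrity", and "athlete".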
Step 320: based on the word vector and the associated entity types of each hypernym to be processed, the terminal device computes the word-vector similarity and the entity-type similarity between every two hypernyms to be processed.
Taking any pair of hypernyms to be processed as an example, hereinafter referred to as hypernym x and hypernym y:
First, the word-vector similarity between the word vector of hypernym x and the word vector of hypernym y may be computed, denoted sim1.
Next, the entity-type similarity between the entity types of hypernym x and the entity types of hypernym y may be computed, denoted sim2.
Specifically, the entity types associated with hypernym x and the entity types associated with hypernym y are determined first. If hypernym x and/or hypernym y is associated with at least two entity types, the entity-type similarity is computed for every pair of entity types between hypernym x and hypernym y, and the highest similarity value is chosen as the final entity-type similarity.
For example, if the entity type associated with hypernym x is "film and television star" and the entity types associated with hypernym y are "film and television star" and "singer", the following two entity-type similarities are computed:
hypernym x "film and television star" & hypernym y "film and television star" = 100%
hypernym x "film and television star" & hypernym y "singer" = 40%
and 100% is taken as the final entity-type similarity between hypernym x and hypernym y.
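Step 320 can be sketched as follows. The patent does not specify how the similarity between two word vectors, or between two individual entity types, is measured; the sketch assumes cosine similarity for word vectors and takes the entity-type scorer `type_sim` as a caller-supplied function, so both choices are assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two word vectors (assumed measure)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def pairwise_similarities(word_vec_x, types_x, word_vec_y, types_y, type_sim):
    """Return (sim1, sim2) for hypernyms x and y:
    sim1 = word-vector similarity, sim2 = highest entity-type similarity."""
    sim1 = cosine_similarity(word_vec_x, word_vec_y)
    # Compare every pair of entity types and keep the highest value.
    sim2 = max((type_sim(tx, ty) for tx in types_x for ty in types_y), default=0.0)
    return sim1, sim2
```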
Step 330: when the word-vector similarity reaches the first preset threshold and the entity-type similarity reaches the second preset threshold, the terminal device aggregates the corresponding hypernyms to be processed.
Again taking any pair of hypernyms to be processed as an example, hereinafter hypernym x and hypernym y:
Specifically, if sim1 and sim2 between hypernym x and hypernym y satisfy the following conditions, hypernym x and hypernym y are the most similar hypernyms and can be aggregated:
sim1 ≥ T1
sim2 ≥ T2
where T1 is the first preset threshold and T2 is the second preset threshold. T1 and T2 can be configured by operations staff according to practical experience, and the details are not repeated here.
The above steps 300-330 describe only one aggregation pass. The terminal device can apply this scheme repeatedly to search the hypernyms to be processed (which may include already aggregated hypernyms) for hypernyms with identical semantics and aggregate them again and again, finally obtaining the most accurately aggregated hypernyms.
After the aggregation, each hypernym to be processed for which no most-similar hypernym was found is treated by the terminal device as a class of its own, and each hypernym to be processed for which a most-similar hypernym was found becomes one class together with that hypernym. The result is the hypernyms after aggregation of each type, which have been precisely screened and merged and from which redundant data has been removed.
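Combining steps 300 to 330, one possible reading of the repeated aggregation is a grouping loop in which two hypernyms fall into the same group whenever sim1 ≥ T1 and sim2 ≥ T2. The union-find sketch below follows that reading and reuses `pairwise_similarities` from the previous sketch; T1 and T2 are the configurable thresholds mentioned above.

```python
def aggregate(hypernyms, word_vecs, types, type_sim, t1, t2):
    """Group hypernyms whose word-vector similarity reaches t1 and whose
    entity-type similarity reaches t2 (a union-find sketch)."""
    parent = list(range(len(hypernyms)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i in range(len(hypernyms)):
        for j in range(i + 1, len(hypernyms)):
            sim1, sim2 = pairwise_similarities(word_vecs[i], types[i],
                                               word_vecs[j], types[j], type_sim)
            if sim1 >= t1 and sim2 >= t2:
                parent[find(i)] = find(j)   # merge the two groups

    groups = {}
    for i, h in enumerate(hypernyms):
        groups.setdefault(find(i), []).append(h)
    return list(groups.values())            # unmatched hypernyms stay as singleton groups
```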
Further, in order to improve aggregation accuracy, optionally, an additional check of aggregation accuracy may be performed on each group of hypernyms to be processed after aggregation. Specifically, the terminal device may perform the following operations for each group of hypernyms to be processed after aggregation:
a) The terminal device determines the similar text parts between the hypernyms to be processed in the group.
Before determining the similar text parts, optionally, the terminal device may first remove the stop words and redundant characters in each hypernym to be processed, for example rarely used characters and modal particles. The stop words are given by a stop-word list, and the redundant characters are characters contained in the hypernyms to be processed that carry no practical meaning.
Then, the terminal device searches for the similar text parts between the hypernyms to be processed. For example, between "most popular singer" and "singer with the most popularity", "singer" and "singer" can be regarded as a similar text part, and "most" and "most" can also be regarded as a similar text part.
b) The terminal device deletes the similar text parts between the hypernyms to be processed.
After "most"/"most" and "singer"/"singer" are deleted, the remaining text parts are "popular" and "popularity".
c) The terminal device computes the semantic similarity of the remaining text parts between the hypernyms to be processed and the average number of characters contained in the remaining text parts.
d) When it is determined that the semantic similarity of the remaining text parts reaches a third preset threshold and the average number of characters contained in the remaining text parts is below a fourth preset threshold, or that the remaining text parts are empty, the aggregation performed on the group of hypernyms to be processed is judged to be valid.
Since "popular" and "popularity" are semantically close, and the average number of characters in the remaining text parts is only 2, which is below the fourth preset threshold of 2.2, the merge is judged valid, i.e. the two hypernyms "most popular singer" and "singer with the most popularity" can be aggregated.
The above process is described in further detail below through two embodiments.
Embodiment 1:
The hypernyms to be processed are: "works of the poet Chen Xiangyan", "works of the poet Mei Shaojing", "works of the poet Zhao Gong", "works of the poet Lu Zhi", and "works of the poet Wang Yi".
Although these hypernyms to be processed look similar, their actual key information is inconsistent. After the similar text parts are removed, the remaining text parts are: Chen Xiangyan, Mei Shaojing, Zhao Gong, Lu Zhi, Wang Yi.
The semantic similarity between these remaining text parts is below the third preset threshold, and the average number of characters they contain is about 2.4, which is above the fourth preset threshold of 2.2; this indicates that the aggregation is invalid, and these hypernyms to be processed cannot be aggregated.
Embodiment 2:
The hypernyms to be processed are: "simple home-style steamed dumplings", "home-style steamed dumplings", and "steamed dumplings". After the similar text parts, stop words, and redundant characters are removed, the remaining text parts are: NULL, NULL, and NULL.
Since the remaining text parts are empty, the aggregation is valid and these hypernyms to be processed can be merged.
Further, after it is determined that a group of hypernyms to be processed can be aggregated, the longest common substring among the hypernyms to be processed in the aggregated group may be used as the name of the group after aggregation.
For example, for "simple home-style steamed dumplings", "home-style steamed dumplings", and "steamed dumplings", the longest common substring is "steamed dumplings", so "steamed dumplings" may be used to name the aggregated hypernyms, which also effectively improves query efficiency when these hypernyms are retrieved and used.
Based on the above embodiments, as shown in Fig. 5, in an embodiment of the present invention, the terminal device includes at least a first determination unit 51, a second determination unit 52, a computing unit 53, and an aggregation unit 54, where:
the first determination unit 51 is configured to obtain multiple hypernyms to be processed, determine the character vector of each character contained in each hypernym to be processed, and compute the word vector of each hypernym to be processed from the obtained character vectors according to a specified algorithm;
the second determination unit 52 is configured to determine the entity types associated with each hypernym to be processed in the knowledge graph;
the computing unit 53 is configured to compute, based on the word vector and the associated entity types of each hypernym to be processed, the word-vector similarity and the entity-type similarity between every two hypernyms to be processed; and
the aggregation unit 54 is configured to aggregate the corresponding hypernyms to be processed when the word-vector similarity reaches the first preset threshold and the entity-type similarity reaches the second preset threshold.
Optionally, the specified algorithm used by the first determination unit 51 is the Dense Interpolated Embedding (DIE) algorithm.
When computing the word vector of each hypernym to be processed from the obtained character vectors according to the specified algorithm, the first determination unit 51 is configured to:
determine at least two preset sub-regions for the hypernym to be processed according to the dimension of the word vector of the hypernym, where each sub-region corresponds to part of the dimensions of the word vector;
compute the region feature of each sub-region based on the character vectors of the hypernym to be processed; and
compute the word vector of the hypernym to be processed from the obtained region features.
Optionally, when computing the region feature of each sub-region based on the character vectors of the hypernym to be processed, the first determination unit 51 is configured to perform the following operations for each sub-region:
determine, based on the preset number of sub-regions and the number of character vectors contained in the hypernym to be processed, the weight of each character vector of the hypernym to be processed in the sub-region; and
compute the region feature of the hypernym to be processed in the sub-region from the character vectors and their weights in that sub-region.
When determining the entity types associated with a hypernym to be processed in the knowledge graph, the second determination unit 52 is configured to:
determine all entities corresponding to the hypernym to be processed in the knowledge graph;
determine the entity types associated with each of these entities; and
select the N entity types associated with the largest numbers of entities as the entity types associated with the hypernym to be processed, where N is a preset natural number and N ≥ 1.
Optionally, when computing the entity-type similarity between every two hypernyms to be processed, the computing unit 53 is configured to:
determine the entity types associated with the first hypernym and the entity types associated with the second hypernym of the two hypernyms to be processed;
if the first hypernym and/or the second hypernym is associated with at least two entity types, compute the entity-type similarity of every pair of entity types between the first hypernym and the second hypernym; and
choose the highest similarity value as the final entity-type similarity.
After a group of hypernyms to be processed is aggregated, the aggregation unit 54 is further configured to:
determine the similar text parts between the hypernyms to be processed in the group;
delete the similar text parts;
compute the semantic similarity of the remaining text parts between the hypernyms to be processed and the average number of characters contained in the remaining text parts; and
judge the aggregation of the group of hypernyms to be processed to be valid when it is determined that the semantic similarity of the remaining text parts reaches the third preset threshold and the average number of characters contained in the remaining text parts is below the fourth preset threshold, or that the remaining text parts are empty.
Optionally, before determining the similar text parts between the hypernyms to be processed in the group, the aggregation unit 54 is further configured to remove the preset stop words and redundant characters from the hypernyms to be processed.
The aggregation unit 54 is further configured to use the longest common substring among the hypernyms to be processed in the aggregated group as the name of the group after aggregation.
Based on the same inventive concept, an embodiment of the present invention provides a storage medium storing a program for realizing the hypernym aggregation method. When the program is run by a processor, the following steps are executed:
obtaining multiple hypernyms to be processed, and determining the character vector of each character contained in each hypernym to be processed;
computing the word vector of each hypernym to be processed from the obtained character vectors according to a specified algorithm;
determining the entity types associated with each hypernym to be processed in a knowledge graph;
computing, based on the word vector and the associated entity types of each hypernym to be processed, the word-vector similarity and the entity-type similarity between every two hypernyms to be processed; and
aggregating the corresponding hypernyms to be processed when the word-vector similarity reaches the first preset threshold and the entity-type similarity reaches the second preset threshold.
As shown in Fig. 6, based on the same inventive concept, an embodiment of the present invention provides a computer device, including one or more processors 60 and one or more computer-readable media 61 storing instructions which, when executed by the one or more processors 60, cause the computer device to execute any one of the methods described in the above embodiments.
In conclusion in the embodiment of the present invention, terminal device is calculated according to the word vector that each hypernym includes on each Term vector similarity between the word of position, and entity type associated by the corresponding entity of each hypernym calculate on each Entity type similarity between the word of position, and term vector similarity is reached into the first pre-determined threshold and entity type similarity reaches Each hypernym of second pre-determined threshold is polymerize;Since hypernym is typically to be made of a little several words, using traditional Participle operation can bring larger error and information loss, therefore, in the embodiment of the present invention, the word that includes based on hypernym to The characterized term vector of amount and judged to carry out the similarity between hypernym based on the associated entity type of hypernym, it can be with Short text as similar hypernym is effectively treated, not only can effectively excavate the text key information that hypernym includes, but also can Accurately to depict the type feature of hypernym, at the same not only can to avoid the miscellaneous work amount of artificial design features, but also The generalization ability that model can be enhanced efficiently identifies invalid hypernym, removes the redundant data in hypernym, significantly improves The accuracy of hypernym polymerization.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of a hardware-only embodiment, a software-only embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art may make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
Obviously, those skilled in the art can make various modifications and variations to the embodiments of the present invention without departing from the spirit and scope of the embodiments of the present invention. Thus, if these modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalent technologies, the present invention is also intended to include them.

Claims (12)

1. A hypernym aggregation method, characterized by comprising:
obtaining multiple hypernyms to be processed, and determining the character vector of each character contained in each hypernym to be processed;
computing the word vector of each hypernym to be processed from the obtained character vectors according to a specified algorithm;
determining the entity types associated with each hypernym to be processed in a knowledge graph;
computing, based on the word vector and the associated entity types of each hypernym to be processed, the word-vector similarity and the entity-type similarity between at least every two hypernyms to be processed; and
aggregating the corresponding hypernyms to be processed when the word-vector similarity reaches a first preset threshold and the entity-type similarity reaches a second preset threshold.
2. The method according to claim 1, characterized in that the specified algorithm is the Dense Interpolated Embedding (DIE) algorithm.
3. The method according to claim 1, characterized in that computing the word vector of each hypernym to be processed from the obtained character vectors according to the specified algorithm comprises:
determining at least two preset sub-regions for the hypernym to be processed according to the dimension of the word vector of the hypernym to be processed, wherein each sub-region corresponds to part of the dimensions of the word vector;
computing the region feature of each sub-region based on the character vectors of the hypernym to be processed; and
computing the word vector of the hypernym to be processed from the obtained region features.
4. The method according to claim 3, characterized in that computing the region feature of each sub-region based on the character vectors of the hypernym to be processed comprises performing the following operations for each sub-region:
determining, based on the preset number of sub-regions and the number of character vectors contained in the hypernym to be processed, the weight of each character vector of the hypernym to be processed in the sub-region; and
computing the region feature of the hypernym to be processed in the sub-region from the character vectors and their weights in the sub-region.
5. The method according to claim 1, characterized in that determining the entity types associated with a hypernym to be processed in the knowledge graph comprises:
determining all entities corresponding to the hypernym to be processed in the knowledge graph;
determining the entity types associated with each of the entities; and
selecting the N entity types associated with the largest numbers of entities as the entity types associated with the hypernym to be processed, wherein N is a preset natural number and N ≥ 1.
6. The method according to claim 5, characterized in that computing the entity-type similarity between every two hypernyms to be processed comprises:
determining the entity types associated with the first hypernym and the entity types associated with the second hypernym of the two hypernyms to be processed;
if the first hypernym and/or the second hypernym is associated with at least two entity types, computing the entity-type similarity of every pair of entity types between the first hypernym and the second hypernym; and
taking the highest similarity value as the final entity-type similarity.
7. The method according to any one of claims 1 to 6, characterized by further comprising, after aggregating a group of hypernyms to be processed:
determining the similar text parts between the hypernyms to be processed in the group;
deleting the similar text parts;
computing the semantic similarity of the remaining text parts between the hypernyms to be processed and the average number of characters contained in the remaining text parts; and
judging the aggregation of the group of hypernyms to be processed to be valid when it is determined that the semantic similarity of the remaining text parts reaches a third preset threshold and the average number of characters contained in the remaining text parts is below a fourth preset threshold, or that the remaining text parts are empty.
8. The method according to claim 7, characterized by further comprising, before determining the similar text parts between the hypernyms to be processed in the group:
removing preset stop words and redundant characters from the hypernyms to be processed.
9. The method according to claim 7, characterized by further comprising:
using the longest common substring among the hypernyms to be processed in the aggregated group as the name of the group after aggregation.
10. A hypernym aggregation apparatus, characterized by comprising:
a first determination unit, configured to obtain multiple hypernyms to be processed, determine the character vector of each character contained in each hypernym to be processed, and compute the word vector of each hypernym to be processed from the obtained character vectors according to a specified algorithm;
a second determination unit, configured to determine the entity types associated with each hypernym to be processed in a knowledge graph;
a computing unit, configured to compute, based on the word vector and the associated entity types of each hypernym to be processed, the word-vector similarity and the entity-type similarity between every two hypernyms to be processed; and
an aggregation unit, configured to aggregate the corresponding hypernyms to be processed when the word-vector similarity reaches a first preset threshold and the entity-type similarity reaches a second preset threshold.
11. A storage medium, characterized in that it stores a program for realizing the hypernym aggregation method, and when the program is run by a processor, the following steps are executed:
obtaining multiple hypernyms to be processed, and determining the character vector of each character contained in each hypernym to be processed;
computing the word vector of each hypernym to be processed from the obtained character vectors according to a specified algorithm;
determining the entity types associated with each hypernym to be processed in a knowledge graph;
computing, based on the word vector and the associated entity types of each hypernym to be processed, the word-vector similarity and the entity-type similarity between every two hypernyms to be processed; and
aggregating the corresponding hypernyms to be processed when the word-vector similarity reaches a first preset threshold and the entity-type similarity reaches a second preset threshold.
12. A computer device, characterized by comprising one or more processors and one or more computer-readable media storing instructions which, when executed by the one or more processors, cause the device to execute the method according to any one of claims 1 to 9.
CN201810100677.0A 2018-02-01 2018-02-01 Hypernym aggregation method and device Active CN108415950B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810100677.0A CN108415950B (en) 2018-02-01 2018-02-01 Hypernym aggregation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810100677.0A CN108415950B (en) 2018-02-01 2018-02-01 Hypernym aggregation method and device

Publications (2)

Publication Number Publication Date
CN108415950A true CN108415950A (en) 2018-08-17
CN108415950B CN108415950B (en) 2021-03-23

Family

ID=63126797

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810100677.0A Active CN108415950B (en) 2018-02-01 2018-02-01 Hypernym aggregation method and device

Country Status (1)

Country Link
CN (1) CN108415950B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7251637B1 (en) * 1993-09-20 2007-07-31 Fair Isaac Corporation Context vector generation and retrieval
CN103559234A (en) * 2013-10-24 2014-02-05 北京邮电大学 System and method for automated semantic annotation of RESTful Web services
CN104484461A (en) * 2014-12-29 2015-04-01 北京奇虎科技有限公司 Method and system based on encyclopedia data for classifying entities
CN106919577A (en) * 2015-12-24 2017-07-04 北京奇虎科技有限公司 Based on method, device and search engine that search word scans for recommending
CN106372118A (en) * 2016-08-24 2017-02-01 武汉烽火普天信息技术有限公司 Large-scale media text data-oriented online semantic comprehension search system and method
CN106844658A (en) * 2017-01-23 2017-06-13 中山大学 A kind of Chinese text knowledge mapping method for auto constructing and system
CN107644014A (en) * 2017-09-25 2018-01-30 南京安链数据科技有限公司 A kind of name entity recognition method based on two-way LSTM and CRF

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MOHAMMAD TAHER PILEHVAR 等: ""From senses to texts: An all-in-one graph-based approach for measuring semantic similarity"", 《ARTIFICIAL INTELLIGENCE》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008972A (en) * 2018-11-15 2019-07-12 阿里巴巴集团控股有限公司 Method and apparatus for data enhancing
CN110008972B (en) * 2018-11-15 2023-06-06 创新先进技术有限公司 Method and apparatus for data enhancement
CN109829041A (en) * 2018-12-25 2019-05-31 出门问问信息科技有限公司 Question processing method and device, computer equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN108415950B (en) 2021-03-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant